---
license: mit
tags:
- genomics
- dna
- language-model
- causal-lm
- biology
- sequence-modeling
- variant-prediction
- promoter
- indel
- eqtl
pipeline_tag: text-generation
library_name: transformers
---

# LOL-EVE: A Genomic Language Model for Zero-Shot Prediction of Promoter Variant Effects

## Model Description

LOL-EVE is a transformer-based causal language model that scores promoter DNA sequences conditioned on control codes (gene, species, and clade) to predict variant effects. It was trained on 13.6 million mammalian promoter sequences and achieves state-of-the-art performance on promoter indel prediction tasks.

### Key Features

- **Large vocabulary**: 39,378 tokens including DNA bases, control codes, and special tokens
- **Control code integration**: Incorporates gene, species, and clade information
- **Protein context**: Uses pre-trained ESM embeddings for gene-specific understanding
- **Flexible input format**: Supports both basic DNA sequences and control code sequences
- **Zero-shot prediction**: Enables prediction of indel effects without task-specific training

## Usage

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('Marks-lab/LOL-EVE')
model = AutoModelForCausalLM.from_pretrained('Marks-lab/LOL-EVE', trust_remote_code=True)

# Basic DNA sequence ([MASK] placeholders stand in where the
# gene/species/clade control codes would otherwise go)
sequence = "[MASK] [MASK] [MASK] [SOS]ATGCTAGCTAGCTAGCTAGCTA[EOS]"
inputs = tokenizer(sequence, return_tensors="pt")
outputs = model(**inputs)
```
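
The model returns standard causal-LM outputs, so the logits can be inspected directly. As a quick sanity check (continuing from the `tokenizer`, `model`, and `outputs` above), you can look at the logits' shape and the greedy next-token prediction at each position:

```python
# Logits have shape [batch, sequence_length, vocab_size]
print(outputs.logits.shape)

# Most likely token at each position (greedy argmax, for inspection only)
predicted_ids = outputs.logits.argmax(dim=-1)
print(tokenizer.convert_ids_to_tokens(predicted_ids[0].tolist()))
```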

### With Control Codes (Recommended)

```python
# Control code sequence (recommended)
control_sequence = "brca1 human primate [SOS] ATGCTAGCTAGCTAGCTAGCTA [EOS]"
inputs = tokenizer(control_sequence, return_tensors="pt")
outputs = model(**inputs)
```

### Variant Scoring

```python
import pandas as pd
import torch

def score_variants_hf(variants_df, gene, species, clade):
    """
    Score variants using the Hugging Face model.
    
    Args:
        variants_df: DataFrame with columns ['sequence', 'variant_sequence']
        gene: Gene name (e.g., 'brca1')
        species: Species name (e.g., 'human')
        clade: Clade information (e.g., 'primate')
    
    Returns:
        DataFrame with added 'score' column
    """
    scores = []
    
    for _, row in variants_df.iterrows():
        # Create control code sequences
        ref_seq = f"{gene} {species} {clade} [SOS] {row['sequence']} [EOS]"
        var_seq = f"{gene} {species} {clade} [SOS] {row['variant_sequence']} [EOS]"
        
        # Tokenize sequences
        ref_inputs = tokenizer(ref_seq, return_tensors="pt")
        var_inputs = tokenizer(var_seq, return_tensors="pt")
        
        # Get model outputs
        with torch.no_grad():
            ref_outputs = model(**ref_inputs)
            var_outputs = model(**var_inputs)
            
            # Calculate log-likelihood scores
            ref_logits = ref_outputs.logits[0, :-1]  # Exclude last token
            var_logits = var_outputs.logits[0, :-1]
            
            ref_tokens = ref_inputs['input_ids'][0, 1:]  # Exclude first token
            var_tokens = var_inputs['input_ids'][0, 1:]
            
            # Calculate sequence likelihood
            ref_score = torch.nn.functional.cross_entropy(ref_logits, ref_tokens, reduction='sum')
            var_score = torch.nn.functional.cross_entropy(var_logits, var_tokens, reduction='sum')
            
            # Score is the difference (higher = more deleterious)
            score = (var_score - ref_score).item()
            scores.append(score)
    
    variants_df['score'] = scores
    return variants_df

# Example usage
variants = pd.DataFrame({
    'sequence': ['ATGCTAGCTAGCTAGCTAGCTA', 'ATGCTAGCTAGCTAGCTAGCTA'],
    'variant_sequence': ['ATGCTAGCTAGCTAGCTA', 'ATGCTAGCTTTTAGCTAGCTAGCTA']  # example 4-bp deletion and 3-bp insertion
})

scored_variants = score_variants_hf(variants, gene='brca1', species='human', clade='primate')
print(scored_variants)
```
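
Note that the scores above are summed over tokens, so reference and variant sequences of different lengths (as with indels) contribute different numbers of terms. If you want scores comparable across lengths, one option, not part of the scoring recipe above, is to normalize by the number of predicted tokens. A minimal sketch (the function name `sequence_nll` and the per-token normalization are ours):

```python
import torch

def sequence_nll(seq, per_token=False):
    """Negative log-likelihood of a formatted sequence under the model.

    If per_token is True, return the mean NLL per predicted token
    (a hypothetical length normalization, not the scoring rule used above).
    """
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, :-1]  # exclude last position
    targets = inputs['input_ids'][0, 1:]         # exclude first token
    nll = torch.nn.functional.cross_entropy(logits, targets, reduction='sum')
    return (nll / targets.numel()).item() if per_token else nll.item()

# Example: per-token NLL of a control-code sequence
print(sequence_nll("brca1 human primate [SOS] ATGCTAGCTAGCTAGCTAGCTA [EOS]", per_token=True))
```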

### Input Format

The model expects sequences in the format:
```
gene species clade [SOS] sequence [EOS]
```

Where:
- `gene`: Gene name (e.g., "brca1", "tp53")
- `species`: Species name (e.g., "human", "mouse")
- `clade`: Clade information (e.g., "primate", "mammal")
- `[SOS]`: Start of sequence token
- `sequence`: DNA sequence (A, T, G, C)
- `[EOS]`: End of sequence token
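
A small helper can make the formatting less error-prone. This is just a convenience wrapper around the format above; the function name `format_input` is ours, not part of the released code:

```python
def format_input(gene: str, species: str, clade: str, sequence: str) -> str:
    """Assemble a control-code input in the expected format."""
    sequence = sequence.upper().strip()
    assert set(sequence) <= set("ATGC"), "sequence must contain only A, T, G, C"
    return f"{gene} {species} {clade} [SOS] {sequence} [EOS]"

print(format_input("tp53", "human", "primate", "atgctagctagcta"))
# tp53 human primate [SOS] ATGCTAGCTAGCTA [EOS]
```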

## Model Architecture

- **Model type**: Causal Language Model (CTRL-based)
- **Layers**: 12 transformer layers
- **Hidden size**: 768 dimensions
- **Attention heads**: 12
- **Vocabulary size**: 39,378 tokens
- **Max sequence length**: 1,007 tokens
- **Position embeddings**: Adaptive local position embeddings
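
These hyperparameters can be verified programmatically from the released config (exact attribute names depend on the custom config class, so printing the whole object is the safest check):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained('Marks-lab/LOL-EVE', trust_remote_code=True)
print(config)  # dumps vocab size, hidden size, layer count, etc.
```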

## Training Data

The model was trained on genomic sequences with:
- DNA sequences up to 1000 base pairs
- Gene-specific control codes
- Species and clade information
- Pre-trained ESM protein embeddings
- 13.6 million mammalian promoter sequences

## Performance

LOL-EVE demonstrates state-of-the-art performance on:

### Benchmarks
- **Ultra-rare variant prioritization**: Prioritizing ultra-rare variants in gnomAD
- **Causal eQTL identification**: Identifying causal expression quantitative trait loci
- **Transcription factor binding site disruption**: Analyzing TFBS disruption by indels


## Datasets

- **[LOL-EVE-UltraRare](https://huggingface.co/datasets/Marks-lab/LOL-EVE-UltraRare)** - Ultra-rare variant benchmark dataset
- **[LOL-EVE-eQTL_benchmark](https://huggingface.co/datasets/Marks-lab/LOL-EVE-eQTL_benchmark)** - eQTL benchmark dataset
- **[PromoterZoo Training Data](https://huggingface.co/datasets/Marks-lab/PromoterZoo/blob/main/README.md)** - PromoterZoo Training Data
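
The benchmark datasets are hosted on the Hugging Face Hub, so they should be loadable with the `datasets` library (assuming the default configurations; column names may differ from the `variants_df` example above):

```python
from datasets import load_dataset

ultra_rare = load_dataset("Marks-lab/LOL-EVE-UltraRare")
print(ultra_rare)

eqtl = load_dataset("Marks-lab/LOL-EVE-eQTL_benchmark")
print(eqtl)
```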

## Citation

If you use LOL-EVE in your research, please cite:

```bibtex
@article{loleve2025,
  title={A Genomic Language Model for Zero-Shot Prediction of Promoter Variant Effects},
  author={[Authors]},
  journal={MLCB 2025},
  year={2025}
}
```

## License

This model is released under the MIT License. See the [LICENSE](https://github.com/Marks-lab/LOL-EVE/blob/main/LICENSE) file for more details.

## Repository

- **GitHub**: [https://github.com/debbiemarkslab/LOL-EVE](https://github.com/debbiemarkslab/LOL-EVE)
- **Paper**: [bioRxiv preprint](https://www.biorxiv.org/content/10.1101/2024.11.11.623015v1) (MLCB 2025 version coming soon; link to be updated)

## Contact

For questions or issues, please contact [[email protected]] or open an issue on the [GitHub repository](https://github.com/Marks-lab/LOL-EVE).