---
license: mit
tags:
- genomics
- dna
- language-model
- causal-lm
- biology
- sequence-modeling
- variant-prediction
- promoter
- indel
- eqtl
pipeline_tag: text-generation
library_name: transformers
---
# LOL-EVE: A Genomic Language Model for Zero-Shot Prediction of Promoter Variant Effects
## Model Description
LOL-EVE is a transformer-based model that processes DNA sequences with control codes to predict variant effects. The model was trained on 13.6 million mammalian promoter sequences and demonstrates state-of-the-art performance on promoter indel prediction tasks.
### Key Features
- **Large vocabulary**: 39,378 tokens including DNA bases, control codes, and special tokens
- **Control code integration**: Incorporates gene, species, and clade information
- **Protein context**: Uses pre-trained ESM embeddings for gene-specific understanding
- **Flexible input format**: Supports both basic DNA sequences and control code sequences
- **Zero-shot prediction**: Enables prediction of indel effects without task-specific training
## Usage
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('Marks-lab/LOL-EVE')
model = AutoModelForCausalLM.from_pretrained('Marks-lab/LOL-EVE', trust_remote_code=True)
# Basic DNA sequence (control-code positions filled with [MASK] tokens)
sequence = "[MASK] [MASK] [MASK] [SOS]ATGCTAGCTAGCTAGCTAGCTA[EOS]"
inputs = tokenizer(sequence, return_tensors="pt")
outputs = model(**inputs)
```
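As a quick sanity check that the model and tokenizer loaded correctly, you can inspect the vocabulary and output shapes. This is a minimal sketch reusing the `inputs` and `outputs` objects from the block above; the exact special-token list depends on the released tokenizer configuration.

```python
# Minimal sanity check (assumes `model`, `tokenizer`, `inputs`, and `outputs` from above)
print(f"Vocabulary size: {tokenizer.vocab_size}")         # reported as 39,378 in this card (added tokens may be counted separately)
print(f"Special tokens:  {tokenizer.all_special_tokens}") # e.g., [SOS], [EOS], [MASK]
print(f"Input IDs shape: {inputs['input_ids'].shape}")    # (batch, sequence_length)
print(f"Logits shape:    {outputs.logits.shape}")         # (batch, sequence_length, vocab_size)
```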
### With Control Codes (Recommended)
```python
# Control code sequence (recommended)
control_sequence = "brca1 human primate [SOS] ATGCTAGCTAGCTAGCTAGCTA [EOS]"
inputs = tokenizer(control_sequence, return_tensors="pt")
outputs = model(**inputs)
```
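Because the model is autoregressive, the control codes condition the probability assigned to every downstream base. The sketch below is illustrative rather than an official API: it computes the log-likelihood of a sequence under a given gene/species/clade context, and assumes the control-code values (e.g., "mouse", "mammal") are in the training vocabulary.

```python
import torch

# Illustrative sketch: log-likelihood of a sequence under its control codes
# (assumes `model` and `tokenizer` are loaded as above)
def sequence_log_likelihood(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    # Standard causal shift: position i predicts token i+1
    log_probs = torch.log_softmax(logits[:-1], dim=-1)
    targets = inputs["input_ids"][0, 1:]
    return log_probs[torch.arange(targets.numel()), targets].sum().item()

print(sequence_log_likelihood("brca1 human primate [SOS] ATGCTAGCTAGCTAGCTAGCTA [EOS]"))
print(sequence_log_likelihood("brca1 mouse mammal [SOS] ATGCTAGCTAGCTAGCTAGCTA [EOS]"))
```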
### Variant Scoring
```python
import pandas as pd
import torch
# Assumes `model` and `tokenizer` are already loaded as shown above
def score_variants_hf(variants_df, gene, species, clade):
    """
    Score variants using the Hugging Face model.

    Args:
        variants_df: DataFrame with columns ['sequence', 'variant_sequence']
        gene: Gene name (e.g., 'brca1')
        species: Species name (e.g., 'human')
        clade: Clade information (e.g., 'primate')

    Returns:
        DataFrame with added 'score' column
    """
    scores = []
    for _, row in variants_df.iterrows():
        # Create control code sequences
        ref_seq = f"{gene} {species} {clade} [SOS] {row['sequence']} [EOS]"
        var_seq = f"{gene} {species} {clade} [SOS] {row['variant_sequence']} [EOS]"

        # Tokenize sequences
        ref_inputs = tokenizer(ref_seq, return_tensors="pt")
        var_inputs = tokenizer(var_seq, return_tensors="pt")

        # Get model outputs
        with torch.no_grad():
            ref_outputs = model(**ref_inputs)
            var_outputs = model(**var_inputs)

        # Shift logits/targets for causal language modeling
        ref_logits = ref_outputs.logits[0, :-1]  # exclude last position
        var_logits = var_outputs.logits[0, :-1]
        ref_tokens = ref_inputs['input_ids'][0, 1:]  # exclude first token
        var_tokens = var_inputs['input_ids'][0, 1:]

        # Summed negative log-likelihood of each sequence
        ref_score = torch.nn.functional.cross_entropy(ref_logits, ref_tokens, reduction='sum')
        var_score = torch.nn.functional.cross_entropy(var_logits, var_tokens, reduction='sum')

        # Score is the difference (higher = more deleterious)
        score = (var_score - ref_score).item()
        scores.append(score)

    variants_df['score'] = scores
    return variants_df


# Example usage (toy sequences; the second row carries a small deletion)
variants = pd.DataFrame({
    'sequence': ['ATGCTAGCTAGCTAGCTAGCTA', 'ATGCTAGCTAGCTAGCTAGCTA'],
    'variant_sequence': ['ATGCTAGCTAGCTAGCTAGCTA', 'ATGCTAGCTAGCTAGCTA']  # 4-bp deletion variant
})
scored_variants = score_variants_hf(variants, gene='brca1', species='human', clade='primate')
print(scored_variants)
```
### Input Format
The model expects sequences in the format:
```
gene species clade [SOS] sequence [EOS]
```
Where:
- `gene`: Gene name (e.g., "brca1", "tp53")
- `species`: Species name (e.g., "human", "mouse")
- `clade`: Clade information (e.g., "primate", "mammal")
- `[SOS]`: Start of sequence token
- `sequence`: DNA sequence (A, T, G, C)
- `[EOS]`: End of sequence token
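For convenience, a small helper can assemble this format from its components. The snippet below is illustrative rather than part of the released package; lowercasing the control codes follows the examples shown in this card.

```python
def format_input(gene, species, clade, dna_sequence):
    """Build a LOL-EVE input string: 'gene species clade [SOS] sequence [EOS]'."""
    return f"{gene.lower()} {species.lower()} {clade.lower()} [SOS] {dna_sequence.upper()} [EOS]"

print(format_input("TP53", "human", "primate", "atgctagctagctagctagcta"))
# tp53 human primate [SOS] ATGCTAGCTAGCTAGCTAGCTA [EOS]
```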
## Model Architecture
- **Model type**: Causal Language Model (CTRL-based)
- **Layers**: 12 transformer layers
- **Hidden size**: 768 dimensions
- **Attention heads**: 12
- **Vocabulary size**: 39,378 tokens
- **Max sequence length**: 1,007 tokens
- **Position embeddings**: Adaptive local position embeddings
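These values can be checked against the published configuration; the exact attribute names come from the custom configuration class shipped with `trust_remote_code`, so printing the whole config object is the safest way to inspect them.

```python
from transformers import AutoConfig

# Loads the model's configuration (custom config class, hence trust_remote_code)
config = AutoConfig.from_pretrained("Marks-lab/LOL-EVE", trust_remote_code=True)
print(config)  # layer count, hidden size, attention heads, vocab size, max positions, etc.
```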
## Training Data
The model was trained on genomic sequences with:
- DNA sequences up to 1000 base pairs
- Gene-specific control codes
- Species and clade information
- Pre-trained ESM protein embeddings
- 13.6 million mammalian promoter sequences
## Performance
LOL-EVE demonstrates state-of-the-art performance on:
### Benchmarks
- **Ultra-rare variant prioritization**: Prioritizing ultra-rare variants in gnomAD
- **Causal eQTL identification**: Identifying causal expression quantitative trait loci
- **Transcription factor binding site disruption**: Analyzing TFBS disruption by indels
## Datasets
- **[LOL-EVE-UltraRare](https://huggingface.co/datasets/Marks-lab/LOL-EVE-UltraRare)** - Ultra-rare variant benchmark dataset
- **[LOL-EVE-eQTL_benchmark](https://huggingface.co/datasets/Marks-lab/LOL-EVE-eQTL_benchmark)** - eQTL benchmark dataset
- **[PromoterZoo Training Data](https://huggingface.co/datasets/Marks-lab/PromoterZoo/blob/main/README.md)** - PromoterZoo Training Data
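The benchmark datasets can be pulled with the `datasets` library. The sketch below assumes the default configuration and split names; check each dataset card for the exact column schema.

```python
from datasets import load_dataset

# Assumes default config/split names; see the dataset cards for specifics
ultra_rare = load_dataset("Marks-lab/LOL-EVE-UltraRare")
eqtl = load_dataset("Marks-lab/LOL-EVE-eQTL_benchmark")
print(ultra_rare)
print(eqtl)
```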
## Citation
If you use LOL-EVE in your research, please cite:
```bibtex
@article{loleve2025,
title={A Genomic Language Model for Zero-Shot Prediction of Promoter Variant Effects},
author={[Authors]},
journal={MLCB 2025},
year={2025}
}
```
## License
This model is released under the MIT License. See the [LICENSE](https://github.com/Marks-lab/LOL-EVE/blob/main/LICENSE) file for more details.
## Repository
- **GitHub**: [https://github.com/debbiemarkslab/LOL-EVE](https://github.com/debbiemarkslab/LOL-EVE)
- **Paper**: [bioRxiv preprint](https://www.biorxiv.org/content/10.1101/2024.11.11.623015v1) (MLCB 2025 version coming soon; link to be updated)
## Contact
For questions or issues, please contact [[email protected]] or open an issue on the [GitHub repository](https://github.com/Marks-lab/LOL-EVE).