---
license: mit
tags:
- genomics
- dna
- language-model
- causal-lm
- biology
- sequence-modeling
- variant-prediction
- promoter
- indel
- eqtl
pipeline_tag: text-generation
library_name: transformers
---

# LOL-EVE: A Genomic Language Model for Zero-Shot Prediction of Promoter Variant Effects

## Model Description

LOL-EVE is a transformer-based causal language model that scores promoter DNA sequences conditioned on control codes (gene, species, and clade) to predict variant effects. It was trained on 13.6 million mammalian promoter sequences and achieves state-of-the-art performance on promoter indel prediction tasks.

### Key Features

- **Large vocabulary**: 39,378 tokens including DNA bases, control codes, and special tokens
- **Control code integration**: Incorporates gene, species, and clade information
- **Protein context**: Uses pre-trained ESM embeddings for gene-specific understanding
- **Flexible input format**: Supports both basic DNA sequences and control code sequences
- **Zero-shot prediction**: Enables prediction of indel effects without task-specific training

## Usage

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('Marks-lab/LOL-EVE')
model = AutoModelForCausalLM.from_pretrained('Marks-lab/LOL-EVE', trust_remote_code=True)

# Basic DNA sequence ([MASK] placeholders stand in where the
# gene/species/clade control codes would otherwise go)
sequence = "[MASK] [MASK] [MASK] [SOS]ATGCTAGCTAGCTAGCTAGCTA[EOS]"
inputs = tokenizer(sequence, return_tensors="pt")
outputs = model(**inputs)
```
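
The model returns standard causal-LM outputs, so the logits can be inspected directly. As a quick sanity check (continuing from the `tokenizer`, `model`, and `outputs` above), you can look at the logits' shape and the greedy next-token prediction at each position:

```python
# Logits have shape [batch, sequence_length, vocab_size]
print(outputs.logits.shape)

# Most likely token at each position (greedy argmax, for inspection only)
predicted_ids = outputs.logits.argmax(dim=-1)
print(tokenizer.convert_ids_to_tokens(predicted_ids[0].tolist()))
```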

### With Control Codes (Recommended)

```python
# Control code sequence (recommended)
control_sequence = "brca1 human primate [SOS] ATGCTAGCTAGCTAGCTAGCTA [EOS]"
inputs = tokenizer(control_sequence, return_tensors="pt")
outputs = model(**inputs)
```

### Variant Scoring

```python
import pandas as pd
import torch

def score_variants_hf(variants_df, gene, species, clade):
    """
    Score variants using the Hugging Face model.
    
    Args:
        variants_df: DataFrame with columns ['sequence', 'variant_sequence']
        gene: Gene name (e.g., 'brca1')
        species: Species name (e.g., 'human')
        clade: Clade information (e.g., 'primate')
    
    Returns:
        DataFrame with added 'score' column
    """
    scores = []
    
    for _, row in variants_df.iterrows():
        # Create control code sequences
        ref_seq = f"{gene} {species} {clade} [SOS] {row['sequence']} [EOS]"
        var_seq = f"{gene} {species} {clade} [SOS] {row['variant_sequence']} [EOS]"
        
        # Tokenize sequences
        ref_inputs = tokenizer(ref_seq, return_tensors="pt")
        var_inputs = tokenizer(var_seq, return_tensors="pt")
        
        # Get model outputs
        with torch.no_grad():
            ref_outputs = model(**ref_inputs)
            var_outputs = model(**var_inputs)
            
            # Calculate log-likelihood scores
            ref_logits = ref_outputs.logits[0, :-1]  # Exclude last token
            var_logits = var_outputs.logits[0, :-1]
            
            ref_tokens = ref_inputs['input_ids'][0, 1:]  # Exclude first token
            var_tokens = var_inputs['input_ids'][0, 1:]
            
            # Calculate sequence likelihood
            ref_score = torch.nn.functional.cross_entropy(ref_logits, ref_tokens, reduction='sum')
            var_score = torch.nn.functional.cross_entropy(var_logits, var_tokens, reduction='sum')
            
            # Score is the difference (higher = more deleterious)
            score = (var_score - ref_score).item()
            scores.append(score)
    
    variants_df['score'] = scores
    return variants_df

# Example usage
variants = pd.DataFrame({
    'sequence': ['ATGCTAGCTAGCTAGCTAGCTA', 'ATGCTAGCTAGCTAGCTAGCTA'],
    'variant_sequence': ['ATGCTAGCTAGCTAGCTA', 'ATGCTAGCTTTTAGCTAGCTAGCTA']  # example 4-bp deletion and 3-bp insertion
})

scored_variants = score_variants_hf(variants, gene='brca1', species='human', clade='primate')
print(scored_variants)
```
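
Note that the scores above are summed over tokens, so reference and variant sequences of different lengths (as with indels) contribute different numbers of terms. If you want scores comparable across lengths, one option, not part of the scoring recipe above, is to normalize by the number of predicted tokens. A minimal sketch (the function name `sequence_nll` and the per-token normalization are ours):

```python
import torch

def sequence_nll(seq, per_token=False):
    """Negative log-likelihood of a formatted sequence under the model.

    If per_token is True, return the mean NLL per predicted token
    (a hypothetical length normalization, not the scoring rule used above).
    """
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, :-1]  # exclude last position
    targets = inputs['input_ids'][0, 1:]         # exclude first token
    nll = torch.nn.functional.cross_entropy(logits, targets, reduction='sum')
    return (nll / targets.numel()).item() if per_token else nll.item()

# Example: per-token NLL of a control-code sequence
print(sequence_nll("brca1 human primate [SOS] ATGCTAGCTAGCTAGCTAGCTA [EOS]", per_token=True))
```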

### Input Format

The model expects sequences in the format:
```
gene species clade [SOS] sequence [EOS]
```

Where:
- `gene`: Gene name (e.g., "brca1", "tp53")
- `species`: Species name (e.g., "human", "mouse")
- `clade`: Clade information (e.g., "primate", "mammal")
- `[SOS]`: Start of sequence token
- `sequence`: DNA sequence (A, T, G, C)
- `[EOS]`: End of sequence token
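
A small helper can make the formatting less error-prone. This is just a convenience wrapper around the format above; the function name `format_input` is ours, not part of the released code:

```python
def format_input(gene: str, species: str, clade: str, sequence: str) -> str:
    """Assemble a control-code input in the expected format."""
    sequence = sequence.upper().strip()
    assert set(sequence) <= set("ATGC"), "sequence must contain only A, T, G, C"
    return f"{gene} {species} {clade} [SOS] {sequence} [EOS]"

print(format_input("tp53", "human", "primate", "atgctagctagcta"))
# tp53 human primate [SOS] ATGCTAGCTAGCTA [EOS]
```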

## Model Architecture

- **Model type**: Causal Language Model (CTRL-based)
- **Layers**: 12 transformer layers
- **Hidden size**: 768 dimensions
- **Attention heads**: 12
- **Vocabulary size**: 39,378 tokens
- **Max sequence length**: 1,007 tokens
- **Position embeddings**: Adaptive local position embeddings
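
These hyperparameters can be verified programmatically from the released config (exact attribute names depend on the custom config class, so printing the whole object is the safest check):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained('Marks-lab/LOL-EVE', trust_remote_code=True)
print(config)  # dumps vocab size, hidden size, layer count, etc.
```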

## Training Data

The model was trained on genomic sequences with:
- DNA sequences up to 1000 base pairs
- Gene-specific control codes
- Species and clade information
- Pre-trained ESM protein embeddings
- 13.6 million mammalian promoter sequences

## Performance

LOL-EVE demonstrates state-of-the-art performance on:

### Benchmarks
- **Ultra-rare variant prioritization**: Prioritizing ultra-rare variants in gnomAD
- **Causal eQTL identification**: Identifying causal expression quantitative trait loci
- **Transcription factor binding site disruption**: Analyzing TFBS disruption by indels


## Datasets

- **[LOL-EVE-UltraRare](https://huggingface.co/datasets/Marks-lab/LOL-EVE-UltraRare)** - Ultra-rare variant benchmark dataset
- **[LOL-EVE-eQTL_benchmark](https://huggingface.co/datasets/Marks-lab/LOL-EVE-eQTL_benchmark)** - eQTL benchmark dataset
- **[PromoterZoo Training Data](https://huggingface.co/datasets/Marks-lab/PromoterZoo/blob/main/README.md)** - PromoterZoo Training Data
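
The benchmark datasets are hosted on the Hugging Face Hub, so they should be loadable with the `datasets` library (assuming the default configurations; column names may differ from the `variants_df` example above):

```python
from datasets import load_dataset

ultra_rare = load_dataset("Marks-lab/LOL-EVE-UltraRare")
print(ultra_rare)

eqtl = load_dataset("Marks-lab/LOL-EVE-eQTL_benchmark")
print(eqtl)
```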

## Citation

If you use LOL-EVE in your research, please cite:

```bibtex
@article{loleve2025,
  title={A Genomic Language Model for Zero-Shot Prediction of Promoter Variant Effects},
  author={[Authors]},
  journal={MLCB 2025},
  year={2025}
}
```

## License

This model is released under the MIT License. See the [LICENSE](https://github.com/Marks-lab/LOL-EVE/blob/main/LICENSE) file for more details.

## Repository

- **GitHub**: [https://github.com/debbiemarkslab/LOL-EVE](https://github.com/debbiemarkslab/LOL-EVE)
- **Paper**: [bioRxiv preprint](https://www.biorxiv.org/content/10.1101/2024.11.11.623015v1) (MLCB 2025 version coming soon; link to be updated)

## Contact

For questions or issues, please contact [[email protected]] or open an issue on the [GitHub repository](https://github.com/Marks-lab/LOL-EVE).