cshearer committed
Commit 8b930e7 · verified · 1 Parent(s): a6276c5

Upload folder using huggingface_hub

Files changed (2):
  1. README.md +170 -74
  2. gene_embeddings_v4.npz +3 -0
README.md CHANGED
@@ -1,111 +1,207 @@
  ---
- language:
- - en
  license: mit
- model-index:
- - name: Marks-lab/LOL-EVE
-   results:
-   - task:
-       type: text-generation
-       name: Genomic Sequence Modeling
-     dataset:
-       type: promoter_sequences
-       name: Mammalian Promoter Sequences
-     metrics:
-     - type: perplexity
-       value: 3.3182
-       name: Validation Perplexity
-   - task:
-       type: variant-effect-prediction
-       name: Promoter Variant Effect Prediction
-     dataset:
-       type: eqtl_benchmark
-       name: Causal eQTL Identification
-     metrics:
-     - type: accuracy
-       value: "State-of-the-art"
-       name: Benchmark Performance
  ---

- # LOL-EVE: Language Of Life across EVolutionary Effects

- ## Model Description
-
- LOL-EVE is a conditional autoregressive transformer model trained on 13.6 million diverse mammalian promoter sequences. It leverages evolutionary information and proximal genetic context to predict indel variant effects in human promoter regions.
-
- ## Architecture

- - **Model Type**: Conditional Autoregressive Transformer
- - **Base Architecture**: CTRL (Conditional Transformer Language Model)
- - **Layers**: 12
- - **Embedding Dimension**: 768
- - **Attention Heads**: 12
- - **Max Sequence Length**: 1007
- - **Position Embedding**: adaptive
-
- ## Training Data

- - **Dataset**: 13.6M mammalian promoter sequences
- - **Species Coverage**: Diverse mammalian species
- - **Sequence Length**: Up to 1000bp promoter regions
- - **Embeddings**: Pre-trained protein embeddings (ESM)

- ## Performance

- The model achieves state-of-the-art performance on three key benchmarks:
- 1. **Causal eQTL Identification**: Identifying causal variants in expression quantitative trait loci
- 2. **Rare Variant Prioritization**: Prioritizing rare variants in human population data
- 3. **TFBS Disruption**: Understanding transcription factor binding site disruptions
  ## Usage

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM

- # Load tokenizer and model
- tokenizer = AutoTokenizer.from_pretrained("Marks-lab/LOL-EVE")
- model = AutoModelForCausalLM.from_pretrained("Marks-lab/LOL-EVE")

- # Example sequence
- sequence = "ATGCTAGCTAGCTAGCTAGCTA"
  inputs = tokenizer(sequence, return_tensors="pt")

- # Generate predictions
  outputs = model(**inputs)
  ```

  ## Citation

- If you use this model in your research, please cite:

  ```bibtex
- @article{loleve2024,
-   title={LOL-EVE: Predicting Promoter Variant Effects from Evolutionary Sequences},
    author={[Authors]},
-   journal={ICLR 2024},
-   year={2024}
  }
  ```

  ## License

- This model is licensed under the MIT License.

- ## Model Details

- - **Training Framework**: PyTorch Lightning
- - **Optimizer**: Adam with cosine annealing
- - **Learning Rate**: 3e-05
- - **Weight Decay**: 0.01
- - **Batch Size**: 16
- - **Checkpoint**: model_epoch_epoch=01-val_all_control_perplexity_epoch=3.3182.ckpt

- ## Limitations

- - Designed specifically for promoter region analysis
- - Requires appropriate genomic context for optimal performance
- - Performance may vary across different species and genomic regions

- ## Contact

- For questions about this model, please open an issue in the repository.

  ---
  license: mit
+ tags:
+ - genomics
+ - dna
+ - language-model
+ - causal-lm
+ - biology
+ - sequence-modeling
+ - variant-prediction
+ - promoter
+ - indel
+ - eqtl
+ pipeline_tag: text-generation
+ library_name: transformers
  ---

+ # LOL-EVE: Language-Optimized Learning for Evolutionary Variant Effects

+ LOL-EVE is a state-of-the-art genomic language model for predicting the effects of DNA sequence variants, particularly in promoter regions. It combines pre-trained protein embeddings with a causal language modeling approach to capture the functional impact of genetic variation.

+ ## Model Description

+ LOL-EVE is a transformer-based model that processes DNA sequences with control codes to predict variant effects. The model was trained on 13.6 million mammalian promoter sequences and demonstrates state-of-the-art performance on promoter indel prediction tasks.

+ ### Key Features

+ - **Large vocabulary**: 39,378 tokens including DNA bases, control codes, and special tokens
+ - **Control code integration**: Incorporates gene, species, and clade information
+ - **Protein context**: Uses pre-trained ESM embeddings for gene-specific understanding
+ - **Flexible input format**: Supports both basic DNA sequences and control code sequences
+ - **Zero-shot prediction**: Enables prediction of indel effects without task-specific training

  ## Usage

+ ### Basic Usage
+
  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM

+ # Load model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained('Marks-lab/LOL-EVE')
+ model = AutoModelForCausalLM.from_pretrained('Marks-lab/LOL-EVE', trust_remote_code=True)

+ # Basic DNA sequence; the three [MASK] tokens are placeholders for the
+ # gene, species, and clade control codes (see Input Format below)
+ sequence = "[MASK] [MASK] [MASK] [SOS]ATGCTAGCTAGCTAGCTAGCTA[EOS]"
  inputs = tokenizer(sequence, return_tensors="pt")
+ outputs = model(**inputs)
+ ```
+
+ ### With Control Codes (Recommended)

+ ```python
+ # Control code sequence (recommended)
+ control_sequence = "brca1 human primate [SOS] ATGCTAGCTAGCTAGCTAGCTA [EOS]"
+ inputs = tokenizer(control_sequence, return_tensors="pt")
  outputs = model(**inputs)
  ```

+ ### Variant Scoring
+
+ ```python
+ import pandas as pd
+ import torch
+
+ def score_variants_hf(variants_df, gene, species, clade):
+     """
+     Score variants using the Hugging Face model.
+
+     Args:
+         variants_df: DataFrame with columns ['sequence', 'variant_sequence']
+         gene: Gene name (e.g., 'brca1')
+         species: Species name (e.g., 'human')
+         clade: Clade information (e.g., 'primate')
+
+     Returns:
+         DataFrame with added 'score' column
+     """
+     scores = []
+
+     for _, row in variants_df.iterrows():
+         # Create control code sequences
+         ref_seq = f"{gene} {species} {clade} [SOS] {row['sequence']} [EOS]"
+         var_seq = f"{gene} {species} {clade} [SOS] {row['variant_sequence']} [EOS]"
+
+         # Tokenize sequences
+         ref_inputs = tokenizer(ref_seq, return_tensors="pt")
+         var_inputs = tokenizer(var_seq, return_tensors="pt")
+
+         # Get model outputs
+         with torch.no_grad():
+             ref_outputs = model(**ref_inputs)
+             var_outputs = model(**var_inputs)
+
+         # Align logits and targets for next-token prediction
+         ref_logits = ref_outputs.logits[0, :-1]  # drop logits after the last token
+         var_logits = var_outputs.logits[0, :-1]
+
+         ref_tokens = ref_inputs['input_ids'][0, 1:]  # shift targets left by one
+         var_tokens = var_inputs['input_ids'][0, 1:]
+
+         # Summed negative log-likelihood of each sequence
+         ref_score = torch.nn.functional.cross_entropy(ref_logits, ref_tokens, reduction='sum')
+         var_score = torch.nn.functional.cross_entropy(var_logits, var_tokens, reduction='sum')
+
+         # Score is the difference (higher = more deleterious)
+         score = (var_score - ref_score).item()
+         scores.append(score)
+
+     variants_df['score'] = scores
+     return variants_df
+
+ # Example usage
+ variants = pd.DataFrame({
+     'sequence': ['ATGCTAGCTAGCTAGCTAGCTA', 'ATGCTAGCTAGCTAGCTAGCTA'],
+     'variant_sequence': ['ATGCTAGCTAGCTAGCTAGCTA', 'ATGCTAGCTAGCTAGCTAGCTA']  # Example variants
+ })
+
+ scored_variants = score_variants_hf(variants, gene='brca1', species='human', clade='primate')
+ print(scored_variants)
+ ```
+
+ ### Input Format
+
+ The model expects sequences in the format:
+ ```
+ gene species clade [SOS] sequence [EOS]
+ ```
+
+ Where:
+ - `gene`: Gene name (e.g., "brca1", "tp53")
+ - `species`: Species name (e.g., "human", "mouse")
+ - `clade`: Clade information (e.g., "primate", "mammal")
+ - `[SOS]`: Start-of-sequence token
+ - `sequence`: DNA sequence (A, T, G, C)
+ - `[EOS]`: End-of-sequence token
+
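+ For convenience, inputs can be assembled with a small helper (a minimal sketch; `format_loleve_input` is illustrative and not part of the released code):
+
+ ```python
+ def format_loleve_input(gene: str, species: str, clade: str, sequence: str) -> str:
+     """Assemble a control-code input string in the format described above."""
+     return f"{gene} {species} {clade} [SOS] {sequence} [EOS]"
+
+ # Example: tokenize a short promoter fragment with TP53 control codes
+ inputs = tokenizer(format_loleve_input("tp53", "human", "primate", "ATGCTAGCTA"),
+                    return_tensors="pt")
+ ```
+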
+ ## Model Architecture
+
+ - **Model type**: Causal Language Model (CTRL-based)
+ - **Layers**: 12 transformer layers
+ - **Hidden size**: 768 dimensions
+ - **Attention heads**: 12
+ - **Vocabulary size**: 39,378 tokens
+ - **Max sequence length**: 1,007 tokens
+ - **Position embeddings**: Adaptive local position embeddings
+
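+ These hyperparameters can be sanity-checked on the loaded config (a sketch; the attribute names follow the CTRL-style configs in `transformers` and are an assumption for this model's custom code):
+
+ ```python
+ # `model` is the LOL-EVE model loaded in the Usage section above.
+ cfg = model.config
+ print(cfg.vocab_size)                 # expected: 39378
+ print(getattr(cfg, "n_layer", None))  # assumed CTRL-style name; expected: 12
+ print(getattr(cfg, "n_head", None))   # assumed CTRL-style name; expected: 12
+ print(getattr(cfg, "n_embd", None))   # assumed CTRL-style name; expected: 768
+ ```
+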
+ ## Training Data
+
+ The model was trained on 13.6 million mammalian promoter sequences, using:
+ - Promoter sequences up to 1,000 base pairs
+ - Gene-specific control codes
+ - Species and clade information
+ - Pre-trained ESM protein embeddings
+
+ ## Performance
+
+ LOL-EVE demonstrates state-of-the-art performance on:
+
+ ### Benchmarks
+ - **Ultra-rare variant prioritization**: Prioritizing ultra-rare variants in gnomAD
+ - **Causal eQTL identification**: Identifying causal expression quantitative trait loci
+ - **Transcription factor binding site disruption**: Analyzing TFBS disruption by indels
+
+ ### Key Results
+ - Superior performance compared to existing methods for promoter indel prediction
+ - Effective zero-shot prediction without task-specific training
+ - Strong cross-species generalization capabilities
+
+ ## Datasets
+
+ - **[LOL-EVE-UltraRare](https://huggingface.co/datasets/Marks-lab/LOL-EVE-UltraRare)** - Ultra-rare variant benchmark dataset
+ - **[LOL-EVE-eQTL_benchmark](https://huggingface.co/datasets/Marks-lab/LOL-EVE-eQTL_benchmark)** - eQTL benchmark dataset
+
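+ Both benchmarks can be pulled with the `datasets` library (a sketch; the split and column names are not documented here, so check the dataset cards):
+
+ ```python
+ from datasets import load_dataset
+
+ ultra_rare = load_dataset("Marks-lab/LOL-EVE-UltraRare")
+ eqtl = load_dataset("Marks-lab/LOL-EVE-eQTL_benchmark")
+ print(ultra_rare)  # inspect the available splits and columns
+ ```
+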
  ## Citation

+ If you use LOL-EVE in your research, please cite:

  ```bibtex
+ @article{loleve2025,
+   title={A Genomic Language Model for Zero-Shot Prediction of Promoter Variant Effects},
    author={[Authors]},
+   journal={MLCB 2025},
+   year={2025}
  }
  ```

  ## License

+ This model is released under the MIT License. See the [LICENSE](https://github.com/Marks-lab/LOL-EVE/blob/main/LICENSE) file for more details.

+ ## Repository

+ - **GitHub**: [https://github.com/Marks-lab/LOL-EVE](https://github.com/Marks-lab/LOL-EVE)
+ - **Paper**: [MLCB 2025](https://github.com/Marks-lab/LOL-EVE) (link to be updated)

+ ## Contact

+ For questions or issues, please contact [your-[email protected]] or open an issue on the [GitHub repository](https://github.com/Marks-lab/LOL-EVE).

+ ## Acknowledgments

+ - Built on the Hugging Face Transformers library
+ - Uses ESM protein embeddings for gene context
+ - Inspired by recent advances in genomic language modeling
+ - Trained on mammalian promoter sequences from multiple species
gene_embeddings_v4.npz ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:682e2c17b7b7709a9ea23283d691dcc155eb0ca24e4eda616d3e18011c693586
+ size 95669748
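
The file above is a Git LFS pointer for the gene embedding archive. A minimal sketch for fetching and inspecting it (the key layout inside the archive is an assumption):

```python
# Download the archive from the Hub and peek at its contents.
from huggingface_hub import hf_hub_download
import numpy as np

path = hf_hub_download("Marks-lab/LOL-EVE", "gene_embeddings_v4.npz")
emb = np.load(path)  # NpzFile; keys are assumed to be gene identifiers
print(len(emb.files), emb.files[:5])
```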