Upload folder using huggingface_hub
- README.md +170 -74
- gene_embeddings_v4.npz +3 -0

README.md
CHANGED
@@ -1,111 +1,207 @@
- language:
- - en
- task:
-   type: variant-effect-prediction
-   name: Promoter Variant Effect Prediction
- dataset:
-   type: eqtl_benchmark
-   name: Causal eQTL Identification
- metrics:
-   - type: accuracy
-     value: "State-of-the-art"
-     name: Benchmark Performance
- # LOL-EVE: Language
- LOL-EVE is a conditional autoregressive transformer model trained on 14.6 million diverse mammalian promoter sequences. It leverages evolutionary information and proximal genetic context to predict indel variant effects in human promoter regions.
- ## Architecture
- - **Base Architecture**: CTRL (Conditional Transformer Language Model)
- - **Layers**: 12
- - **Embedding Dimension**: 768
- - **Attention Heads**: 12
- - **Max Sequence Length**: 1007
- - **Position Embedding**: adaptive
- ## Training Data
- - **Species Coverage**: Diverse mammalian species
- - **Sequence Length**: Up to 1000bp promoter regions
- - **Embeddings**: Pre-trained protein embeddings (ESM)

- - **Learning Rate**: 3e-05
- - **Weight Decay**: 0.01
- - **Batch Size**: 16
- - **Checkpoint**: model_epoch_epoch=01-val_all_control_perplexity_epoch=3.3182.ckpt

- - Requires appropriate genomic context for optimal performance
- - Performance may vary across different species and genomic regions

---
license: mit
tags:
- genomics
- dna
- language-model
- causal-lm
- biology
- sequence-modeling
- variant-prediction
- promoter
- indel
- eqtl
pipeline_tag: text-generation
library_name: transformers
---

# LOL-EVE: Language-Optimized Learning for Evolutionary Variant Effects

LOL-EVE is a state-of-the-art genomic language model designed for predicting the effects of DNA sequence variants, particularly in promoter regions. It combines pre-trained protein embeddings with a causal language modeling approach to understand the functional impact of genetic variations.

## Model Description

LOL-EVE is a transformer-based model that processes DNA sequences with control codes to predict variant effects. The model was trained on 13.6 million mammalian promoter sequences and demonstrates state-of-the-art performance on promoter indel prediction tasks.

### Key Features

- **Large vocabulary**: 39,378 tokens including DNA bases, control codes, and special tokens
- **Control code integration**: Incorporates gene, species, and clade information
- **Protein context**: Uses pre-trained ESM embeddings for gene-specific understanding
- **Flexible input format**: Supports both basic DNA sequences and control code sequences
- **Zero-shot prediction**: Enables prediction of indel effects without task-specific training

## Usage

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('Marks-lab/LOL-EVE')
model = AutoModelForCausalLM.from_pretrained('Marks-lab/LOL-EVE', trust_remote_code=True)

# Basic DNA sequence
sequence = "[MASK] [MASK] [MASK] [SOS]ATGCTAGCTAGCTAGCTAGCTA[EOS]"
inputs = tokenizer(sequence, return_tensors="pt")
outputs = model(**inputs)
```

### With Control Codes (Recommended)

```python
# Control code sequence (recommended)
control_sequence = "brca1 human primate [SOS] ATGCTAGCTAGCTAGCTAGCTA [EOS]"
inputs = tokenizer(control_sequence, return_tensors="pt")
outputs = model(**inputs)
```
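
The causal-LM logits returned by `model(**inputs)` can be turned into a sequence log-likelihood, which is the quantity the variant-scoring example below compares between reference and variant sequences. The snippet below is a minimal sketch, not part of the original model card; it assumes the standard causal-LM alignment in which the logits at position *t* predict the token at position *t + 1*:

```python
import torch

# Minimal sketch: sequence log-likelihood under the model.
# Assumes standard causal-LM alignment: logits at position t predict token t+1.
with torch.no_grad():
    outputs = model(**inputs)

log_probs = torch.log_softmax(outputs.logits[0, :-1], dim=-1)  # [seq_len-1, vocab]
targets = inputs["input_ids"][0, 1:]                            # tokens to be predicted
token_log_likelihood = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

print("sequence log-likelihood:", token_log_likelihood.sum().item())
```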

### Variant Scoring

```python
import pandas as pd
import torch

def score_variants_hf(variants_df, gene, species, clade):
    """
    Score variants using the Hugging Face model.

    Args:
        variants_df: DataFrame with columns ['sequence', 'variant_sequence']
        gene: Gene name (e.g., 'brca1')
        species: Species name (e.g., 'human')
        clade: Clade information (e.g., 'primate')

    Returns:
        DataFrame with added 'score' column
    """
    scores = []

    for _, row in variants_df.iterrows():
        # Create control code sequences
        ref_seq = f"{gene} {species} {clade} [SOS] {row['sequence']} [EOS]"
        var_seq = f"{gene} {species} {clade} [SOS] {row['variant_sequence']} [EOS]"

        # Tokenize sequences
        ref_inputs = tokenizer(ref_seq, return_tensors="pt")
        var_inputs = tokenizer(var_seq, return_tensors="pt")

        # Get model outputs
        with torch.no_grad():
            ref_outputs = model(**ref_inputs)
            var_outputs = model(**var_inputs)

        # Calculate log-likelihood scores
        ref_logits = ref_outputs.logits[0, :-1]  # Exclude last token
        var_logits = var_outputs.logits[0, :-1]

        ref_tokens = ref_inputs['input_ids'][0, 1:]  # Exclude first token
        var_tokens = var_inputs['input_ids'][0, 1:]

        # Calculate sequence likelihood
        ref_score = torch.nn.functional.cross_entropy(ref_logits, ref_tokens, reduction='sum')
        var_score = torch.nn.functional.cross_entropy(var_logits, var_tokens, reduction='sum')

        # Score is the difference (higher = more deleterious)
        score = (var_score - ref_score).item()
        scores.append(score)

    variants_df['score'] = scores
    return variants_df

# Example usage
variants = pd.DataFrame({
    'sequence': ['ATGCTAGCTAGCTAGCTAGCTA', 'ATGCTAGCTAGCTAGCTAGCTA'],
    'variant_sequence': ['ATGCTAGCTAGCTAGCTAGCTA', 'ATGCTAGCTAGCTAGCTAGCTA']  # Example variants
})

scored_variants = score_variants_hf(variants, gene='brca1', species='human', clade='primate')
print(scored_variants)
```

### Input Format

The model expects sequences in the format:
```
gene species clade [SOS] sequence [EOS]
```

Where:
- `gene`: Gene name (e.g., "brca1", "tp53")
- `species`: Species name (e.g., "human", "mouse")
- `clade`: Clade information (e.g., "primate", "mammal")
- `[SOS]`: Start of sequence token
- `sequence`: DNA sequence (A, T, G, C)
- `[EOS]`: End of sequence token

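As a convenience, the control-code prompt can be assembled programmatically. The helper below is an illustrative sketch only; the function name and the validation rule are not part of the model card or an official API:

```python
def build_input(gene: str, species: str, clade: str, dna: str) -> str:
    """Assemble a LOL-EVE control-code prompt (illustrative helper, not an official API)."""
    dna = dna.upper()
    if set(dna) - set("ACGT"):
        raise ValueError("DNA sequence may only contain A, C, G, T")
    return f"{gene} {species} {clade} [SOS] {dna} [EOS]"

prompt = build_input("tp53", "human", "primate", "atgctagctagctagctagcta")
inputs = tokenizer(prompt, return_tensors="pt")
```
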
## Model Architecture

- **Model type**: Causal Language Model (CTRL-based)
- **Layers**: 12 transformer layers
- **Hidden size**: 768 dimensions
- **Attention heads**: 12
- **Vocabulary size**: 39,378 tokens
- **Max sequence length**: 1,007 tokens
- **Position embeddings**: Adaptive local position embeddings

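To verify these numbers against the checkpoint you downloaded, you can inspect the released configuration directly. This sketch prints the whole config and the tokenizer size rather than assuming any particular attribute names in the custom configuration class:

```python
from transformers import AutoConfig, AutoTokenizer

# Inspect the released configuration rather than relying on the table above.
config = AutoConfig.from_pretrained('Marks-lab/LOL-EVE', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('Marks-lab/LOL-EVE')

print(config)          # layer count, hidden size, attention heads, max length
print(len(tokenizer))  # vocabulary size as seen by the tokenizer
```
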
## Training Data

The model was trained on genomic sequences with:
- DNA sequences up to 1000 base pairs
- Gene-specific control codes
- Species and clade information
- Pre-trained ESM protein embeddings
- 13.6 million mammalian promoter sequences

## Performance

LOL-EVE demonstrates state-of-the-art performance on:

### Benchmarks
- **Ultra-rare variant prioritization**: Prioritizing ultra-rare variants in gnomAD
- **Causal eQTL identification**: Identifying causal expression quantitative trait loci
- **Transcription factor binding site disruption**: Analyzing TFBS disruption by indels

### Key Results
- Superior performance compared to existing methods for promoter indel prediction
- Effective zero-shot prediction without task-specific training
- Strong cross-species generalization capabilities

## Datasets

- **[LOL-EVE-UltraRare](https://huggingface.co/datasets/Marks-lab/LOL-EVE-UltraRare)** - Ultra-rare variant benchmark dataset
- **[LOL-EVE-eQTL_benchmark](https://huggingface.co/datasets/Marks-lab/LOL-EVE-eQTL_benchmark)** - eQTL benchmark dataset

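Both benchmark datasets can be pulled from the Hub with the `datasets` library. Split names and column layouts are not documented here, so this sketch simply prints the returned objects for inspection:

```python
from datasets import load_dataset

# Load the benchmark datasets from the Hub; inspect splits/columns before use.
ultra_rare = load_dataset("Marks-lab/LOL-EVE-UltraRare")
eqtl = load_dataset("Marks-lab/LOL-EVE-eQTL_benchmark")

print(ultra_rare)
print(eqtl)
```
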
## Citation

If you use LOL-EVE in your research, please cite:

```bibtex
@article{loleve2025,
  title={A Genomic Language Model for Zero-Shot Prediction of Promoter Variant Effects},
  author={[Authors]},
  journal={MLCB 2025},
  year={2025}
}
```

## License

This model is released under the MIT License. See the [LICENSE](https://github.com/Marks-lab/LOL-EVE/blob/main/LICENSE) file for more details.

## Repository

- **GitHub**: [https://github.com/Marks-lab/LOL-EVE](https://github.com/Marks-lab/LOL-EVE)
- **Paper**: [MLCB 2025](https://github.com/Marks-lab/LOL-EVE) (link to be updated)

## Contact

For questions or issues, please contact [your-[email protected]] or open an issue on the [GitHub repository](https://github.com/Marks-lab/LOL-EVE).

## Acknowledgments

- Built on the Hugging Face Transformers library
- Uses ESM protein embeddings for gene context
- Inspired by recent advances in genomic language modeling
- Trained on mammalian promoter sequences from multiple species

gene_embeddings_v4.npz
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:682e2c17b7b7709a9ea23283d691dcc155eb0ca24e4eda616d3e18011c693586
size 95669748
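
The commit also adds `gene_embeddings_v4.npz` (stored via Git LFS, roughly 96 MB). A minimal way to fetch and inspect it, assuming nothing about the array keys it contains, is:

```python
import numpy as np
from huggingface_hub import hf_hub_download

# Download the LFS-tracked archive from the model repo and list the arrays it contains.
path = hf_hub_download(repo_id="Marks-lab/LOL-EVE", filename="gene_embeddings_v4.npz")
archive = np.load(path)  # add allow_pickle=True only if the archive stores object arrays
print(archive.files)     # stored array names (keys are not documented in this commit)
```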