Genomic DNA Sequence Transformer
Overview
This model is a BERT-based encoder pre-trained on the human reference genome (GRCh38). It uses k-mer tokenization to learn the underlying semantics of DNA, enabling high accuracy on downstream tasks such as promoter identification, splice-site prediction, and variant effect scoring.
Model Architecture
Based on the DNABERT framework:
- Tokenization: Sequences are converted into overlapping 6-mer tokens (e.g., ATGCGT); see the sketch after this list.
- Pre-training: Masked Language Modeling (MLM) was performed on over 3 billion base pairs.
- Encoding: The bidirectional attention mechanism allows each nucleotide position to attend to the entire sequence context, capturing complex regulatory motifs.
- Objective: Pre-training minimizes the negative log-likelihood of the masked tokens: $\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}})$, where $\mathcal{M}$ is the set of masked positions.
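
The overlapping 6-mer scheme can be illustrated with a minimal sketch. This is not the model's actual preprocessing code; the function name is illustrative, and only the stride-1 windowing described above is assumed:

```python
def seq_to_kmers(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mer tokens (stride 1)."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# A 7-bp sequence yields two overlapping 6-mers.
print(seq_to_kmers("ATGCGTA"))  # ['ATGCGT', 'TGCGTA']
```

Note that with stride 1, a sequence of $n$ base pairs produces $n - k + 1$ tokens, so adjacent tokens share $k - 1$ nucleotides of context.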
Intended Use
- Motif Discovery: Locating transcription factor binding sites.
- Functional Annotation: Predicting the biological function of non-coding regions.
- Comparative Genomics: Evaluating evolutionary conservation at a sequence level.
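
As an illustration of a downstream classification task such as the ones above, the sketch below shows binary promoter scoring with the standard transformers interface. This is a hypothetical usage example, not the official API of this checkpoint: the model identifier `your-org/dnabert-6` is a placeholder, and the space-separated 6-mer input format is an assumption:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder identifier; substitute the actual fine-tuned checkpoint name.
MODEL_ID = "your-org/dnabert-6"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# Assumption: the tokenizer expects space-separated overlapping 6-mers.
sequence = "ATGCGTACCA"
kmers = " ".join(sequence[i:i + 6] for i in range(len(sequence) - 6 + 1))

inputs = tokenizer(kmers, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Index 1 is assumed to be the positive (promoter) class.
prob_promoter = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"P(promoter) = {prob_promoter:.3f}")
```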
Limitations
- Sequence Length: Restricted to 512 tokens (~517 base pairs with overlapping 6-mers), making it unsuitable for whole-chromosome analysis without a sliding window (see the sketch after this list).
- Species Specificity: Performance may vary on non-human genomes (e.g., extremophile bacteria or complex plant genomes) without further fine-tuning.
- Structural Variants: Primarily focused on single-nucleotide patterns rather than large-scale structural rearrangements.
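
To work around the 512-token limit on longer regions, inputs can be chunked with a sliding window before tokenization. A minimal sketch of the windowing logic, assuming stride-1 6-mers (517 bp corresponds to 512 tokens, i.e. 512 + 6 - 1); the stride value is illustrative:

```python
def sliding_windows(sequence: str, window_bp: int = 517, stride_bp: int = 256):
    """Yield (start, window) pairs covering every base at least once."""
    if len(sequence) <= window_bp:
        yield 0, sequence
        return
    for start in range(0, len(sequence) - window_bp + 1, stride_bp):
        yield start, sequence[start:start + window_bp]
    # Emit a final window so the tail of the sequence is not dropped.
    tail = len(sequence) - window_bp
    if tail % stride_bp != 0:
        yield tail, sequence[tail:]
```

Per-window predictions must then be aggregated (e.g., averaged over overlapping regions), which is a design choice left to the downstream task.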