# MicroFlow: A Pretrained Mixture of Experts Model for Microbial Community Analysis
## Model Description
MicroFlow is a pretrained language model based on the Mixtral architecture, specifically designed for analyzing microbial community composition at the species level. This model has been trained on extensive taxonomic profile data to understand and generate microbial community structures, serving as a foundation for various downstream bioinformatics applications.
## Key Features
### 1. Architecture Design
- Base Architecture: Mixture of Experts (MoE) pretrained model based on Mixtral
- Parameter Scale: Approximately 130 million parameters
- Attention Mechanism: Bidirectional (non-causal) attention implemented via a custom SDPA (Scaled Dot-Product Attention) function with GQA (Grouped-Query Attention) support
- Tokenization: BPE (Byte-Pair Encoding) with a vocabulary of 30,020 tokens optimized for microbial taxonomy
- Position Encoding: RoPE (Rotary Position Embedding) with theta = 1e6
- Expert System: 8 local experts, with 2 experts activated per token
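The hyperparameters above map directly onto fields of `MixtralConfig` in Hugging Face `transformers`. The sketch below is illustrative only: the vocabulary, MoE, and RoPE fields come from the list above, while the hidden sizes and layer counts are placeholder guesses chosen to land near ~130M parameters; the authoritative values ship in the model's `config.json`.

```python
from transformers import MixtralConfig

config = MixtralConfig(
    vocab_size=30_020,             # BPE vocabulary for microbial taxonomy
    num_local_experts=8,           # 8 experts per MoE layer
    num_experts_per_tok=2,         # top-2 routing
    rope_theta=1e6,                # RoPE theta
    max_position_embeddings=8192,  # longest pretraining sequence length
    hidden_size=512,               # placeholder
    num_hidden_layers=12,          # placeholder
    num_attention_heads=8,         # placeholder
    num_key_value_heads=4,         # placeholder (GQA: fewer KV than query heads)
    intermediate_size=768,         # placeholder
)
```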
### 2. Pretraining Strategy
- Pretraining Data: 3,256,608 microbial community samples in Parquet format, augmented by randomly shuffling the order of taxa within each sample
- Taxonomic Level: Species-level profiles (with the 's__' rank prefix removed from taxon names)
- Training Objectives (a data-pipeline sketch follows this list):
  - Masked Language Modeling (15% masking probability)
  - BERT-style pretraining with bidirectional attention
  - Multi-sequence-length pretraining (3,072 and 8,192 tokens)
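A minimal sketch of how such a sample might be prepared, assuming the shipped tokenizer defines a mask token and using the standard `transformers` MLM collator; `zhangchao162/microflow` is a hypothetical repo id and `prepare_sample` is an illustrative helper, not the authors' actual pipeline:

```python
import random
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("zhangchao162/microflow")  # hypothetical repo id

def prepare_sample(taxa):
    """Strip the 's__' rank prefix and shuffle taxon order (the augmentation above)."""
    taxa = [t.removeprefix("s__") for t in taxa]
    random.shuffle(taxa)
    return " ".join(taxa)

sample = ["s__Escherichia coli", "s__Bacteroides fragilis", "s__Faecalibacterium prausnitzii"]
encoded = tokenizer(prepare_sample(sample), truncation=True, max_length=3072)

# 15% masking probability, matching the documented MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
batch = collator([encoded])
```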
**Important:**
This model requires the custom bidirectional attention mechanism to be set up before loading. Follow the setup steps in this order (a minimal sketch appears after the list):
1. Define the custom attention function.
2. Register it.
3. Configure the model with `attn_implementation='custom'`.
4. Load the model weights.
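A sketch of the four steps, assuming a recent `transformers` release with the pluggable attention interface (`ALL_ATTENTION_FUNCTIONS`). The function body mirrors the stock SDPA path with `is_causal=False` for bidirectional attention; `zhangchao162/microflow` is again a hypothetical repo id:

```python
import torch.nn.functional as F
from transformers import AutoConfig, AutoModel
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS

# 1) Define the custom attention function (bidirectional: is_causal=False).
def bidirectional_sdpa(module, query, key, value, attention_mask, scaling=None, dropout=0.0, **kwargs):
    # GQA support: repeat each KV head to match the number of query heads.
    groups = module.num_key_value_groups
    key = key.repeat_interleave(groups, dim=1)
    value = value.repeat_interleave(groups, dim=1)
    out = F.scaled_dot_product_attention(
        query, key, value,
        attn_mask=attention_mask,  # padding mask only; no causal mask
        dropout_p=dropout if module.training else 0.0,
        is_causal=False,           # full bidirectional attention
        scale=scaling,
    )
    return out.transpose(1, 2).contiguous(), None

# 2) Register it under the name the config will reference.
ALL_ATTENTION_FUNCTIONS["custom"] = bidirectional_sdpa

# 3) Configure the model to use it, then 4) load the weights.
config = AutoConfig.from_pretrained("zhangchao162/microflow", attn_implementation="custom")
model = AutoModel.from_pretrained("zhangchao162/microflow", config=config)
```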
Once loaded, embeddings extracted from the model capture deep semantic information about microbial communities and can be used directly for downstream analysis tasks without further training.
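Continuing from the snippets above, one common way to obtain a fixed-size community embedding is mean pooling over the final hidden states (a standard recipe, not necessarily the authors' exact pooling choice):

```python
import torch

community = "Escherichia coli Bacteroides fragilis Faecalibacterium prausnitzii"
inputs = tokenizer(community, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state    # (1, seq_len, hidden_size)

mask = inputs["attention_mask"].unsqueeze(-1)     # ignore padding positions
embedding = (hidden * mask).sum(1) / mask.sum(1)  # (1, hidden_size)
```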
## Citation
If you use this pretrained model in your research, please cite:
```bibtex
@software{microflow2025,
  title  = {MicroFlow: A Pretrained Mixture of Experts Model for Microbial Community Analysis},
  author = {Zhang, Chao},
  year   = {2025},
  url    = {https://github.com/zhangchao162/microflow},
  note   = {Pretrained language model with bidirectional attention and BPE tokenization for species-level microbial community data}
}
```
## Contact
For questions about the pretrained model or fine-tuning guidance, please contact [email protected].