Update README.md
improve model card

README.md
CHANGED
@@ -1,149 +1,180 @@

---
license: apache-2.0
language:
- en
- zh
tags:
- reasoning
---

# Xmodel-2.5: 1.3B Data-Efficient Reasoning Small Language Model

> Achieves **state-of-the-art average reasoning performance** in the 1-2B class (52.49% across 13 benchmarks, second only to Qwen3-1.7B), using only **1.4T training tokens (≈ 4% of Qwen3's)** and **1.3B parameters (≈25% fewer)**.
> A "plug-and-play" agent core built for edge and cost-sensitive scenarios.

---

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "XiaoduoAILab/Xmodel-2.5",
    torch_dtype=torch.bfloat16,  # weights are shipped in bf16
    device_map="auto"
)

prompt = "..."
```

## 🛠️ Model Details

| Config | Value |
|---|---|
| hidden_size | 1536 |
| num_layers | 48 |
| attention_heads | 24 (Q) / 8 (KV, GQA) |
| intermediate_size | 3840 |
| max_position_embeddings | 131072 |
| RoPE base | 500000 |
| Training length | 3712 → 8192 → 16384 (progressive extension) |

```
.
├── README.md           # this file
├── config.json         # standard HuggingFace config
├── tokenizer.json      # DeepSeek-v3 129k tokenizer
├── pytorch_model.bin   # 1.3B weights (bf16)
├── training/           # training logs and loss curves
├── evaluation/         # evaluation scripts for the 13 benchmarks
└── mup/                # μP parameter tables and proxy models
```

---

## 🧪 Reproduction & Fine-tuning

1. Install the environment
```bash
pip install "transformers>=4.40" datasets "flash-attn>=2.5"
```

2. Continued pre-training / long-context extension
```bash
bash scripts/continue_pretrain.sh {data_path} {save_path}
```

3. LoRA SFT
```bash
bash scripts/lora_sft.sh {instruction_data.jsonl}
```

---

## 📈 Training Curves



*Figure: three-stage WSD schedule plus long-context adaptation; WikiText-2 perplexity falls while the 13-benchmark average improves in step.*

---

## 📜 Citation

```bibtex
@misc{
}
```

---

---
language:
- en
- zh
license: apache-2.0
tags:
- reasoning
- small-language-model
- efficient-training
- xmodel
- xiaoduo-ai
pipeline_tag: text-generation
---

# Xmodel-2.5: 1.3B Data-Efficient Reasoning SLM

<h5 align="center">

[🤗 Hugging Face](https://huggingface.co/XiaoduoAILab/Xmodel-2.5)
[arXiv: 2511.19496](https://arxiv.org/abs/2511.19496)
[License: Apache-2.0](https://github.com/XiaoduoAILab/Xmodel-2.5/blob/main/LICENSE)
[GitHub](https://github.com/XiaoduoAILab/Xmodel-2.5)

</h5>

## Model Description

Xmodel-2.5 is a 1.3-billion-parameter small language model designed as a **lightweight agent core** for complex reasoning tasks. It builds on Xmodel-2 with four key upgrades:

1. **Full μP support**: Megatron-LM extended with maximal update parameterization for reliable hyperparameter transfer.
2. **Efficient tokenizer**: the 129K-token DeepSeek-v3 tokenizer, improving compression rate and decoding speed.
3. **FP8 mixed precision**: E4M3 in the forward pass and E5M2 in the backward pass, balancing precision and throughput (see the toy sketch after this list).
4. **Optimizer scheduling**: switching from AdamW to Muon during the decay phase, significantly improving downstream task performance.
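
A minimal toy illustration of the two FP8 formats named in item 3 (this is not the Megatron-LM / Transformer Engine training path; it only shows the precision-vs-range trade-off, and requires PyTorch ≥ 2.1 for the FP8 dtypes):

```python
import torch

x = torch.randn(4, dtype=torch.float32)

# E4M3 (4 exponent / 3 mantissa bits): finer precision, narrower range -- forward pass.
fwd = x.to(torch.float8_e4m3fn)
# E5M2 (5 exponent / 2 mantissa bits): wider range, coarser precision -- gradients.
bwd = x.to(torch.float8_e5m2)

print("fp32:", x)
print("e4m3:", fwd.to(torch.float32))  # small rounding error
print("e5m2:", bwd.to(torch.float32))  # larger rounding error, larger representable range
```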

Trained on only 1.4T tokens, Xmodel-2.5 reaches **52.49%** average accuracy across 13 reasoning benchmarks, ranking second among 1-2B-parameter models, behind only Qwen3-1.7B (56.96%) while using 25.7× fewer training tokens.
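
The 25.7× token-efficiency figure follows directly from the training-token counts reported in the performance table below:

```python
qwen3_tokens = 36e12     # Qwen3-1.7B, from the comparison table
xmodel_tokens = 1.4e12   # Xmodel-2.5
print(f"{qwen3_tokens / xmodel_tokens:.1f}x fewer training tokens")  # -> 25.7x fewer training tokens
```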

## Model Architecture

| Hyperparameter | Value |
|----------------|-------|
| Hidden size | 1536 |
| Intermediate size | 3840 |
| Transformer layers | 48 |
| Attention heads (Q) | 24 |
| KV heads (GQA) | 8 |
| Sequence length | 3712 |
| Max position embeddings | 131072 |
| RoPE base | 500000 |
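
For orientation, the table above corresponds roughly to the following Llama-style Hugging Face config fields. This is only an illustrative summary; the field names are assumptions, and the `config.json` shipped with the model is authoritative:

```python
# Hypothetical mapping of the architecture table onto common HF/Llama-style config keys.
xmodel_2_5_arch = {
    "hidden_size": 1536,
    "intermediate_size": 3840,
    "num_hidden_layers": 48,
    "num_attention_heads": 24,      # query heads
    "num_key_value_heads": 8,       # grouped-query attention (GQA)
    "max_position_embeddings": 131072,
    "rope_theta": 500000.0,
}
```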

## Intended Uses & Limitations

### Intended Uses
- Complex reasoning tasks
- Lightweight AI agent applications
- Educational and research purposes
- Resource-constrained environments

### Limitations
- Limited to 1.3B parameter capacity
- May struggle with highly specialized domains
- Performance may vary on non-English languages

## Training Details

### Training Strategy
- **Three-stage WSD curriculum**: 560k steps, 1.4T tokens in total (a toy schedule sketch follows this list)
- **Warmup phase**: 2k steps with a linear learning-rate increase
- **Stable phase**: 530k steps with a gradually increasing batch size
- **Decay phase**: 20k steps, mixing in 66.9% high-quality SFT data
- **Long-context adaptation**: 10k additional steps for 16K context support
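
A minimal sketch of a warmup-stable-decay (WSD) learning-rate schedule using the step counts listed above; the peak/final learning rates and the linear decay shape are placeholder assumptions, not the values used to train Xmodel-2.5:

```python
def wsd_lr(step: int,
           peak_lr: float = 1e-3,    # placeholder, not the actual training value
           final_lr: float = 1e-5,   # placeholder
           warmup: int = 2_000,
           stable: int = 530_000,
           decay: int = 20_000) -> float:
    """Warmup-Stable-Decay: linear warmup, flat plateau, then decay to final_lr."""
    if step < warmup:                               # warmup phase
        return peak_lr * step / warmup
    if step < warmup + stable:                      # stable phase
        return peak_lr
    t = min(step - warmup - stable, decay) / decay  # decay phase, clipped at the end
    return peak_lr + t * (final_lr - peak_lr)

# Learning rate at a few milestones of the schedule
for s in (0, 2_000, 300_000, 532_000, 552_000):
    print(f"step {s:>7}: lr = {wsd_lr(s):.2e}")
```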

### Key Innovations
- **μP hyperparameter transfer**: hyperparameters tuned on a 20M-parameter proxy model transfer directly to the full model (see the sketch after this list)
- **Optimizer switching**: AdamW → Muon during the decay phase, improving reasoning performance
- **FP8 mixed precision**: the FP8 recipe significantly improves training efficiency
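
A rough sketch of the width-based learning-rate transfer behind μP. The proxy width and base learning rate below are hypothetical, and the 1/width scaling rule for matrix-like parameters (with vector-like parameters kept at the base rate) is the generic μP recipe for Adam-style optimizers, not the authors' exact Megatron-LM implementation:

```python
base_width = 256        # hidden size of a hypothetical small proxy model
target_width = 1536     # hidden size of Xmodel-2.5
base_hidden_lr = 1e-2   # tuned on the proxy (placeholder value)

width_ratio = base_width / target_width
hidden_matrix_lr = base_hidden_lr * width_ratio   # hidden/matrix-like parameters scale ~ 1/width
vector_like_lr = base_hidden_lr                   # embeddings, biases, norms keep the base rate

print(f"width ratio          : {width_ratio:.3f}")
print(f"hidden-matrix LR     : {hidden_matrix_lr:.2e}")
print(f"vector-like param LR : {vector_like_lr:.2e}")
```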

## Performance

### Comprehensive Reasoning Performance

| Model | Parameters | Training Tokens | 13-Task Average |
|-------|------------|-----------------|-----------------|
| Qwen3-1.7B | 1.7B | 36T | 56.96% |
| **Xmodel-2.5** | **1.3B** | **1.4T** | **52.49%** |
| Xmodel-2-1.2B | 1.2B | 1.5T | 50.34% |
| InternLM2.5-1.8B | 1.8B | - | 50.19% |
| MiniCPM-1B | 1B | - | 48.95% |
| SmolLM2-1.7B | 1.7B | 11T | 46.88% |
| Llama-3.2-1B | 1B | 9T | 44.72% |

### Detailed Task Performance

| Task | Xmodel-2.5 | Xmodel-2 | Improvement |
|------|------------|----------|-------------|
| ARC-Challenge | 48.89 | 46.16 | +2.73 |
| ARC-Easy | 76.94 | 76.22 | +0.72 |
| PIQA | 75.95 | 75.14 | +0.81 |
| HellaSwag | 67.24 | 64.05 | +3.19 |
| WinoGrande | 64.64 | 64.25 | +0.39 |
| BBH | 54.58 | 48.90 | +5.68 |
| MMLU | 51.81 | 49.98 | +1.83 |
| GSM8K | 58.98 | 56.56 | +2.42 |
| MATH | 28.94 | 25.64 | +3.30 |
| HumanEval | 28.66 | 29.27 | -0.61 |
| MBPP | 33.00 | 30.80 | +2.20 |
| CMMLU | 47.16 | 44.29 | +2.87 |
| C-Eval | 45.54 | 43.16 | +2.38 |

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "XiaoduoAILab/Xmodel-2.5"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True
)

prompt = "Explain the concept of transfer learning in machine learning."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Sampling-based generation
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens (skip the prompt)
output = tokenizer.decode(
    generated_ids[0][len(model_inputs.input_ids[0]):],
    skip_special_tokens=True
)
print("Generated Response:")
print(output)
```
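
For interactive use, generation can be streamed to stdout with `transformers`' `TextStreamer`, reusing the `model`, `tokenizer`, and `model_inputs` from the snippet above (the sampling settings simply mirror it):

```python
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Tokens are printed as soon as they are generated.
model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id,
    streamer=streamer,
)
```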

## Citation

If you find Xmodel-2.5 useful for your research or applications, please consider citing our work:

```bibtex
@misc{liu2025xmodel25,
      title={Xmodel-2.5: 1.3B Data-Efficient Reasoning SLM},
      author={Yang Liu and Xiaolong Zhong and Ling Jiang},
      year={2025},
      eprint={2511.19496},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.19496},
}
```

## Contact

For questions or suggestions, please contact us through:
- GitHub Issues: [Xmodel-2.5 Issues](https://github.com/XiaoduoAILab/Xmodel-2.5/issues)
- Email: [email protected]

## License

This project is licensed under the Apache-2.0 License. See the [LICENSE](LICENSE) file for details.