foamliu committed
Commit b3e530f · verified · 1 Parent(s): 7941b17

Update README.md

improve model card

Files changed (1)
  1. README.md +152 -121
README.md CHANGED
@@ -1,149 +1,180 @@
  ---
- license: apache-2.0
  language:
  - en
  - zh
  tags:
  - reasoning
- - lightweight
- - agent
- - 1.3B
- - SLM
- ---
-
- # Xmodel-2.5: 1.3B Data-Efficient Reasoning Small Language Model
-
- > Delivers **SOTA-level average reasoning performance** in the 1-2B class (52.49% across 13 benchmarks, second only to Qwen3-1.7B), using only **1.4T training tokens (≈4% of Qwen3's)** and **1.3B parameters (25% fewer)**.
- > A plug-and-play agent core built for edge and cost-sensitive scenarios.
-
  ---

- ## 🧠 Key Highlights
-
- | Feature | Value / Description |
- |---|---|
- | **Parameters** | 1.3B (decoder-only, deep-and-narrow) |
- | **Training tokens** | 1.4T (three-stage Warmup-Stable-Decay) |
- | **Context length** | 16k (RoPE base 500k, 131k position embeddings) |
- | **Precision** | FP8 mixed (throughput +30%, no accuracy loss) |
- | **Optimizer** | AdamW for the first 540k steps → Muon for the final 20k steps (+4.58% reasoning average) |
- | **Tokenizer** | DeepSeek-v3 129k (higher compression rate) |
- | **μP transfer** | Hyperparameters transferred from a 26M proxy model to 1.3B with zero re-tuning; training dynamics match |
- | **License** | Apache-2.0 (code, weights, and training logs all released) |
-
- ---
-
- ## 📊 Reasoning Benchmarks (1-shot / few-shot)
-
- | Task | Metric | Xmodel-2.5 | Qwen3-1.7B | MiniCPM-1B |
- |---|---|---|---|---|
- | GSM8k | 5-shot EM | **58.98** | 69.29 | 42.00 |
- | MATH | 4-shot | **28.94** | 35.50 | 12.06 |
- | BBH | 3-shot | **54.58** | 45.23 | 35.45 |
- | MMLU | 5-shot | **51.81** | 60.24 | 48.75 |
- | ARC-c | 25-shot | **48.89** | 53.07 | 45.31 |
- | **13-task average** | — | **52.49** | 56.96 | 48.95 |
-
- > Second in the **1-2B class**, first in reasoning performance per parameter.
-
- ---
-
- ## 🚀 Quick Start
-
  ```python
- from transformers import AutoTokenizer, AutoModelForCausalLM
- import torch
-
- tok = AutoTokenizer.from_pretrained("XiaoduoAILab/Xmodel-2.5", trust_remote_code=True)
  model = AutoModelForCausalLM.from_pretrained(
-     "XiaoduoAILab/Xmodel-2.5",
-     torch_dtype=torch.bfloat16,
-     device_map="auto"
  )

- prompt = "Solve: A bookstore has 240 books. If they sell 15% of them, how many are left?"
- inputs = tok(prompt, return_tensors="pt").to(model.device)
- out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
- print(tok.decode(out[0], skip_special_tokens=True))
- ```
-
- ---
-
- ## 🛠️ Model Details
-
- | Config | Value |
- |---|---|
- | hidden_size | 1536 |
- | num_layers | 48 |
- | attention_heads | 24 (Q) / 8 (KV, GQA) |
- | intermediate_size | 3840 |
- | max_position_embeddings | 131072 |
- | RoPE base | 500k |
- | Training length | 3712 → 8192 → 16384 (progressive extension) |
-
- ---
-
- ## 📁 Repository Contents
-
  ```
- .
- ├── README.md          # this file
- ├── config.json        # standard Hugging Face config
- ├── tokenizer.json     # DeepSeek-v3 129k
- ├── pytorch_model.bin  # 1.3B weights (bf16)
- ├── training/          # training logs and loss curves
- ├── evaluation/        # evaluation scripts for the 13 benchmarks
- └── mup/               # μP parameter tables and proxy models
- ```
-
- ---
-
- ## 🧪 Reproduction & Fine-tuning
-
- 1. Install the environment
- ```bash
- pip install "transformers>=4.40" datasets "flash-attn>=2.5"
- ```
-
- 2. Continued pre-training / long-context extension
- ```bash
- bash scripts/continue_pretrain.sh {data_path} {save_path}
- ```

- 3. Downstream fine-tuning example (LoRA)
- ```bash
- bash scripts/lora_sft.sh {instruction_data.jsonl}
- ```

- > All scripts are compatible with Megatron-LM + Transformer-Engine FP8.
-
- ---
-
- ## 📈 Training Curves
-
- ![WikiText-2 Loss](https://huggingface.co/XiaoduoAILab/Xmodel-2.5/resolve/main/loss_curve.png)
- *Figure: three-stage WSD plus long-context adaptation; WikiText-2 perplexity drops in step with gains on the 13-task average.*
-
- ---
-
- ## 📜 Citation
-
  ```bibtex
- @misc{xmodel25,
-   title={Xmodel-2.5: 1.3B Data-Efficient Reasoning SLM},
-   author={Yang Liu and Xiaolong Zhong and Ling Jiang},
-   year={2025},
-   publisher={Xiaoduo AI Lab},
-   url={https://huggingface.co/XiaoduoAILab/Xmodel-2.5}
  }
  ```

- ---

- ## 🤝 Acknowledgements

- Builds on open-source work including [μP](https://github.com/microsoft/mup), [Muon](https://github.com/KellerJordan/Muon), and [Transformer-Engine](https://github.com/NVIDIA/TransformerEngine); thanks to the community for their contributions.
-
- ---

- 💬 For questions, please post in [Discussions](https://huggingface.co/XiaoduoAILab/Xmodel-2.5/discussions) or open an [Issue](https://github.com/XiaoduoAILab/Xmodel-2.5/issues).
 
  ---
  language:
  - en
  - zh
+ license: apache-2.0
  tags:
  - reasoning
+ - small-language-model
+ - efficient-training
+ - xmodel
+ - xiaoduo-ai
+ pipeline_tag: text-generation
  ---

+ # Xmodel-2.5: 1.3B Data-Efficient Reasoning SLM
+
+ <h5 align="center">
+
+ [![hf_space](https://img.shields.io/badge/🤗-Xiaoduo%20HuggingFace-blue.svg)](https://huggingface.co/XiaoduoAILab/Xmodel-2.5)
+ [![arXiv](https://img.shields.io/badge/Arxiv-2511.19496-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2511.19496)
+ [![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/XiaoduoAILab/Xmodel-2.5/blob/main/LICENSE)
+ [![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/XiaoduoAILab/Xmodel-2.5)
+ [![github](https://img.shields.io/github/stars/XiaoduoAILab/Xmodel-2.5.svg?style=social)](https://github.com/XiaoduoAILab/Xmodel-2.5)
+
+ </h5>
+
+ ## Model Description
+
+ Xmodel-2.5 is a 1.3-billion-parameter small language model designed as a **lightweight agent core** for complex reasoning tasks. It builds on Xmodel-2 with four key upgrades:
+
+ 1. **Full μP Support**: Extended Megatron-LM to support maximal update parameterization for reliable hyperparameter transfer
+ 2. **Efficient Tokenizer**: Adopted the 129K-token DeepSeek-v3 tokenizer for a better compression rate and faster decoding
+ 3. **FP8 Mixed Precision**: Used the E4M3 format for the forward pass and E5M2 for the backward pass to balance precision and throughput (see the sketch after this list)
+ 4. **Optimizer Scheduling**: Switched from AdamW to Muon during the decay phase, significantly improving downstream task performance
+
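+ The two FP8 formats trade mantissa precision against dynamic range, which is why E4M3 is used for forward activations and weights while the wider-range E5M2 is used for gradients. A minimal sketch, assuming a recent PyTorch build with native float8 dtypes (illustrative only, not the actual training recipe):
+
+ ```python
+ import torch
+
+ # Compare the numeric properties of the two FP8 formats mentioned above:
+ # E4M3 keeps more mantissa bits (finer precision), E5M2 keeps more exponent
+ # bits (wider dynamic range, which helps represent gradients).
+ for name, dtype in [("E4M3 (forward)", torch.float8_e4m3fn),
+                     ("E5M2 (backward)", torch.float8_e5m2)]:
+     info = torch.finfo(dtype)
+     print(f"{name}: max={info.max}, smallest normal={info.tiny}, eps={info.eps}")
+ ```
+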
+ Trained on only 1.4T tokens, Xmodel-2.5 achieves a **52.49%** average accuracy across 13 reasoning benchmarks, ranking second among 1-2B parameter models, behind only Qwen3-1.7B (56.96%) while using roughly 25.7x fewer training tokens.
+
+ ## Model Architecture
+
+ | Hyperparameter | Value |
+ |----------------|-------|
+ | Hidden size | 1536 |
+ | Intermediate size | 3840 |
+ | Transformer layers | 48 |
+ | Attention heads (Q) | 24 |
+ | KV heads (GQA) | 8 |
+ | Sequence length | 3712 |
+ | Max position embeddings | 131072 |
+ | RoPE base | 500000 |
+
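+ As a quick sanity check, these values can be compared against the published configuration. A minimal sketch, assuming the repository's `config.json` uses Llama-style attribute names (the actual field names may differ):
+
+ ```python
+ from transformers import AutoConfig
+
+ # Load the published configuration and print the fields listed in the table above.
+ config = AutoConfig.from_pretrained("XiaoduoAILab/Xmodel-2.5", trust_remote_code=True)
+ for attr in ("hidden_size", "intermediate_size", "num_hidden_layers",
+              "num_attention_heads", "num_key_value_heads",
+              "max_position_embeddings", "rope_theta"):
+     print(attr, getattr(config, attr, "n/a"))
+ ```
+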
+ ## Intended Uses & Limitations
+
+ ### Intended Uses
+ - Complex reasoning tasks
+ - Lightweight AI agent applications
+ - Educational and research purposes
+ - Resource-constrained environments
+
+ ### Limitations
+ - Capacity is bounded by the 1.3B parameter budget
+ - May struggle with highly specialized domains
+ - Performance may vary on non-English languages
+
+ ## Training Details
+
+ ### Training Strategy
+ - **Three-stage WSD curriculum**: 560k steps, 1.4T tokens (the schedule is sketched below)
+ - **Warmup phase**: 2k steps, linear learning-rate increase
+ - **Stable phase**: 530k steps, gradually increasing batch size
+ - **Decay phase**: 20k steps, mixing in 66.9% high-quality SFT data
+ - **Long-context adaptation**: 10k additional steps for 16K context support
+
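+ A minimal sketch of a Warmup-Stable-Decay learning-rate schedule using the step counts above; the peak learning rate and the linear decay shape are illustrative assumptions rather than the released recipe:
+
+ ```python
+ def wsd_lr(step, peak_lr=1e-3, warmup=2_000, stable=530_000, decay=20_000):
+     """Warmup-Stable-Decay: linear warmup, flat plateau, then a linear anneal."""
+     if step < warmup:                        # warmup: ramp up to the peak
+         return peak_lr * step / warmup
+     if step < warmup + stable:               # stable: hold the peak
+         return peak_lr
+     if step < warmup + stable + decay:       # decay: anneal toward zero
+         return peak_lr * (1 - (step - warmup - stable) / decay)
+     return 0.0
+
+ # Spot-check a few points across the three phases.
+ for s in (0, 1_000, 2_000, 300_000, 535_000, 551_000):
+     print(s, round(wsd_lr(s), 6))
+ ```
+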
+ ### Key Innovations
+ - **μP hyperparameter transfer**: Hyperparameters transferred directly from a 20M-parameter proxy model to the full model (see the sketch below)
+ - **Optimizer switching**: AdamW → Muon during the decay phase for improved reasoning performance
+ - **FP8 mixed precision**: FP8 formats substantially improve training efficiency
+
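+ A conceptual sketch of the μP transfer rule for hidden-layer learning rates under Adam-style optimizers (scale inversely with width); the proxy width and learning rate below are illustrative assumptions, not the values used for Xmodel-2.5:
+
+ ```python
+ def mup_hidden_lr(proxy_lr: float, proxy_width: int, target_width: int) -> float:
+     """Scale a hidden-layer learning rate tuned on a narrow proxy to a wider model.
+
+     Under μP with Adam-style optimizers, hidden-weight learning rates shrink
+     in proportion to width, so the optimum found on the proxy transfers."""
+     return proxy_lr * proxy_width / target_width
+
+ # Hypothetical: a learning rate tuned on a narrow proxy, reused at hidden size 1536.
+ print(mup_hidden_lr(proxy_lr=1e-2, proxy_width=256, target_width=1536))
+ ```
+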
+ ## Performance
+
+ ### Comprehensive Reasoning Performance
+
+ | Model | Parameters | Training Tokens | 13-Task Average |
+ |-------|------------|-----------------|-----------------|
+ | Qwen3-1.7B | 1.7B | 36T | 56.96% |
+ | **Xmodel-2.5** | **1.3B** | **1.4T** | **52.49%** |
+ | Xmodel-2-1.2B | 1.2B | 1.5T | 50.34% |
+ | InternLM2.5-1.8B | 1.8B | - | 50.19% |
+ | MiniCPM-1B | 1B | - | 48.95% |
+ | SmolLM2-1.7B | 1.7B | 11T | 46.88% |
+ | Llama-3.2-1B | 1B | 9T | 44.72% |
+
+ ### Detailed Task Performance
+
+ | Task | Xmodel-2.5 | Xmodel-2 | Improvement |
+ |------|------------|----------|-------------|
+ | ARC-Challenge | 48.89 | 46.16 | +2.73 |
+ | ARC-Easy | 76.94 | 76.22 | +0.72 |
+ | PIQA | 75.95 | 75.14 | +0.81 |
+ | HellaSwag | 67.24 | 64.05 | +3.19 |
+ | WinoGrande | 64.64 | 64.25 | +0.39 |
+ | BBH | 54.58 | 48.90 | +5.68 |
+ | MMLU | 51.81 | 49.98 | +1.83 |
+ | GSM8k | 58.98 | 56.56 | +2.42 |
+ | MATH | 28.94 | 25.64 | +3.30 |
+ | HumanEval | 28.66 | 29.27 | -0.61 |
+ | MBPP | 33.00 | 30.80 | +2.20 |
+ | CMMLU | 47.16 | 44.29 | +2.87 |
+ | C-Eval | 45.54 | 43.16 | +2.38 |
+
+ ## How to Use

  ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_path = "XiaoduoAILab/Xmodel-2.5"
  model = AutoModelForCausalLM.from_pretrained(
+     model_path,
+     torch_dtype="auto",
+     device_map="auto",
+     trust_remote_code=True
+ )
+ tokenizer = AutoTokenizer.from_pretrained(
+     model_path,
+     trust_remote_code=True
  )

+ # Build a chat-formatted prompt.
+ prompt = "Explain the concept of transfer learning in machine learning."
+ messages = [{"role": "user", "content": prompt}]
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+
+ model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
+
+ # Generation configuration
+ generated_ids = model.generate(
+     **model_inputs,
+     max_new_tokens=512,
+     do_sample=True,
+     top_p=0.9,
+     temperature=0.7,
+     pad_token_id=tokenizer.eos_token_id
+ )
+
+ # Decode only the newly generated tokens.
+ output = tokenizer.decode(
+     generated_ids[0][len(model_inputs.input_ids[0]):],
+     skip_special_tokens=True
+ )
+ print("Generated Response:")
+ print(output)
  ```

+ ## Citation
+
+ If you find Xmodel-2.5 useful for your research or applications, please consider citing our work:
+
  ```bibtex
+ @misc{liu2025xmodel25,
+     title={Xmodel-2.5: 1.3B Data-Efficient Reasoning SLM},
+     author={Yang Liu and Xiaolong Zhong and Ling Jiang},
+     year={2025},
+     eprint={2511.19496},
+     archivePrefix={arXiv},
+     primaryClass={cs.LG},
+     url={https://arxiv.org/abs/2511.19496},
  }
  ```

+ ## Contact
+
+ For questions or suggestions, please contact us through:
+ - GitHub Issues: [Xmodel-2.5 Issues](https://github.com/XiaoduoAILab/Xmodel-2.5/issues)
+ - Email: [email protected]
+
+ ## License
+
+ This project is licensed under the Apache-2.0 License. See the [LICENSE](LICENSE) file for details.