Update README.md

README.md (CHANGED)
@@ -24,12 +24,11 @@ Trained on __20T+ tokens of high-quality data__, together with __supervised fine
 ### Powerful Complex Reasoning Abilities

 We conducted a comprehensive evaluation of Ling-flash-2.0's reasoning capabilities, reporting strong results on representative benchmarks:
-
-
-
-
-
-
+- __Multi-disciplinary knowledge reasoning__: GPQA-Diamond, MMLU-Pro
+- __Advanced mathematical reasoning__: AIME 2025, Omni-MATH, OptMATH (advanced mathematical optimization tasks)
+- __Challenging code generation__: LiveCodeBench v6, CodeForces-Elo
+- __Logical reasoning__: KOR-Bench, ARC-Prize
+- __Key regulated industries (Finance, Healthcare)__: FinanceReasoning, HealthBench
 Compared with __dense models under 40B__ (e.g., Qwen3-32B-Non-Thinking, Seed-OSS-36B-Instruct (think budget=0)) and __larger-activation/total-parameter MoE models__ (e.g., Hunyuan-A13B-Instruct, GPT-OSS-120B/low), __Ling-flash-2.0__ demonstrates stronger complex reasoning power. Moreover, it shows high competitiveness on __creative tasks__ (Creative Writing v3).
 <p align="center">
 <img src="https://mdn.alipayobjects.com/huamei_fi95qp/afts/img/zxAvQ7QtrAwAAAAAQqAAAAgADkZ7AQFr/fmt.webp"/>
@@ -47,8 +46,8 @@ Compared with __dense models under 40B__ (e.g., Qwen3-32B-Non-Thinking, Seed-OSS

 Guided by [Ling Scaling Laws](https://arxiv.org/abs/2507.17702), Ling 2.0 adopts a __1/32 activation-ratio MoE architecture__, optimized across multiple design choices: expert granularity, shared-expert ratio, attention balance, __aux-loss-free + sigmoid routing strategy__, MTP layers, QK-Norm, Partial-RoPE, and more. These refinements enable __small-activation MoE__ models to achieve __7× efficiency gains__ over equivalent dense architectures.
 In other words, with just __6.1B activated parameters (4.8B non-embedding)__, __Ling-flash-2.0__ can match the performance of ~40B dense models. Thanks to its small activation size, it also delivers major inference speed advantages:
-
-
+- On __H20 hardware__, Ling-flash-2.0 achieves __200+ tokens/s__, offering __3× speedups__ compared to 36B dense models in everyday use.
+- With __YaRN extrapolation__, it supports __128K context length__, and as output length grows, its relative speedup can reach __7× or more__.


 <p align="center">
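The 128K figure above relies on rope-scaling extrapolation. As a rough, non-authoritative sketch of how YaRN is typically enabled when serving such a model with vLLM (the repo id, scaling factor, and original context window below are illustrative assumptions, not values documented in this README):

```python
# Hedged sketch: enabling YaRN long-context extrapolation in vLLM.
# The model id, factor, and original_max_position_embeddings are assumptions
# for illustration; check the model card for the documented values.
from vllm import LLM

llm = LLM(
    model="inclusionAI/Ling-flash-2.0",  # assumed repo id under the inclusionAI org
    trust_remote_code=True,
    max_model_len=131072,  # 128K target context
    rope_scaling={
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
    },
)
```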
@@ -78,18 +77,6 @@ Note: If you are interested in previous version, please visit the past model col

 ## Quickstart

-### Convert to safetensors
-
-Models with safetensors format can be downloaded from [HuggingFace](https://huggingface.co/inclusionAI) or [ModelScope](https://modelscope.cn/organization/inclusionAI).
-If you want to train your own model and evaluate it, you can convert the DCP checkpoint produced by training:
-```shell
-python tools/convert_dcp_to_safe_tensors.py --checkpoint-path ${DCP_PATH} --target-path ${SAFETENSORS_PATH}
-```
-
-Currently, BF16 and FP8 formats are supported; use the conversion flags to choose one:
-- `--force-bf16` for BF16 format.
-- `--force-fp8` for FP8 format.
-
 ### 🤗 Hugging Face Transformers

 Here is a code snippet to show you how to use the chat model with `transformers`:
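The snippet itself falls outside the changed hunks shown here. A minimal sketch of standard `transformers` chat usage, assuming the repo id `inclusionAI/Ling-flash-2.0` (inferred from the HuggingFace org linked above) and `trust_remote_code=True`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id; adjust to the model card you actually downloaded.
model_name = "inclusionAI/Ling-flash-2.0"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Build a chat prompt with the model's chat template and generate a reply.
messages = [{"role": "user", "content": "What is the capital of France?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```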
@@ -153,7 +140,7 @@ pip install -e .

 #### Offline Inference:

-```
+```python
 from transformers import AutoTokenizer
 from vllm import LLM, SamplingParams
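The rest of the offline-inference snippet is not part of this hunk. A minimal sketch of how such a snippet typically continues with the vLLM API (the repo id and sampling settings are assumptions, not the README's exact values):

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Assumed repo id; sampling parameters are illustrative.
model_name = "inclusionAI/Ling-flash-2.0"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(model=model_name, trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=1024)

# Format the chat prompt, then run batched offline generation.
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```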
@@ -238,12 +225,13 @@ MTP is supported for base model, and not yet for chat model. You can add paramet
 to start command.

 - Client:
+
 ```shell
 curl -s http://localhost:${PORT}/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
-"""
 ```
+
 More usage examples can be found [here](https://docs.sglang.ai/basic_usage/send_request.html).
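Besides curl, the same OpenAI-compatible endpoint can be called from Python. A minimal sketch, assuming the server started above listens on port 8000 (substitute your actual ${PORT}):

```python
# Hedged sketch: querying the local OpenAI-compatible chat endpoint.
# The port and api_key placeholder are assumptions matching the curl example.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```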