mx1 committed
Commit f9d99bd · verified · 1 Parent(s): 7e0ba32

Update README.md

Files changed (1)
  1. README.md +10 -22
README.md CHANGED
@@ -24,12 +24,11 @@ Trained on __20T+ tokens of high-quality data__, together with __supervised fine
  ### Powerful Complex Reasoning Abilities
  
  We conducted a comprehensive evaluation of Ling-flash-2.0's reasoning capabilities, reporting strong results on representative benchmarks:
- * __Multi-disciplinary knowledge reasoning__: GPQA-Diamond, MMLU-Pro
- * __Advanced mathematical reasoning__: AIME 2025, Omni-MATH, OptMATH (advanced mathematical optimization tasks)
- * __Challenging code generation__: LiveCodeBench v6, CodeForces-Elo
- * __Logical reasoning__: KOR-Bench, ARC-Prize
- * __Key regulated industries (Finance, Healthcare)__: FinanceReasoning, HealthBench
-
+ ● __Multi-disciplinary knowledge reasoning__: GPQA-Diamond, MMLU-Pro
+ ● __Advanced mathematical reasoning__: AIME 2025, Omni-MATH, OptMATH (advanced mathematical optimization tasks)
+ ● __Challenging code generation__: LiveCodeBench v6, CodeForces-Elo
+ ● __Logical reasoning__: KOR-Bench, ARC-Prize
+ ● __Key regulated industries (Finance, Healthcare)__: FinanceReasoning, HealthBench
  Compared with __dense models under 40B__ (e.g., Qwen3-32B-Non-Thinking, Seed-OSS-36B-Instruct (think budget=0)) and __larger-activation/total-parameter MoE models__ (e.g., Hunyuan-A13B-Instruct, GPT-OSS-120B/low), __Ling-flash-2.0__ demonstrates stronger complex reasoning power. Moreover, it shows high competitiveness on __creative tasks__ (Creative Writing v3).
  <p align="center">
  <img src="https://mdn.alipayobjects.com/huamei_fi95qp/afts/img/zxAvQ7QtrAwAAAAAQqAAAAgADkZ7AQFr/fmt.webp"/>
@@ -47,8 +46,8 @@ Compared with __dense models under 40B__ (e.g., Qwen3-32B-Non-Thinking, Seed-OSS
  
  Guided by [Ling Scaling Laws](https://arxiv.org/abs/2507.17702), Ling 2.0 adopts a __1/32 activation-ratio MoE architecture__, optimized across multiple design choices: expert granularity, shared-expert ratio, attention balance, __aux-loss-free + sigmoid routing strategy__, MTP layers, QK-Norm, Partial-RoPE, and more. These refinements enable __small-activation MoE__ models to achieve __7× efficiency gains__ over equivalent dense architectures.
  In other words, with just __6.1B activated parameters (4.8B non-embedding)__, __Ling-flash-2.0__ can match the performance of ~40B dense models. Thanks to its small activation size, it also delivers major inference speed advantages:
- - On __H20 hardware__, Ling-flash-2.0 achieves __200+ tokens/s__, offering __3× speedups__ compared to 36B dense models in everyday use.
- - With __YaRN extrapolation__, it supports __128K context length__, and as output length grows, its relative speedup can reach __7× or more__.
+ ● On __H20 hardware__, Ling-flash-2.0 achieves __200+ tokens/s__, offering __3× speedups__ compared to 36B dense models in everyday use.
+ ● With __YaRN extrapolation__, it supports __128K context length__, and as output length grows, its relative speedup can reach __7× or more__.
  
  
  <p align="center">
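The YaRN bullet above is a capability claim; the README does not show how the 128K window is switched on. As one illustration only, long-context serving is often enabled by overriding `rope_scaling` when the model is loaded. The sketch below assumes a recent vLLM release that supports `hf_overrides`, and the repo id, scaling factor, and base context length are illustrative guesses, not values taken from this diff:

```python
from vllm import LLM

# Hypothetical YaRN context-extension sketch (values are assumptions):
# a 4x factor over a 32K base window would give the advertised 128K context.
llm = LLM(
    model="inclusionAI/Ling-flash-2.0",   # assumed repo id
    trust_remote_code=True,
    max_model_len=131072,                 # 128K tokens
    hf_overrides={
        "rope_scaling": {
            "rope_type": "yarn",
            "factor": 4.0,
            "original_max_position_embeddings": 32768,
        }
    },
)
```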
@@ -78,18 +77,6 @@ Note: If you are interested in previous version, please visit the past model col
  
  ## Quickstart
  
- ### Convert to safetensors
-
- Models with safetensors format can be downloaded from [HuggingFace](https://huggingface.co/inclusionAI) or [ModelScope](https://modelscope.cn/organization/inclusionAI).
- If you want to train your model and eval it, you can convert from dcp produced by training.
- ```shell
- python tools/convert_dcp_to_safe_tensors.py --checkpoint-path ${DCP_PATH} --target-path ${SAFETENSORS_PATH}
- ```
-
- Currently, BF16 and FP8 formats are supported, you can use convert parameter to handle it:
- - `--force-bf16` for BF16 format.
- - `--force-fp8` for FP8 format.
-
  ### 🤗 Hugging Face Transformers
  
  Here is a code snippet to show you how to use the chat model with `transformers`:
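The snippet itself lies outside this hunk, so it does not appear in the diff. For orientation, a minimal sketch of the usual `transformers` chat flow is shown below; the repo id, `trust_remote_code` flag, and generation settings are assumptions rather than values confirmed by this README:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ling-flash-2.0"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",   # pick BF16/FP16 automatically where available
    device_map="auto",
    trust_remote_code=True,
)

# Build a chat-formatted prompt and generate a reply.
messages = [{"role": "user", "content": "What is the capital of France?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```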
@@ -153,7 +140,7 @@ pip install -e .
  
  #### Offline Inference:
  
- ```bash
+ ```python
  from transformers import AutoTokenizer
  from vllm import LLM, SamplingParams
  
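The hunk only reaches the imports of the offline-inference example; the rest of that block is unchanged and therefore not shown. A minimal continuation of the same flow might look like the sketch below, where the repo id and sampling values are assumptions, not the README's own snippet:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "inclusionAI/Ling-flash-2.0"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(model=model_name, trust_remote_code=True)

# Format a single chat turn with the model's chat template, then sample.
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```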
@@ -238,12 +225,13 @@ MTP is supported for base model, and not yet for chat model. You can add paramet
  to start command.
  
  - Client:
+
  ```shell
  curl -s http://localhost:${PORT}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
- """
  ```
+
  More usage can be found [here](https://docs.sglang.ai/basic_usage/send_request.html)
  
  
 