Update README.md

README.md (CHANGED)
@@ -24,12 +24,11 @@ Trained on __20T+ tokens of high-quality data__, together with __supervised fine
 ### Powerful Complex Reasoning Abilities

 We conducted a comprehensive evaluation of Ling-flash-2.0's reasoning capabilities, reporting strong results on representative benchmarks:
-
-
-
-
-
-
+- __Multi-disciplinary knowledge reasoning__: GPQA-Diamond, MMLU-Pro
+- __Advanced mathematical reasoning__: AIME 2025, Omni-MATH, OptMATH (advanced mathematical optimization tasks)
+- __Challenging code generation__: LiveCodeBench v6, CodeForces-Elo
+- __Logical reasoning__: KOR-Bench, ARC-Prize
+- __Key regulated industries (Finance, Healthcare)__: FinanceReasoning, HealthBench
 Compared with __dense models under 40B__ (e.g., Qwen3-32B-Non-Thinking, Seed-OSS-36B-Instruct (think budget=0)) and __larger-activation/total-parameter MoE models__ (e.g., Hunyuan-A13B-Instruct, GPT-OSS-120B/low), __Ling-flash-2.0__ demonstrates stronger complex reasoning power. Moreover, it shows high competitiveness on __creative tasks__ (Creative Writing v3).
 <p align="center">
 <img src="https://mdn.alipayobjects.com/huamei_fi95qp/afts/img/zxAvQ7QtrAwAAAAAQqAAAAgADkZ7AQFr/fmt.webp"/>
@@ -47,8 +46,8 @@ Compared with __dense models under 40B__ (e.g., Qwen3-32B-Non-Thinking, Seed-OSS

 Guided by [Ling Scaling Laws](https://arxiv.org/abs/2507.17702), Ling 2.0 adopts a __1/32 activation-ratio MoE architecture__, optimized across multiple design choices: expert granularity, shared-expert ratio, attention balance, __aux-loss-free + sigmoid routing strategy__, MTP layers, QK-Norm, Partial-RoPE, and more. These refinements enable __small-activation MoE__ models to achieve __7× efficiency gains__ over equivalent dense architectures.
 In other words, with just __6.1B activated parameters (4.8B non-embedding)__, __Ling-flash-2.0__ can match the performance of ~40B dense models. Thanks to its small activation size, it also delivers major inference speed advantages:
-
-
+- On __H20 hardware__, Ling-flash-2.0 achieves __200+ tokens/s__, offering __3× speedups__ compared to 36B dense models in everyday use.
+- With __YaRN extrapolation__, it supports __128K context length__, and as output length grows, its relative speedup can reach __7× or more__.


 <p align="center">
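The 128K figure above relies on rope-scaling extrapolation. As a rough, non-authoritative sketch of how YaRN is typically enabled when serving such a model with vLLM (the repo id, scaling factor, and original context window below are illustrative assumptions, not values documented in this README):

```python
# Hedged sketch: enabling YaRN long-context extrapolation in vLLM.
# The model id, factor, and original_max_position_embeddings are assumptions
# for illustration; check the model card for the documented values.
from vllm import LLM

llm = LLM(
    model="inclusionAI/Ling-flash-2.0",  # assumed repo id under the inclusionAI org
    trust_remote_code=True,
    max_model_len=131072,  # 128K target context
    rope_scaling={
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
    },
)
```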
@@ -78,18 +77,6 @@ Note: If you are interested in previous version, please visit the past model col

 ## Quickstart

-### Convert to safetensors
-
-Models with safetensors format can be downloaded from [HuggingFace](https://huggingface.co/inclusionAI) or [ModelScope](https://modelscope.cn/organization/inclusionAI).
-If you want to train your own model and evaluate it, you can convert the DCP checkpoint produced by training:
-```shell
-python tools/convert_dcp_to_safe_tensors.py --checkpoint-path ${DCP_PATH} --target-path ${SAFETENSORS_PATH}
-```
-
-Currently, BF16 and FP8 formats are supported; use the conversion flags to choose one:
-- `--force-bf16` for BF16 format.
-- `--force-fp8` for FP8 format.
-
 ### 🤗 Hugging Face Transformers

 Here is a code snippet to show you how to use the chat model with `transformers`:
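The snippet itself falls outside the changed hunks shown here. A minimal sketch of standard `transformers` chat usage, assuming the repo id `inclusionAI/Ling-flash-2.0` (inferred from the HuggingFace org linked above) and `trust_remote_code=True`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id; adjust to the model card you actually downloaded.
model_name = "inclusionAI/Ling-flash-2.0"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Build a chat prompt with the model's chat template and generate a reply.
messages = [{"role": "user", "content": "What is the capital of France?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```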
@@ -153,7 +140,7 @@ pip install -e .

 #### Offline Inference:

-```
+```python
 from transformers import AutoTokenizer
 from vllm import LLM, SamplingParams
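The rest of the offline-inference snippet is not part of this hunk. A minimal sketch of how such a snippet typically continues with the vLLM API (the repo id and sampling settings are assumptions, not the README's exact values):

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Assumed repo id; sampling parameters are illustrative.
model_name = "inclusionAI/Ling-flash-2.0"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(model=model_name, trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=1024)

# Format the chat prompt, then run batched offline generation.
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```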
@@ -238,12 +225,13 @@ MTP is supported for base model, and not yet for chat model. You can add paramet
 to start command.

 - Client:
+
 ```shell
 curl -s http://localhost:${PORT}/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
-"""
 ```
+
 More usage examples can be found [here](https://docs.sglang.ai/basic_usage/send_request.html).
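Besides curl, the same OpenAI-compatible endpoint can be called from Python. A minimal sketch, assuming the server started above listens on port 8000 (substitute your actual ${PORT}):

```python
# Hedged sketch: querying the local OpenAI-compatible chat endpoint.
# The port and api_key placeholder are assumptions matching the curl example.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```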