## Table of Contents
- [Latest Updates](#latest-updates)
- [Key Features](#key-features)
- [Supported Models](#supported-models)
- [How to Use](#how-to-use)
- [Install AngelSlim](#install-angelslim)
- [Quick Start](#quick-start)
- [Deployment & Evaluation](#deployment--evaluation)
- [Benchmark](#benchmark)
- [License](#license)
- [Citation](#citation)
- [Technical Discussion](#technical-discussion)
## 📣Latest Updates
- [25/11/05] We have released v0.2, which adds quantization support for new models such as `GLM-4.6`, `Qwen3-VL`, and `Qwen3-Omni`, open-sources the Eagle3 speculative decoding training framework, and updates the diffusion model quantization tools.
- [25/09/30] We have released **SpecExit**, the reasoning early-exit algorithm: [[Paper]](http://arxiv.org/abs/2509.24248) | [[Docs]](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/spec_exit.html) | [[vLLM Code]](https://github.com/vllm-project/vllm/pull/27192)🔥🔥🔥
- [25/09/26] We have released **TEQUILA**, the ternary quantization algorithm [[Paper]](https://arxiv.org/abs/2509.23809) | [[Code]](https://github.com/Tencent/AngelSlim/tree/tequila/TernaryQuant)🔥🔥🔥
- [25/09/24] We now support NVFP4 PTQ quantization for the Qwen3 series models. We also open-source [Qwen3-32B-NVFP4](https://huggingface.co/AngelSlim/Qwen3-32B_nvfp4) and [Qwen3-235B-A22B-NVFP4](https://huggingface.co/AngelSlim/Qwen3-235B-A22B_nvfp4) weights.
- [25/09/01] We now support FP8 quantization of the [Hunyuan-MT-7B](https://huggingface.co/tencent/Hunyuan-MT-7B-fp8) translation model, Torch inference and benchmark evaluation for Eagle3, quantization and cache support for [FLUX](https://github.com/Tencent/AngelSlim/tree/main/configs/flux), and quantization for [Seed-OSS](https://github.com/Tencent/AngelSlim/tree/main/configs/seed_oss).
- [25/08/06] We now support quantization for `Hunyuan 0.5B/1.8B/4B/7B` and the multimodal `Qwen2.5VL 3B/7B/32B/72B` models, including `FP8/INT4` algorithms, as well as quantization for `DeepSeek-R1/V3` and `Kimi-K2` with `FP8-Static` and `W4A8-FP8` algorithms. We also open-source Eagle3 model weights for the `Hunyuan 1.8B/4B/7B` series.
- [25/07/04] We now support quantization for `Hunyuan/Qwen2.5/Qwen3/DeepSeek-R1-Distill-Qwen` and other models, including `INT8/FP8/INT4` algorithms. We also open-source Eagle3 model weights for the `Qwen3` series.
Coming soon:
- [ ] Diffusion model compression support.
- [ ] Release of a new speculative sampling algorithm.
## 🌟Key Features
- **Highly Integrated**: This toolkit integrates mainstream compression algorithms into a unified framework, offering developers one-click access with exceptional ease of use.
- **Continuous Innovation**: Beyond integrating widely-used industry algorithms, we are continuously researching better compression algorithms, which will be gradually open-sourced in the future.
- **Performance-Driven**: We continuously optimize end-to-end performance in model compression workflows and algorithm deployment, such as enabling quantization of models like Qwen3-235B and DeepSeek-R1 on a single GPU.
## 💼Supported Models
### Quantization
The following LLMs are currently supported, including Hunyuan-Dense, Hunyuan-MoE, Qwen3-Dense, Qwen3-MoE, Qwen2.5, DeepSeek-R1 distilled Qwen models, and QwQ:
| Model | FP8-Dynamic | FP8-Static | INT8-Dynamic | INT4-GPTQ | INT4-AWQ |
| --------------------------------------------------------------------------------------------------------------------------- | ----------- | ---------- | ------------ | --------- | -------- |
| [Hunyuan-Dense](https://huggingface.co/collections/tencent/hunyuan-dense-model-6890632cda26b19119c9c5e7) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Hunyuan-MoE](https://huggingface.co/collections/tencent/hunyuan-a13b-685ec38e5b46321e3ea7c4be) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen3-Dense](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen3-MoE](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen2.5](https://huggingface.co/collections/AngelSlim/qwen2-25-quant-68652d6cbdf5c0d4b1c4499a) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [DeepSeek-R1-Distill-Qwen](https://huggingface.co/collections/AngelSlim/deepseek-r1-distill-quant-68652f16a9c206b030b05f7f) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [QwQ](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
### Speculative Decoding
#### Eagle3
Eagle3 weights are now available for the Qwen3 and Hunyuan series models.
| Qwen3 Models | Hunyuan Models |
| ----------|----------|
| ✅ [Qwen3-1.7B](https://huggingface.co/AngelSlim/Qwen3-1.7B_eagle3) |✅ [Hunyuan-1.8B-Instruct](https://huggingface.co/AngelSlim/Hunyuan-1.8B-Instruct_eagle3) |
| ✅ [Qwen3-4B](https://huggingface.co/AngelSlim/Qwen3-4B_eagle3) |✅ [Hunyuan-4B-Instruct](https://huggingface.co/AngelSlim/Hunyuan-4B-Instruct_eagle3) |
| ✅ [Qwen3-8B](https://huggingface.co/AngelSlim/Qwen3-8B_eagle3) |✅ [Hunyuan-7B-Instruct](https://huggingface.co/AngelSlim/Hunyuan-7B-Instruct_eagle3) |
| ✅ [Qwen3-14B](https://huggingface.co/AngelSlim/Qwen3-14B_eagle3) | |
| ✅ [Qwen3-32B](https://huggingface.co/AngelSlim/Qwen3-32B_eagle3) | |
| ✅ [Qwen3-30B-A3B](https://huggingface.co/AngelSlim/Qwen3-a3B_eagle3) | |
## 🛎️How to Use
### Install AngelSlim
We recommend using `pip` to install the latest stable version of `AngelSlim`:
```shell
pip install angelslim
```
Alternatively, you can clone the repository and install from source:
```shell
git clone https://github.com/Tencent/AngelSlim.git
cd AngelSlim && python setup.py install
```
For more detailed installation instructions, please refer to the [Installation Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/installation.html).
### Quick Start
#### Quantization
After installing `AngelSlim`, you can quickly start by running the following script to perform static `FP8` quantization on the `Qwen3-1.7B` model:
* One-click Start
```shell
python3 tools/run.py -c configs/qwen3/fp8_static/qwen3-1_7b_fp8_static.yaml
```
This example loads the Hugging Face model, performs activation calibration using the `dataset` specified in the config file, and saves the quantized model weights.
* Code-based Start
To perform dynamic `FP8` quantization on `Qwen3-1.7B`:
```python
from angelslim.engine import Engine
slim_engine = Engine()
# Prepare model
slim_engine.prepare_model(model_name="Qwen", model_path="Qwen/Qwen3-1.7B")
# Initialize compressor
slim_engine.prepare_compressor("PTQ", default_method="fp8_dynamic")
# Compress model
slim_engine.run()
# Save compressed model
slim_engine.save("./output")
```
For more details, please refer to the [Quick Start Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/quickstrat.html).
#### Speculative Decoding
##### Eagle3 PyTorch Performance Testing
After installing `AngelSlim`, you can quickly start Eagle3 PyTorch performance testing with the following script:
```bash
python3 tools/spec_benchmark.py \
--base-model-path /path/to/base/model \
--eagle-model-path /path/to/eagle/model \
--model-id your_model_id \
--mode both
```
For more details, please refer to the [Quick Start Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/quickstrat.html).
### Deployment & Evaluation
#### 1. Offline Inference
If you need to load a quantized model via `transformers`, set `deploy_backend: huggingface` in the `global` section of the config before quantizing the model, or manually rename the `ignored_layers` field to `ignore` in the `config.json` file located in the quantized model's output directory.
To test offline inference with a quantized model loaded via `transformers`, run the following command:
```shell
python deploy/offline.py $MODEL_PATH
```
Where `MODEL_PATH` is the path to the quantized model output.
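A quantized checkpoint exported with `deploy_backend: huggingface` can also be loaded directly with `transformers`. The snippet below is a minimal, illustrative sketch of such an offline test; the model path, prompt, and generation settings are placeholders and do not reproduce the exact logic of `deploy/offline.py`:
```python
# Minimal offline smoke test with transformers. MODEL_PATH points to the
# quantized model output directory; prompt and generation settings are
# illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "./output"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto")

inputs = tokenizer("Briefly explain FP8 quantization.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```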
#### 2. API Service Deployment
After specifying the quantized model path `MODEL_PATH`, you can deploy an OpenAI-compatible API service with the following LLM inference frameworks:
**vLLM**
Use the following script to launch a [vLLM](https://github.com/vllm-project/vllm) server; recommended version `vllm>=0.8.5.post1`. For MoE INT8 quantized models, `vllm>=0.9.0` is required.
```shell
bash deploy/run_vllm.sh $MODEL_PATH
```
**SGLang**
Use the following script to launch an [SGLang](https://github.com/sgl-project/sglang) server; recommended version `sglang>=0.4.6.post1`.
```shell
bash deploy/run_sglang.sh $MODEL_PATH
```
#### 3. Service Invocation
Send requests in [OpenAI's API format](https://platform.openai.com/docs/api-reference/introduction):
```shell
bash deploy/openai.sh $MODEL_PATH
```
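You can also call the service programmatically with the official `openai` Python client. The base URL, port, and served model name below are assumptions; adjust them to match the arguments used when launching the vLLM or SGLang server:
```python
# Query the OpenAI-compatible endpoint started by deploy/run_vllm.sh or
# deploy/run_sglang.sh. Host, port, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MODEL_PATH",  # the served model name reported by the server
    messages=[{"role": "user", "content": "Give a one-sentence summary of model quantization."}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```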
#### 4. Performance Evaluation
Evaluate the performance of the quantized model using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness); recommended version `lm-eval>=0.4.8`:
```shell
bash deploy/lm_eval.sh $MODEL_PATH
```
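If you prefer to drive the harness from Python rather than the shell wrapper, the sketch below uses the lm-evaluation-harness Python API; the task list, batch size, and model path are illustrative, and `deploy/lm_eval.sh` wraps the same tool:
```python
# Evaluate a quantized checkpoint with the lm-evaluation-harness Python API
# (lm-eval>=0.4.8). Task list, batch size, and model path are illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face backend
    model_args="pretrained=./output,dtype=auto",  # quantized model output dir
    tasks=["gsm8k", "ceval-valid"],
    batch_size=8,
)
print(results["results"])
```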
For more details, please refer to the [Deployment Documentation](https://angelslim.readthedocs.io/zh-cn/latest/deployment/deploy.html).
## 📈 Benchmark
### (1) Quantization
The performance test results for selected models are shown below. For the complete benchmark, refer to the [Benchmark documentation](https://angelslim.readthedocs.io/zh-cn/latest/performance/quantization/benchmarks.html).
#### Hunyuan Series Models
Benchmark results for the `Hunyuan-Instruct` models with `FP8`, `INT4-AWQ`, and `INT4-GPTQ` quantization algorithms on datasets including `OlympiadBench`, `AIME 2024`, `DROP`, and `GPQA-Diamond`:
| Model | Quantization | OlympiadBench | AIME 2024 | DROP | GPQA-Diamond |
| ----- | ------------ | ------------- | --------- | ---- | ------------ |
| Hunyuan-A13B-Instruct | BF16 | 82.7 | 87.30 | 91.1 | 71.2 |
| | FP8-Static | 83.0 | 86.7 | 91.1 | - |
| | Int4-GPTQ | 82.7 | 86.7 | 91.1 | - |
| | Int4-AWQ | 82.6 | 85.6 | 91.0 | - |
| Hunyuan-7B-Instruct | BF16 | 76.5 | 81.1 | 85.9 | 60.1 |
| | FP8-Static | 76.6 | 80.9 | 86.0 | 60.1 |
| | Int4-GPTQ | 76.2 | 81.0 | 85.7 | 60.0 |
| | Int4-AWQ | 76.4 | 80.9 | 85.9 | 60.1 |
| Hunyuan-4B-Instruct | BF16 | 73.1 | 78.3 | 78.2 | 61.1 |
| | FP8-Static | 73.1 | 76.6 | 78.3 | 60.2 |
| | Int4-GPTQ | 72.9 | - | 78.1 | 58.1 |
| | Int4-AWQ | 72.8 | - | 78.2 | - |
| Hunyuan-1.8B-Instruct | BF16 | 63.4 | 56.7 | 76.7 | 47.2 |
| | FP8-Static | 62.5 | 55.2 | 75.1 | 47.7 |
| | Int4-GPTQ | 60.9 | - | 73.0 | 44.4 |
| | Int4-AWQ | 61.7 | - | 71.7 | 43.6 |
| Hunyuan-0.5B-Instruct | BF16 | 29.6 | 17.2 | 52.8 | 23.3 |
| | FP8-Static | 29.6 | 17.2 | 51.6 | 22.5 |
| | Int4-GPTQ | 26.8 | - | 50.9 | 23.3 |
| | Int4-AWQ | 26.3 | - | 48.9 | 23.3 |
#### Qwen3 Series Models
Benchmark results for Qwen3 series models with `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU`, `GSM8K`, and `HUMANEVAL`:
| Model | Quantization | CEVAL | MMLU | GSM8K | HUMANEVAL |
| ----- | ------------ | ----- | ---- | ----- | --------- |
| Qwen3-0.6B | BF16 | 45.84 | 47.21 | 42.99 | 19.51 |
| | FP8-Static | 45.99 | 46.87 | 38.06 | 18.90 |
| | FP8-Dynamic | 45.99 | 46.93 | 38.29 | 20.73 |
| | INT8-Dynamic | 45.17 | 46.95 | 41.17 | 21.34 |
| Qwen3-8B | BF16 | 79.27 | 74.78 | 87.79 | 63.41 |
| | FP8-Static | 78.23 | 74.79 | 86.96 | 62.20 |
| | FP8-Dynamic | 78.45 | 74.75 | 87.64 | 62.80 |
| | INT8-Dynamic | 78.01 | 74.84 | 86.96 | 67.07 |
| | INT4-GPTQ | 77.19 | 73.26 | 86.43 | 62.20 |
| | INT4-AWQ | 76.15 | 73.59 | 86.96 | 63.41 |
| Qwen3-14B | BF16 | 83.06 | 78.90 | 88.40 | 55.49 |
| | FP8-Static | 82.62 | 78.57 | 89.46 | 57.32 |
| | FP8-Dynamic | 82.24 | 78.92 | 88.32 | 52.44 |
| | INT8-Dynamic | 81.87 | 78.13 | 86.28 | 56.10 |
| | INT4-GPTQ | 81.05 | 78.02 | 87.34 | 57.93 |
| | INT4-AWQ | 82.02 | 77.68 | 84.23 | 61.59 |
| Qwen3-32B | BF16 | 86.55 | 82.00 | 74.53 | 37.80 |
| | FP8-Static | 86.92 | 81.78 | 70.20 | 39.63 |
| | FP8-Dynamic | 86.55 | 81.89 | 70.43 | 38.41 |
| | INT4-GPTQ | 86.18 | 81.01 | - | 43.29 |
| | INT4-AWQ | 86.18 | 81.54 | - | 36.59 |
| Qwen3-30B-A3B | BF16 | 83.66 | 79.36 | 89.99 | 31.71 |
| | FP8-Static | 83.95 | 79.47 | 89.01 | 31.10 |
| | FP8-Dynamic | 84.10 | 79.40 | 89.16 | 32.93 |
| | INT8-Dynamic | 83.36 | 79.48 | 89.16 | 34.15 |
| Qwen3-235B-A22B | BF16 | 89.60 | 86.28 | 85.29 | 27.44 |
| | FP8-Static | 89.67 | 86.19 | 86.96 | 27.44 |
| | FP8-Dynamic | 89.67 | 86.18 | 85.22 | 28.05 |
| | INT8-Dynamic | 88.93 | 86.20 | 86.20 | 23.78 |
| QwQ-32B | BF16 | 85.74 | 82.03 | 73.31 | 42.68 |
| | FP8-Static | 85.44 | 81.91 | 75.36 | 42.68 |
| | FP8-Dynamic | 85.07 | 81.93 | 75.66 | 42.07 |
| | INT4-GPTQ | 84.03 | 81.26 | 68.23 | 45.73 |
| | INT4-AWQ | 83.58 | 81.01 | 68.69 | 43.29 |
#### Qwen2.5VL Series Models
Benchmark results for Qwen2.5VL series models with `BF16`, `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `MMMU_VAL`, `DocVQA_VAL`, and `ChartQA_TEST`:
| Model | Quantization | MMMU_VAL | DocVQA_VAL | ChartQA_TEST |
| ----- | ------------ | -------- | ---------- | ------------ |
| Qwen2.5VL-3B | BF16 | 47.11 | 78.57 | 80.32 |
| | FP8-Static | 47.33 | 79.34 | 79.68 |
| | FP8-Dynamic | 45.99 | 46.93 | 38.29 |
| | INT4-GPTQ | 46.56 | 77.20 | 78.96 |
| | INT4-AWQ | 45.78 | - | 79.60 |
| Qwen2.5VL-7B | BF16 | 45.44 | 89.71 | 84.64 |
| | FP8-Static | 47.00 | 89.83 | 85.92 |
| | FP8-Dynamic | 47.22 | 89.80 | 88.64 |
| | INT4-GPTQ | 46.67 | 90.45 | - |
| | INT4-AWQ | 45.67 | 89.28 | - |
| Qwen2.5VL-32B | BF16 | 57.00 | 90.03 | - |
| | FP8-Static | 57.00 | 89.88 | - |
| | FP8-Dynamic | 56.44 | 89.88 | - |
| | INT4-GPTQ | 55.22 | 89.80 | - |
| | INT4-AWQ | 55.22 | 90.30 | - |
| Qwen2.5VL-72B | BF16 | 58.78 | 94.39 | 85.60 |
| | FP8-Static | 57.89 | 94.41 | 85.84 |
| | FP8-Dynamic | 58.67 | 94.38 | 85.60 |
| | INT4-GPTQ | 57.56 | 94.46 | 86.48 |
| | INT4-AWQ | 58.78 | 94.19 | 87.28 |
#### DeepSeek Series Models
Benchmark results for the DeepSeek-R1-0528 model with `FP8-Block-Wise` and `W4A8-FP8` quantization algorithms on datasets including `GPQA Diamond`, `AIME 2024`, `SimpleQA`, and `LiveCodeBench`:
| Model | Quantization | GPQA Diamond | AIME 2024 | SimpleQA | LiveCodeBench |
| ----- | ------------ | ------------ | --------- | -------- | ------------- |
| DeepSeek-R1-0528 | FP8-Block-Wise | 78.28 | 88.67 | 27.8 | 77.1 |
| | W4A8-FP8 | 77.37 | 88.67 | 26.83 | 78.86 |
> **Note**:
> - The above results are averaged over 5 test runs, with the models deployed via TRT-LLM.
> - The hyperparameters used during evaluation are as follows:
> ```json
> {
>   "top_k": 20,
>   "top_p": 0.6,
>   "temperature": 0.7,
>   "output_seq_len": 32768,
>   "max_input_seq_len": 16384
> }
> ```
#### Other Models
Benchmark results for other models with `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU` and `GSM8K`:
| Model | Quantization | CEVAL | MMLU | GSM8K |
| ----- | ------------ | ----- | ---- | ----- |
| Qwen2.5-1.5B-Instruct | BF16 | 67.01 | 60.05 | 54.28 |
| | FP8-Static | 66.27 | 60.23 | - |
| | FP8-Dynamic | 66.79 | 60.08 | 51.71 |
| Qwen2.5-7B-Instruct | BF16 | 81.20 | 74.55 | 79.98 |
| | FP8-Static | 81.13 | 74.03 | 79.30 |
| | FP8-Dynamic | 80.31 | 74.07 | 79.00 |
| | INT4-GPTQ | 79.05 | 73.05 | 74.75 |
| | INT4-AWQ | 79.35 | 73.22 | 79.38 |
| Qwen2.5-32B-Instruct | BF16 | 87.30 | 83.21 | 81.73 |
| | FP8-Static | 87.59 | 83.08 | 81.58 |
| | FP8-Dynamic | 87.30 | 83.04 | 81.58 |
| | INT4-GPTQ | 86.70 | 82.45 | 82.03 |
| | INT4-AWQ | 87.00 | 82.64 | - |
| DeepSeek-R1-Distill-Qwen-7B | BF16 | 53.49 | 53.80 | 75.74 |
| | FP8-Static | 53.57 | 54.17 | 76.19 |
| | FP8-Dynamic | 52.97 | 54.13 | 74.15 |
| | INT4-GPTQ | 51.86 | 52.44 | 75.89 |
| | INT4-AWQ | 53.49 | 53.70 | - |
| DeepSeek-R1-Distill-Qwen-14B | BF16 | 77.71 | 74.28 | 85.67 |
| | FP8-Static | 77.56 | 74.66 | 86.73 |
| | FP8-Dynamic | 76.82 | 74.63 | 87.11 |
| | INT4-GPTQ | 74.29 | 72.37 | 84.61 |
| | INT4-AWQ | 74.81 | 73.00 | 86.05 |
| DeepSeek-R1-Distill-Qwen-32B | BF16 | 84.18 | 80.89 | 87.41 |
| | FP8-Static | 83.43 | 80.90 | 87.57 |
| | FP8-Dynamic | 83.73 | 81.10 | 86.43 |
| | INT4-GPTQ | 84.10 | 79.80 | 86.73 |
| | INT4-AWQ | 82.84 | 80.15 | 87.19 |
### (2) Speculative Decoding
#### Qwen3 Series Models
Benchmark results for Qwen3 series models with the `Eagle3` speculative decoding algorithm on datasets including `MT-bench`, `HumanEval`, `GSM8K`, and `Alpaca` (each dataset reports the speedup and τ):

| Temperature | Model | MT-bench Speedup | MT-bench τ | HumanEval Speedup | HumanEval τ | GSM8K Speedup | GSM8K τ | Alpaca Speedup | Alpaca τ | Mean Speedup | Mean τ |
| ----------- | ----- | ---------------- | ---------- | ----------------- | ----------- | ------------- | ------- | -------------- | -------- | ------------ | ------ |
| T=0 | Qwen3-1.7B | 2.05x | 2.81 | 2.07x | 2.93 | 2.11x | 2.98 | 1.93x | 2.69 | 2.04x | 2.85 |
| | Qwen3-4B | 2.21x | 3.01 | 2.36x | 3.24 | 2.42x | 3.13 | 2.32x | 2.75 | 2.33x | 3.03 |
| | Qwen3-8B | 2.63x | 3.65 | 2.76x | 3.85 | 2.82x | 3.90 | 2.62x | 3.48 | 2.70x | 3.72 |
| | Qwen3-14B | 2.23x | 3.30 | 2.53x | 3.74 | 2.56x | 3.79 | 2.16x | 3.13 | 2.37x | 3.49 |
| | Qwen3-32B | 2.39x | 2.78 | 2.37x | 2.81 | 2.47x | 2.92 | 2.42x | 2.53 | 2.41x | 2.76 |
| | Qwen3-30B-A3B | 2.84x | 3.63 | 2.27x | 3.09 | 2.64x | 3.42 | 2.83x | 3.56 | 2.64x | 3.42 |
| T=1 | Qwen3-1.7B | 1.74x | 2.53 | 1.86x | 2.70 | 1.82x | 2.69 | 1.72x | 2.46 | 1.93x | 2.60 |
| | Qwen3-4B | 1.93x | 2.60 | 2.00x | 2.84 | 2.11x | 2.82 | 2.34x | 2.50 | 1.75x | 2.69 |
| | Qwen3-8B | 1.98x | 2.75 | 2.25x | 3.11 | 2.31x | 3.15 | 2.10x | 2.76 | 2.90x | 2.94 |
| | Qwen3-14B | 1.71x | 2.61 | 1.95x | 2.87 | 2.04x | 3.08 | 1.68x | 2.55 | 2.90x | 2.78 |
| | Qwen3-32B | 1.62x | 1.91 | 1.71x | 2.05 | 1.78x | 2.10 | 1.80x | 1.95 | 1.62x | 2.00 |
| | Qwen3-30B-A3B | 1.91x | 2.46 | 2.00x | 2.64 | 1.90x | 2.53 | 1.80x | 2.32 | 1.90x | 2.48 |
#### Hunyuan Series Models
Benchmark results for Hunyuan series models with the `Eagle3` speculative decoding algorithm on datasets including `MT-bench`, `HumanEval`, `GSM8K`, and `Alpaca` (each dataset reports the speedup and τ):

| Temperature | Model | MT-bench Speedup | MT-bench τ | HumanEval Speedup | HumanEval τ | GSM8K Speedup | GSM8K τ | Alpaca Speedup | Alpaca τ | Mean Speedup | Mean τ |
| ----------- | ----- | ---------------- | ---------- | ----------------- | ----------- | ------------- | ------- | -------------- | -------- | ------------ | ------ |
| T=0 | Hunyuan-1.8B-Instruct | 1.97x | 2.90 | 2.58x | 3.73 | 2.61x | 3.71 | 1.71x | 2.43 | 2.22x | 3.19 |
| | Hunyuan-4B-Instruct | 1.77x | 2.60 | 2.64x | 3.35 | 2.14x | 3.17 | 1.72x | 2.57 | 2.07x | 2.92 |
| | Hunyuan-7B-Instruct | 2.22x | 3.58 | 3.59x | 5.47 | 2.96x | 4.68 | 1.64x | 2.56 | 2.60x | 4.07 |
| T=1 | Hunyuan-1.8B-Instruct | 1.58x | 2.36 | 2.35x | 3.56 | 2.23x | 3.38 | 1.26x | 1.87 | 1.86x | 2.79 |
| | Hunyuan-4B-Instruct | 1.36x | 2.05 | 1.97x | 2.86 | 1.72x | 2.68 | 1.14x | 1.76 | 1.55x | 2.34 |
| | Hunyuan-7B-Instruct | 1.90x | 3.11 | 3.12x | 5.09 | 2.74x | 4.34 | 1.47x | 2.39 | 2.31x | 3.73 |
## 📝 License
The code for this project is open-sourced under the [License for AngelSlim](LICENSE).
## 🔗 Citation
```bibtex
@software{AngelSlim2025,
  title={{AngelSlim}},
  author={Tencent AngelSlim Project Contributors},
  year={2025},
  month={6},
  url={https://github.com/Tencent/AngelSlim},
}
```
## 💬 Technical Discussion
* AngelSlim is under active development, and new features will be released soon. If you have any questions or suggestions, please open an issue on [GitHub Issues](https://github.com/Tencent/AngelSlim/issues) or join our [WeChat technical discussion group](./docs/source/assets/angel_slim_wechat.png).