# AngelSlim

Dedicated to building a more intuitive, comprehensive, and efficient LLM compression toolkit.

📖 Documentation   |   🤗 Hugging Face   |   🤖 ModelScope   |   💬 WeChat

## Table of Contents

- [Latest Updates](#latest-updates)
- [Key Features](#key-features)
- [Supported Models](#supported-models)
- [How to Use](#how-to-use)
- [Install AngelSlim](#install-angelslim)
- [Quick Start](#quick-start)
- [Deployment & Evaluation](#deployment)
- [Benchmark](#benchmark)
- [License](#license)
- [Citation](#citation)
- [Technical Discussion](#technical-discussion)

## 📣Latest Updates

- [25/11/05] We have released v0.2, which adds quantization support for new models such as `GLM-4.6`, `Qwen3-VL`, and `Qwen3-Omni`, open-sources the Eagle3 speculative decoding training framework, and updates the Diffusion model quantization tools.
- [25/09/30] We have released **SpecExit**, the reasoning early-exit algorithm: [[Paper]](http://arxiv.org/abs/2509.24248) | [[Docs]](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/spec_exit.html) | [[vLLM Code]](https://github.com/vllm-project/vllm/pull/27192)🔥🔥🔥
- [25/09/26] We have released **TEQUILA**, the ternary quantization algorithm: [[Paper]](https://arxiv.org/abs/2509.23809) | [[Code]](https://github.com/Tencent/AngelSlim/tree/tequila/TernaryQuant)🔥🔥🔥
- [25/09/24] We now support NVFP4 PTQ quantization for the Qwen3 series models. We also open-source [Qwen3-32B-NVFP4](https://huggingface.co/AngelSlim/Qwen3-32B_nvfp4) and [Qwen3-235B-A22B-NVFP4](https://huggingface.co/AngelSlim/Qwen3-235B-A22B_nvfp4) weights.
- [25/09/01] We now support FP8 quantization of the [Hunyuan-MT-7B](https://huggingface.co/tencent/Hunyuan-MT-7B-fp8) translation model, Torch inference and benchmark evaluation for Eagle3, quantization and cache support for [FLUX](https://github.com/Tencent/AngelSlim/tree/main/configs/flux), and quantization for [Seed-OSS](https://github.com/Tencent/AngelSlim/tree/main/configs/seed_oss).
- [25/08/06] We now support quantization for `Hunyuan 0.5B/1.8B/4B/7B` and the multimodal `Qwen2.5VL 3B/7B/32B/72B` models, including `FP8/INT4` algorithms, as well as quantization for `DeepSeek-R1/V3` and `Kimi-K2`, including `FP8-Static` and `W4A8-FP8` algorithms. We also open-source Eagle3 model weights for the `Hunyuan 1.8B/4B/7B` series.
- [25/07/04] We now support quantization for `Hunyuan/Qwen2.5/Qwen3/DeepSeek-R1-Distill-Qwen` and other models, including `INT8/FP8/INT4` algorithms. We also open-source Eagle3 model weights for the `Qwen3` series.

Coming soon:

- [ ] Diffusion model compression support.
- [ ] Release of a new speculative sampling algorithm.

## 🌟Key Features

- **Highly Integrated**: This toolkit integrates mainstream compression algorithms into a unified framework, offering developers one-click access with exceptional ease of use.
- **Continuous Innovation**: Beyond integrating widely-used industry algorithms, we are continuously researching better compression algorithms, which will be gradually open-sourced in the future.
- **Performance-Driven**: We continuously optimize end-to-end performance in model compression workflows and algorithm deployment, such as enabling quantization of models like Qwen3-235B and DeepSeek-R1 on a single GPU.
## 💼Supported Models

### Quantization

Currently supported LLMs include Hunyuan-Dense, Hunyuan-MoE, Qwen3-Dense, Qwen3-MoE, Qwen2.5, DeepSeek-R1 distilled Qwen models, and QwQ:

| Model | FP8-Dynamic | FP8-Static | INT8-Dynamic | INT4-GPTQ | INT4-AWQ |
| ----- | ----------- | ---------- | ------------ | --------- | -------- |
| [Hunyuan-Dense](https://huggingface.co/collections/tencent/hunyuan-dense-model-6890632cda26b19119c9c5e7) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Hunyuan-MoE](https://huggingface.co/collections/tencent/hunyuan-a13b-685ec38e5b46321e3ea7c4be) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen3-Dense](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen3-MoE](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen2.5](https://huggingface.co/collections/AngelSlim/qwen2-25-quant-68652d6cbdf5c0d4b1c4499a) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [DeepSeek-R1-Distill-Qwen](https://huggingface.co/collections/AngelSlim/deepseek-r1-distill-quant-68652f16a9c206b030b05f7f) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [QwQ](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |

### Speculative Decoding

#### Eagle3

Eagle3 weights for the Qwen3 and Hunyuan series models are now available.

| Qwen3 Models | Hunyuan Models |
| ------------ | -------------- |
| ✅ [Qwen3-1.7B](https://huggingface.co/AngelSlim/Qwen3-1.7B_eagle3) | ✅ [Hunyuan-1.8B-Instruct](https://huggingface.co/AngelSlim/Hunyuan-1.8B-Instruct_eagle3) |
| ✅ [Qwen3-4B](https://huggingface.co/AngelSlim/Qwen3-4B_eagle3) | ✅ [Hunyuan-4B-Instruct](https://huggingface.co/AngelSlim/Hunyuan-4B-Instruct_eagle3) |
| ✅ [Qwen3-8B](https://huggingface.co/AngelSlim/Qwen3-8B_eagle3) | ✅ [Hunyuan-7B-Instruct](https://huggingface.co/AngelSlim/Hunyuan-7B-Instruct_eagle3) |
| ✅ [Qwen3-14B](https://huggingface.co/AngelSlim/Qwen3-14B_eagle3) | |
| ✅ [Qwen3-32B](https://huggingface.co/AngelSlim/Qwen3-32B_eagle3) | |
| ✅ [Qwen3-30B-A3B](https://huggingface.co/AngelSlim/Qwen3-a3B_eagle3) | |

## 🛎️How to Use

### Install AngelSlim

We recommend using `pip` to install the latest stable version of `AngelSlim`:

```shell
pip install angelslim
```

Alternatively, you can clone the repository and install from source:

```shell
cd AngelSlim && python setup.py install
```

For more detailed installation instructions, please refer to the [Installation Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/installation.html).

### Quick Start

#### Quantization

After installing `AngelSlim`, you can quickly start by running the following script to perform static `FP8` quantization on the `Qwen3-1.7B` model:

* One-click Start

```shell
python3 tools/run.py -c configs/qwen3/fp8_static/qwen3-1_7b_fp8_static.yaml
```

This example loads the HuggingFace model, performs activation calibration using the `dataset` specified in the config file, and saves the quantized model weights.
* Code-based Start

To perform dynamic `FP8` quantization on `Qwen3-1.7B`:

```python
from angelslim.engine import Engine

slim_engine = Engine()
# Prepare model
slim_engine.prepare_model(model_name="Qwen", model_path="Qwen/Qwen3-1.7B")
# Initialize compressor
slim_engine.prepare_compressor("PTQ", default_method="fp8_dynamic")
# Compress model
slim_engine.run()
# Save compressed model
slim_engine.save("./output")
```

For more details, please refer to the [Quick Start Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/quickstrat.html).

#### Speculative Decoding

##### Eagle3 PyTorch Performance Testing

After installing `AngelSlim`, you can quickly start Eagle3 PyTorch performance testing with the following script:

```bash
python3 tools/spec_benchmark.py \
    --base-model-path /path/to/base/model \
    --eagle-model-path /path/to/eagle/model \
    --model-id your_model_id \
    --mode both
```

For more details, please refer to the [Quick Start Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/quickstrat.html).

### Deployment and Testing

#### 1. Offline Inference

If you need to load a quantized model via `transformers`, set `deploy_backend: huggingface` in the `global` configuration before quantizing the model, or manually change the `ignored_layers` field in the `config.json` file of the quantized model output directory to `ignore`.

To test offline inference with a quantized model loaded via `transformers`, run:

```shell
python deploy/offline.py $MODEL_PATH
```

where `MODEL_PATH` is the path to the quantized model output.

#### 2. API Service Deployment

After specifying the quantized model path `MODEL_PATH`, you can deploy an OpenAI-compatible API service using one of the following LLM inference frameworks:

**vLLM**

Use the following script to launch a [vLLM](https://github.com/vllm-project/vllm) server; `vllm>=0.8.5.post1` is recommended. For MoE INT8 quantized models, `vllm>=0.9.0` is required.

```shell
bash deploy/run_vllm.sh $MODEL_PATH
```

**SGLang**

Use the following script to launch an [SGLang](https://github.com/sgl-project/sglang) server; `sglang>=0.4.6.post1` is recommended.

```shell
bash deploy/run_sglang.sh $MODEL_PATH
```

#### 3. Service Invocation

Send requests in [OpenAI's API format](https://platform.openai.com/docs/api-reference/introduction):

```shell
bash deploy/openai.sh $MODEL_PATH
```

#### 4. Performance Evaluation

Evaluate the performance of the quantized model using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness); `lm-eval>=0.4.8` is recommended:

```shell
bash deploy/lm_eval.sh $MODEL_PATH
```

For more details, please refer to the [Deployment Documentation](https://angelslim.readthedocs.io/zh-cn/latest/deployment/deploy.html).
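For a quick sanity check of a deployed service without the helper scripts, you can also call the OpenAI-compatible endpoint directly from Python. The snippet below is a minimal sketch, assuming the vLLM or SGLang server launched above listens on `http://localhost:8000/v1` (adjust the port to match your launch script) and that the request's `model` field matches the served `MODEL_PATH`:

```python
# Minimal sketch: query the OpenAI-compatible server started by
# deploy/run_vllm.sh or deploy/run_sglang.sh (port assumed to be 8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="/path/to/quantized/model",  # replace with the served MODEL_PATH
    messages=[{"role": "user", "content": "Briefly explain FP8 quantization."}],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```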
## 📈 Benchmark

### (1) Quantization

The performance test results for selected models are shown below. For the complete benchmark, refer to the [Benchmark documentation](https://angelslim.readthedocs.io/zh-cn/latest/performance/quantization/benchmarks.html).

#### Hunyuan Series Models

Benchmark results for the `Hunyuan-Instruct` models with `FP8`, `INT4-AWQ` and `INT4-GPTQ` quantization algorithms on datasets including `OlympiadBench`, `AIME 2024` and `DROP`:

| Model | Quantization | OlympiadBench | AIME 2024 | DROP | GPQA-Diamond |
| ----- | ------------ | ------------- | --------- | ---- | ------------ |
| Hunyuan-A13B-Instruct | BF16 | 82.7 | 87.30 | 91.1 | 71.2 |
| | FP8-Static | 83.0 | 86.7 | 91.1 | - |
| | Int4-GPTQ | 82.7 | 86.7 | 91.1 | - |
| | Int4-AWQ | 82.6 | 85.6 | 91.0 | - |
| Hunyuan-7B-Instruct | BF16 | 76.5 | 81.1 | 85.9 | 60.1 |
| | FP8-Static | 76.6 | 80.9 | 86.0 | 60.1 |
| | Int4-GPTQ | 76.2 | 81.0 | 85.7 | 60.0 |
| | Int4-AWQ | 76.4 | 80.9 | 85.9 | 60.1 |
| Hunyuan-4B-Instruct | BF16 | 73.1 | 78.3 | 78.2 | 61.1 |
| | FP8-Static | 73.1 | 76.6 | 78.3 | 60.2 |
| | Int4-GPTQ | 72.9 | - | 78.1 | 58.1 |
| | Int4-AWQ | 72.8 | - | 78.2 | - |
| Hunyuan-1.8B-Instruct | BF16 | 63.4 | 56.7 | 76.7 | 47.2 |
| | FP8-Static | 62.5 | 55.2 | 75.1 | 47.7 |
| | Int4-GPTQ | 60.9 | - | 73.0 | 44.4 |
| | Int4-AWQ | 61.7 | - | 71.7 | 43.6 |
| Hunyuan-0.5B-Instruct | BF16 | 29.6 | 17.2 | 52.8 | 23.3 |
| | FP8-Static | 29.6 | 17.2 | 51.6 | 22.5 |
| | Int4-GPTQ | 26.8 | - | 50.9 | 23.3 |
| | Int4-AWQ | 26.3 | - | 48.9 | 23.3 |
#### Qwen3 Series Models

Benchmark results for Qwen3 series models with `FP8-Static`, `FP8-Dynamic`, `INT8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU`, `GSM8K`, and `HUMANEVAL`:
| Model | Quantization | CEVAL | MMLU | GSM8K | HUMANEVAL |
| ----- | ------------ | ----- | ---- | ----- | --------- |
| Qwen3-0.6B | BF16 | 45.84 | 47.21 | 42.99 | 19.51 |
| | FP8-Static | 45.99 | 46.87 | 38.06 | 18.90 |
| | FP8-Dynamic | 45.99 | 46.93 | 38.29 | 20.73 |
| | INT8-Dynamic | 45.17 | 46.95 | 41.17 | 21.34 |
| Qwen3-8B | BF16 | 79.27 | 74.78 | 87.79 | 63.41 |
| | FP8-Static | 78.23 | 74.79 | 86.96 | 62.20 |
| | FP8-Dynamic | 78.45 | 74.75 | 87.64 | 62.80 |
| | INT8-Dynamic | 78.01 | 74.84 | 86.96 | 67.07 |
| | INT4-GPTQ | 77.19 | 73.26 | 86.43 | 62.20 |
| | INT4-AWQ | 76.15 | 73.59 | 86.96 | 63.41 |
| Qwen3-14B | BF16 | 83.06 | 78.90 | 88.40 | 55.49 |
| | FP8-Static | 82.62 | 78.57 | 89.46 | 57.32 |
| | FP8-Dynamic | 82.24 | 78.92 | 88.32 | 52.44 |
| | INT8-Dynamic | 81.87 | 78.13 | 86.28 | 56.10 |
| | INT4-GPTQ | 81.05 | 78.02 | 87.34 | 57.93 |
| | INT4-AWQ | 82.02 | 77.68 | 84.23 | 61.59 |
| Qwen3-32B | BF16 | 86.55 | 82.00 | 74.53 | 37.80 |
| | FP8-Static | 86.92 | 81.78 | 70.20 | 39.63 |
| | FP8-Dynamic | 86.55 | 81.89 | 70.43 | 38.41 |
| | INT4-GPTQ | 86.18 | 81.01 | - | 43.29 |
| | INT4-AWQ | 86.18 | 81.54 | - | 36.59 |
| Qwen3-30B-A3B | BF16 | 83.66 | 79.36 | 89.99 | 31.71 |
| | FP8-Static | 83.95 | 79.47 | 89.01 | 31.10 |
| | FP8-Dynamic | 84.10 | 79.40 | 89.16 | 32.93 |
| | INT8-Dynamic | 83.36 | 79.48 | 89.16 | 34.15 |
| Qwen3-235B-A22B | BF16 | 89.60 | 86.28 | 85.29 | 27.44 |
| | FP8-Static | 89.67 | 86.19 | 86.96 | 27.44 |
| | FP8-Dynamic | 89.67 | 86.18 | 85.22 | 28.05 |
| | INT8-Dynamic | 88.93 | 86.20 | 86.20 | 23.78 |
| QwQ-32B | BF16 | 85.74 | 82.03 | 73.31 | 42.68 |
| | FP8-Static | 85.44 | 81.91 | 75.36 | 42.68 |
| | FP8-Dynamic | 85.07 | 81.93 | 75.66 | 42.07 |
| | INT4-GPTQ | 84.03 | 81.26 | 68.23 | 45.73 |
| | INT4-AWQ | 83.58 | 81.01 | 68.69 | 43.29 |
#### Qwen2.5VL Series Models

Benchmark results for Qwen2.5VL series models with `BF16`, `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `MMMU_VAL`, `DocVQA_VAL` and `ChartQA_TEST`:
| Model | Quantization | MMMU_VAL | DocVQA_VAL | ChartQA_TEST |
| ----- | ------------ | -------- | ---------- | ------------ |
| Qwen2.5VL-3B | BF16 | 47.11 | 78.57 | 80.32 |
| | FP8-Static | 47.33 | 79.34 | 79.68 |
| | FP8-Dynamic | 45.99 | 46.93 | 38.29 |
| | INT4-GPTQ | 46.56 | 77.20 | 78.96 |
| | INT4-AWQ | 45.78 | - | 79.60 |
| Qwen2.5VL-7B | BF16 | 45.44 | 89.71 | 84.64 |
| | FP8-Static | 47.00 | 89.83 | 85.92 |
| | FP8-Dynamic | 47.22 | 89.80 | 88.64 |
| | INT4-GPTQ | 46.67 | 90.45 | - |
| | INT4-AWQ | 45.67 | 89.28 | - |
| Qwen2.5VL-32B | BF16 | 57.00 | 90.03 | - |
| | FP8-Static | 57.00 | 89.88 | - |
| | FP8-Dynamic | 56.44 | 89.88 | - |
| | INT4-GPTQ | 55.22 | 89.80 | - |
| | INT4-AWQ | 55.22 | 90.30 | - |
| Qwen2.5VL-72B | BF16 | 58.78 | 94.39 | 85.60 |
| | FP8-Static | 57.89 | 94.41 | 85.84 |
| | FP8-Dynamic | 58.67 | 94.38 | 85.60 |
| | INT4-GPTQ | 57.56 | 94.46 | 86.48 |
| | INT4-AWQ | 58.78 | 94.19 | 87.28 |
#### DeepSeek Series Models

Benchmark results for DeepSeek-R1-0528 series models with `FP8-Block-Wise` and `W4A8-FP8` quantization algorithms on datasets including `GPQA Diamond`, `AIME 2024`, `SimpleQA` and `LiveCodeBench`:
| Model | Quantization | GPQA Diamond | AIME 2024 | SimpleQA | LiveCodeBench |
| ----- | ------------ | ------------ | --------- | -------- | ------------- |
| DeepSeek-R1-0528 | FP8-Block-Wise | 78.28 | 88.67 | 27.8 | 77.1 |
| | W4A8-FP8 | 77.37 | 88.67 | 26.83 | 78.86 |
> **Note**:
> - The above results are based on the average of 5 test runs deployed with TRT-LLM.
> - The hyperparameters used during evaluation are as follows:
> ```json
> {
>   "top_k": 20,
>   "top_p": 0.6,
>   "temperature": 0.7,
>   "output_seq_len": 32768,
>   "max_input_seq_len": 16384
> }
> ```

#### Other Models

Benchmark results for other models with `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU` and `GSM8K`:
| Model | Quantization | CEVAL | MMLU | GSM8K |
| ----- | ------------ | ----- | ---- | ----- |
| Qwen2.5-1.5B-Instruct | BF16 | 67.01 | 60.05 | 54.28 |
| | FP8-Static | 66.27 | 60.23 | - |
| | FP8-Dynamic | 66.79 | 60.08 | 51.71 |
| Qwen2.5-7B-Instruct | BF16 | 81.20 | 74.55 | 79.98 |
| | FP8-Static | 81.13 | 74.03 | 79.30 |
| | FP8-Dynamic | 80.31 | 74.07 | 79.00 |
| | INT4-GPTQ | 79.05 | 73.05 | 74.75 |
| | INT4-AWQ | 79.35 | 73.22 | 79.38 |
| Qwen2.5-32B-Instruct | BF16 | 87.30 | 83.21 | 81.73 |
| | FP8-Static | 87.59 | 83.08 | 81.58 |
| | FP8-Dynamic | 87.30 | 83.04 | 81.58 |
| | INT4-GPTQ | 86.70 | 82.45 | 82.03 |
| | INT4-AWQ | 87.00 | 82.64 | - |
| DeepSeek-R1-Distill-Qwen-7B | BF16 | 53.49 | 53.80 | 75.74 |
| | FP8-Static | 53.57 | 54.17 | 76.19 |
| | FP8-Dynamic | 52.97 | 54.13 | 74.15 |
| | INT4-GPTQ | 51.86 | 52.44 | 75.89 |
| | INT4-AWQ | 53.49 | 53.70 | - |
| DeepSeek-R1-Distill-Qwen-14B | BF16 | 77.71 | 74.28 | 85.67 |
| | FP8-Static | 77.56 | 74.66 | 86.73 |
| | FP8-Dynamic | 76.82 | 74.63 | 87.11 |
| | INT4-GPTQ | 74.29 | 72.37 | 84.61 |
| | INT4-AWQ | 74.81 | 73.00 | 86.05 |
| DeepSeek-R1-Distill-Qwen-32B | BF16 | 84.18 | 80.89 | 87.41 |
| | FP8-Static | 83.43 | 80.90 | 87.57 |
| | FP8-Dynamic | 83.73 | 81.10 | 86.43 |
| | INT4-GPTQ | 84.10 | 79.80 | 86.73 |
| | INT4-AWQ | 82.84 | 80.15 | 87.19 |
### (2) Speculative Decoding

#### Qwen3 Series Models

Benchmark results for Qwen3 series models with the `Eagle3` speculative decoding algorithm on datasets including `MT-bench`, `HumanEval`, `GSM8K`, and `Alpaca`:

| Temperature | Model | MT-bench Speedup | MT-bench τ | HumanEval Speedup | HumanEval τ | GSM8K Speedup | GSM8K τ | Alpaca Speedup | Alpaca τ | Mean Speedup | Mean τ |
| ----------- | ----- | ---------------- | ---------- | ----------------- | ----------- | ------------- | ------- | -------------- | -------- | ------------ | ------ |
| T=0 | Qwen3-1.7B | 2.05x | 2.81 | 2.07x | 2.93 | 2.11x | 2.98 | 1.93x | 2.69 | 2.04x | 2.85 |
| | Qwen3-4B | 2.21x | 3.01 | 2.36x | 3.24 | 2.42x | 3.13 | 2.32x | 2.75 | 2.33x | 3.03 |
| | Qwen3-8B | 2.63x | 3.65 | 2.76x | 3.85 | 2.82x | 3.90 | 2.62x | 3.48 | 2.70x | 3.72 |
| | Qwen3-14B | 2.23x | 3.30 | 2.53x | 3.74 | 2.56x | 3.79 | 2.16x | 3.13 | 2.37x | 3.49 |
| | Qwen3-32B | 2.39x | 2.78 | 2.37x | 2.81 | 2.47x | 2.92 | 2.42x | 2.53 | 2.41x | 2.76 |
| | Qwen3-30B-A3B | 2.84x | 3.63 | 2.27x | 3.09 | 2.64x | 3.42 | 2.83x | 3.56 | 2.64x | 3.42 |
| T=1 | Qwen3-1.7B | 1.74x | 2.53 | 1.86x | 2.70 | 1.82x | 2.69 | 1.72x | 2.46 | 1.93x | 2.60 |
| | Qwen3-4B | 1.93x | 2.60 | 2.00x | 2.84 | 2.11x | 2.82 | 2.34x | 2.50 | 1.75x | 2.69 |
| | Qwen3-8B | 1.98x | 2.75 | 2.25x | 3.11 | 2.31x | 3.15 | 2.10x | 2.76 | 2.90x | 2.94 |
| | Qwen3-14B | 1.71x | 2.61 | 1.95x | 2.87 | 2.04x | 3.08 | 1.68x | 2.55 | 2.90x | 2.78 |
| | Qwen3-32B | 1.62x | 1.91 | 1.71x | 2.05 | 1.78x | 2.10 | 1.80x | 1.95 | 1.62x | 2.00 |
| | Qwen3-30B-A3B | 1.91x | 2.46 | 2.00x | 2.64 | 1.90x | 2.53 | 1.80x | 2.32 | 1.90x | 2.48 |
#### Hunyuan Series Models

Benchmark results for Hunyuan series models with the `Eagle3` speculative decoding algorithm on datasets including `MT-bench`, `HumanEval`, `GSM8K`, and `Alpaca`:

| Temperature | Model | MT-bench Speedup | MT-bench τ | HumanEval Speedup | HumanEval τ | GSM8K Speedup | GSM8K τ | Alpaca Speedup | Alpaca τ | Mean Speedup | Mean τ |
| ----------- | ----- | ---------------- | ---------- | ----------------- | ----------- | ------------- | ------- | -------------- | -------- | ------------ | ------ |
| T=0 | Hunyuan-1.8B-Instruct | 1.97x | 2.90 | 2.58x | 3.73 | 2.61x | 3.71 | 1.71x | 2.43 | 2.22x | 3.19 |
| | Hunyuan-4B-Instruct | 1.77x | 2.60 | 2.64x | 3.35 | 2.14x | 3.17 | 1.72x | 2.57 | 2.07x | 2.92 |
| | Hunyuan-7B-Instruct | 2.22x | 3.58 | 3.59x | 5.47 | 2.96x | 4.68 | 1.64x | 2.56 | 2.60x | 4.07 |
| T=1 | Hunyuan-1.8B-Instruct | 1.58x | 2.36 | 2.35x | 3.56 | 2.23x | 3.38 | 1.26x | 1.87 | 1.86x | 2.79 |
| | Hunyuan-4B-Instruct | 1.36x | 2.05 | 1.97x | 2.86 | 1.72x | 2.68 | 1.14x | 1.76 | 1.55x | 2.34 |
| | Hunyuan-7B-Instruct | 1.90x | 3.11 | 3.12x | 5.09 | 2.74x | 4.34 | 1.47x | 2.39 | 2.31x | 3.73 |
## 📝 License

The code for this project is open-sourced under the [License for AngelSlim](LICENSE).

## 🔗 Citation

```
@software{AngelSlim2025,
    title={{AngelSlim}},
    author={Tencent AngelSlim Project Contributors},
    year={2025},
    month={6},
    url={https://github.com/Tencent/AngelSlim},
}
```

## 💬 Technical Discussion

* AngelSlim is continuously iterating, and new features will be released soon. If you have any questions or suggestions, please open an issue on [GitHub Issues](https://github.com/Tencent/AngelSlim/issues) or join our [WeChat technical discussion group](./docs/source/assets/angel_slim_wechat.png).