---
language:
- en
license: apache-2.0
tags:
- quantization
- sinq
- int3
- efficient-inference
- text-generation
- qwen
- llm
- compression
base_model:
- Qwen/Qwen3-Next-80B-A3B-Instruct
base_model_relation: quantized
---

<p align="center">
  <img src="logo.png" alt="Logo" style="max-width: 80%; height: auto;">
</p>

<p align="center">🐙 <a href="https://github.com/huawei-csl/SINQ">GitHub</a>&nbsp;&nbsp; | &nbsp;&nbsp;📄 <a href="http://arxiv.org/abs/2509.22944">Paper</a></p>

# SINQ 3-bit Quantized Qwen3-Next 80B Model

This repository contains the official **3-bit quantized** version of the [`Qwen3-Next-80B`](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct) model, produced with the **SINQ (Sinkhorn-Normalized Quantization)** method.
SINQ is a novel, fast, high-quality quantization method designed to make any Large Language Model smaller while keeping its accuracy almost intact.

To support the project, please star ⭐ the official [SINQ](https://github.com/huawei-csl/SINQ) GitHub repository.

## Model Details
- **Model Name:** `Qwen3-Next-80B-A3B-Instruct-3bit-SINQ`
- **Base Model:** [`Qwen3-Next-80B`](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct)
- **Task:** Text Generation
- **Framework:** PyTorch / Transformers
- **License:** [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Quantized By:** *Huawei - Computing Systems Lab*

## Quantization Details

- **Quantization Method:** SINQ (Sinkhorn-Normalized Quantization)
- **Precision:** INT3
- **Group Size:** 64
- **Framework:** PyTorch
- **Quantization Library:** `sinq`
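
As a rough, unofficial back-of-envelope illustration of what INT3 with group size 64 buys: each weight costs 3 bits plus a share of its group's scale metadata, versus 16 bits in bf16. The sketch below assumes a single fp16 scale per 64-weight group and ignores SINQ's dual-axis normalization factors and any modules kept in higher precision, so treat the numbers as indicative only:

```python
# Indicative weight-memory estimate for INT3, group size 64.
# Assumptions: one fp16 scale per group; ignores SINQ's dual-axis scale
# vectors, zero-points, and any layers kept in higher precision.
params = 80e9        # ~80B parameters
wbits = 3            # INT3 payload per weight
group_size = 64
scale_bits = 16      # assumed fp16 scale shared by each 64-weight group

bf16_gb = params * 16 / 8 / 1e9
int3_gb = params * (wbits + scale_bits / group_size) / 8 / 1e9

print(f"bf16 weights : ~{bf16_gb:.1f} GB")  # ~160.0 GB
print(f"INT3 weights : ~{int3_gb:.1f} GB")  # ~32.5 GB
```

Under these assumptions, that is roughly a 5× reduction in weight storage.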

---

# 🚀 Usage

## Prerequisite
Before running the examples below, make sure the **SINQ** library is installed.
Installation instructions and setup details are available in the [official SINQ GitHub repository](https://github.com/huawei-csl/SINQ).

## Usage Example
You can load and use the model with our wrapper based on the 🤗 Transformers library:

```python
import torch
from transformers import AutoTokenizer
from sinq.patch_model import AutoSINQHFModel

model_name = "huawei-csl/Qwen3-Next-80B-A3B-Instruct-3bit-SINQ"
tokenizer = AutoTokenizer.from_pretrained(model_name)
sinq_model = AutoSINQHFModel.from_quantized_safetensors(
    model_name,
    device="cuda:0",
    compute_dtype=torch.bfloat16,
)

# Prepare the model input
prompt = "Explain neural network quantization in one sentence."
messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(sinq_model.device)

# Run text completion
generated_ids = sinq_model.generate(
    **model_inputs,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    min_p=0.0,
    max_new_tokens=16384,
)

# Strip the prompt tokens and decode only the newly generated ones
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)
```
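
The sampling settings above (`temperature=0.7`, `top_p=0.8`, `top_k=20`, `min_p=0.0`) follow the generation parameters recommended for the base Qwen3-Next-80B-A3B-Instruct model; adjust them to your use case.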

<details>
<summary><span style="font-size:1.1em; font-weight:bold;">🧩 Quantization Process</span></summary>

The quantized model was obtained using the **SINQ** quantization library, following the steps below:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sinq.patch_model import AutoSINQHFModel
from sinq.sinqlinear import BaseQuantizeConfig

# Load the base model in its original precision
base_model_name = "Qwen/Qwen3-Next-80B-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Apply 3-bit SINQ quantization
quant_cfg = BaseQuantizeConfig(
    nbits=3,           # quantization bit-width
    group_size=64,     # group size
    tiling_mode="1D",  # tiling strategy
    method="sinq",     # quantization method ("asinq" for the calibrated version)
)

sinq_model = AutoSINQHFModel.quantize_model(
    model,
    tokenizer=tokenizer,
    quant_config=quant_cfg,
    compute_dtype=torch.bfloat16,
    device="cuda:0",
)
```
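
To persist the quantized weights in the safetensors layout that `from_quantized_safetensors` reads back (as in the usage example above), a save step along the following lines is needed. The method name `save_quantized_safetensors` is an assumption that mirrors the loading API; check the SINQ repository for the exact call:

```python
# Hypothetical save step: the method name mirrors the loading API
# `from_quantized_safetensors` and is an assumption, not a confirmed API.
save_dir = "Qwen3-Next-80B-A3B-Instruct-3bit-SINQ"
AutoSINQHFModel.save_quantized_safetensors(sinq_model, save_dir)
tokenizer.save_pretrained(save_dir)  # keep the tokenizer alongside the weights
```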

> **Reproducibility Note**: This model was quantized using the SINQ implementation from commit [`ee1dc76`](https://github.com/huawei-csl/SINQ/commit/ee1dc767ba6dc4b819841c3f89be2f50719aa72d) of the [SINQ](https://github.com/huawei-csl/SINQ) repository.

</details>

<br/>

---

# 🧾 How to Cite This Work

If you find **SINQ** useful in your research or applications, please:
- Star ⭐ the official [SINQ](https://github.com/huawei-csl/SINQ) GitHub repository.
- Cite our <a href="http://arxiv.org/abs/2509.22944" target="_blank"><strong>paper</strong></a>:

```bibtex
@misc{muller2025sinq,
  title={SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights},
  author={Lorenz K. Muller and Philippe Bich and Jiawei Zhuang and Ahmet Celik and Luca Benfenati and Lukas Cavigelli},
  year={2025},
  eprint={2509.22944},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={http://arxiv.org/abs/2509.22944}
}
```