Thanks!

by lightenup - opened Aug 4, 2025

Discussion

lightenup

Aug 4, 2025

Superficial testing (Python and JavaScript codegen, software engineering practices) doesn't show any performance degradation compared to https://chat.z.ai/. It's a great quantization for 96 GB VRAM!

hareram241

Aug 7, 2025

Thanks, great results on a Blackwell 96 GB GPU, getting avg 80-90 t/s with 128k context size. Finally, Sonnet at home.

bakbeest

Aug 11, 2025

Echoing these thanks. This model and quant are great. Any chance you might also do the 4.5V model that was just released?

JunHowie

QuantTrio org Aug 12, 2025

Absolutely

JunHowie

QuantTrio org Aug 12, 2025

We are working on it. Stay tuned!

rainbyte

Aug 29, 2025

I have been able to run this model with 128k context using vLLM on 4x RTX 3090. Thank you very much!
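For readers who want to try a similar multi-GPU setup, here is a minimal sketch that assembles a `vllm serve` command line. The post does not say which options were actually used, so the values here are assumptions: a tensor-parallel size of 4 (one shard per GPU) and "128k" read as 131072 tokens. `build_vllm_serve_command` is a hypothetical helper name; `--tensor-parallel-size` and `--max-model-len` are standard vLLM server flags.

```python
import shlex


def build_vllm_serve_command(model, tensor_parallel_size, max_model_len):
    """Return the argument list for a `vllm serve` invocation (hypothetical helper)."""
    return [
        "vllm", "serve", model,
        # Shard the model across this many GPUs.
        "--tensor-parallel-size", str(tensor_parallel_size),
        # Maximum context length (prompt plus generated tokens).
        "--max-model-len", str(max_model_len),
    ]


# Assumed values: 4 GPUs, and "128k" interpreted as 128 * 1024 tokens.
command = build_vllm_serve_command(
    "QuantTrio/GLM-4.5-Air-AWQ-FP16Mix",
    tensor_parallel_size=4,
    max_model_len=128 * 1024,
)

# Print a shell-ready command line that can be copied into a terminal.
print(shlex.join(command))
```

Whether this fits in memory depends on the GPUs and on other settings the thread does not mention, so treat it as a starting point rather than a known-good configuration.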

hareram241

Aug 29, 2025

@rainbyte what tokens/second are you getting at 100k context?

rainbyte

Aug 29, 2025

@hareram241 I just tested loading part of a codebase in an LLM client, almost 100k context, and got this output in the vLLM logs:

Avg prompt throughput: 9528.4 tokens/s, Avg generation throughput: 22.6 tokens/s

Analyzing the input files took a while, and then the response ran at about half the usual tokens/sec.

Is that enough info? Should I test in some different/better way?
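To put the log line above in perspective, here is a rough sketch that parses it and turns the two rates into time estimates. The 100,000-token prompt matches the "almost 100k context" in the post; the 500-token reply length is a made-up value for illustration. vLLM reports these figures as averages over a logging interval, so the result is only a ballpark.

```python
import re

# Log line quoted in the post above.
LOG_LINE = (
    "Avg prompt throughput: 9528.4 tokens/s, "
    "Avg generation throughput: 22.6 tokens/s"
)

PATTERN = re.compile(
    r"Avg prompt throughput: (?P<prompt>[\d.]+) tokens/s, "
    r"Avg generation throughput: (?P<generation>[\d.]+) tokens/s"
)


def parse_throughput(line):
    """Return (prompt_tps, generation_tps) as floats, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return float(match["prompt"]), float(match["generation"])


def estimate_seconds(prompt_tokens, output_tokens, prompt_tps, generation_tps):
    """Estimate prefill and decode time in seconds from token counts and rates."""
    return prompt_tokens / prompt_tps, output_tokens / generation_tps


prompt_tps, generation_tps = parse_throughput(LOG_LINE)

# Assumed sizes: ~100k prompt tokens (from the post), 500 output tokens (hypothetical).
prefill_s, decode_s = estimate_seconds(100_000, 500, prompt_tps, generation_tps)
print(f"prefill ~{prefill_s:.1f} s, decode ~{decode_s:.1f} s")
```

At these rates, reading the prompt takes on the order of ten seconds, and generation speed dominates the wait for anything longer than a short reply.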

lsm03624

Nov 1, 2025

Which version of vLLM should be used with this quantized model for it to run properly? I’m using vLLM 0.11, but I’m getting KeyError: layers.1.mlp.experts.w2_weight. I’ve checked each of the weight files one by one, and they all match those specified in the documentation.

rainbyte

Nov 8, 2025

@lsm03624 what options are you using? It is working here with vLLM 0.11.
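When comparing setups like this, it helps to post exact package versions along with the launch options. Here is a small sketch that collects them using only the standard library. The package list is an assumption about what is relevant here, and `installed_version` and `environment_report` are hypothetical helper names.

```python
import platform
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages=("vllm", "torch", "transformers")):
    """Build a short plain-text report of the Python and package versions."""
    lines = [f"python: {platform.python_version()}"]
    for name in packages:
        lines.append(f"{name}: {installed_version(name) or 'not installed'}")
    return "\n".join(lines)


# Paste this output into the thread together with the full launch command.
print(environment_report())
```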
