Qwopus3.5-27B-v3-FP8-Dynamic
FP8 Dynamic quantized version of Jackrong/Qwopus3.5-27B-v3.
This checkpoint preserves the hybrid Qwen3.5 DeltaNet + softmax architecture and MTP (Multi-Token Prediction) head from the BF16 source, quantizing most linear layers to FP8 W8A8 while keeping the most sensitive projections and sidecar components in BF16.
Verified Inference
Local export and sanity-check evaluation were verified on 2026-04-07 on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB) with:
- transformers==5.3.0
- llm-compressor==0.14.1.dev24
- vllm==0.17.1
What was verified:
- FP8 export completed successfully via llm-compressor
- MTP weights are included in the main safetensors file
- The checkpoint loads in vLLM and generates correct output
- Quick perplexity sanity check: 7.67 (FineWeb-Edu, 50 samples)
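For readers unfamiliar with the metric, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows only that arithmetic; it is not the script used for the sanity check above, and the function name and toy inputs are illustrative.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), NLLs in nats."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Toy input: a model that assigns probability 1/4 to every token,
# so the perplexity comes out at (approximately) 4.
uniform_nlls = [math.log(4.0)] * 10
print(perplexity(uniform_nlls))
```

In a real evaluation the NLL values would come from the model's cross-entropy loss over the evaluation tokens.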
Quantization Strategy
Uniform FP8_DYNAMIC quantization using llm-compressor:
| Precision | Layers |
|---|---|
| FP8 W8A8 | most Linear layers (per-channel static weight scales, per-token dynamic input scales) |
| BF16 | lm_head, embed_tokens, self_attn.o_proj, DeltaNet linear_attn.out_proj, DeltaNet in_proj_a/in_proj_b, visual encoder, MTP sidecar |
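To make the two kinds of scale in the table concrete, here is a minimal pure-Python sketch of a W8A8 linear layer: weight scales are computed once per output channel, input scales are recomputed per token on every call. It simulates FP8 e4m3 rounding in software and is not how llm-compressor or vLLM implement it; all function names are illustrative.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the e4m3fn format


def round_to_e4m3(v):
    """Simulate rounding a float to the nearest FP8 e4m3 value (3 mantissa bits)."""
    if v == 0.0:
        return 0.0
    sign = -1.0 if v < 0 else 1.0
    mag = min(abs(v), FP8_E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so the binary exponent is e - 1.
    # Exponents below -6 fall into the subnormal range and share one step size.
    exp = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exp - 3)
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_rows(rows):
    """Quantize each row with its own scale so the row's max maps to FP8_E4M3_MAX."""
    q_rows, scales = [], []
    for row in rows:
        amax = max(abs(v) for v in row)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        q_rows.append([round_to_e4m3(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_linear(x, q_weight, w_scales):
    """y = x @ W.T with static per-channel weight scales and dynamic per-token input scales."""
    q_x, x_scales = quantize_rows(x)  # computed at run time, one scale per token
    return [
        [s_x * s_w * sum(a * b for a, b in zip(q_tok, q_row))
         for q_row, s_w in zip(q_weight, w_scales)]
        for q_tok, s_x in zip(q_x, x_scales)
    ]


# Offline step: quantize the weight once (rows are output channels).
weight = [[1.0, -0.5, 0.25, 2.0], [0.5, 0.5, -1.0, 0.25]]
q_weight, w_scales = quantize_rows(weight)

# Online step: activations are quantized per token on every call.
x = [[1.0, 2.0, 4.0, 0.5]]
print(w8a8_linear(x, q_weight, w_scales))
```

The toy values are chosen so they survive FP8 rounding exactly; with arbitrary inputs each element carries a small rounding error, which is why the more sensitive projections listed in the BF16 row are left unquantized.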
Architecture match with the BF16 source:
- model_type=qwen3_5
- 64 text layers (hybrid DeltaNet + softmax, full_attention_interval=4)
- mtp_num_hidden_layers=1
- max_position_embeddings=262144
- hidden_size=5120, intermediate_size=17408
- vocab_size=248320
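The hybrid layout implied by 64 text layers and full_attention_interval=4 can be sketched as follows. This assumes that every fourth layer (counting from 1) uses softmax attention and the rest use DeltaNet linear attention; the exact placement is an assumption and should be checked against the checkpoint's config.json.

```python
NUM_TEXT_LAYERS = 64
FULL_ATTENTION_INTERVAL = 4


def layer_types(num_layers, interval):
    # Assumption: every `interval`-th layer (1-based) uses softmax attention,
    # and all other layers use DeltaNet linear attention.
    return [
        "full_attention" if (i + 1) % interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


types = layer_types(NUM_TEXT_LAYERS, FULL_ATTENTION_INTERVAL)
print(types[:4])
print(types.count("full_attention"), types.count("linear_attention"))
```

Under that assumption the model has 16 softmax layers and 48 DeltaNet layers.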
Usage
vLLM
pip install -U "vllm>=0.17.0" "transformers>=5.3.0"
Standard serving:
vllm serve mconcat/Qwopus3.5-27B-v3-FP8-Dynamic \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3
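Once the server started by the command above is running, it can be queried through its OpenAI-compatible chat completions route. The following is a minimal standard-library client sketch; it assumes the server listens on vLLM's default http://localhost:8000, and the helper names are illustrative.

```python
import json
import urllib.request

MODEL = "mconcat/Qwopus3.5-27B-v3-FP8-Dynamic"
BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM's default host and port


def build_chat_request(prompt, max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("What is the capital of France?"))` sends one request and prints the reply.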
With MTP speculative decoding:
vllm serve mconcat/Qwopus3.5-27B-v3-FP8-Dynamic \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
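When launching from a script, the JSON passed to --speculative-config is easy to break with shell quoting. A small sketch that assembles the same command in Python and lets json.dumps produce the JSON; the helper name and its parameters are illustrative, while the flags and values are the ones from the command above.

```python
import json
import shlex


def vllm_serve_args(model, num_speculative_tokens=1, max_model_len=32768):
    """Assemble the `vllm serve` argv shown above, with JSON built by json.dumps."""
    spec = {"method": "mtp", "num_speculative_tokens": num_speculative_tokens}
    return [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", "0.85",
        "--max-num-seqs", "1",
        "--skip-mm-profiling",
        "--reasoning-parser", "qwen3",
        "--speculative-config", json.dumps(spec, separators=(",", ":")),
    ]


args = vllm_serve_args("mconcat/Qwopus3.5-27B-v3-FP8-Dynamic")
print(shlex.join(args))  # a shell-safe, copy-pasteable command line
```

The list can also be handed directly to `subprocess.run(args)`, which bypasses the shell entirely.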
Transformers
from transformers import AutoTokenizer, Qwen3_5ForConditionalGeneration
import torch

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "mconcat/Qwopus3.5-27B-v3-FP8-Dynamic",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "mconcat/Qwopus3.5-27B-v3-FP8-Dynamic",
    trust_remote_code=True,
)
Compatibility
| Framework | Supported | Notes |
|---|---|---|
| vLLM >= 0.17.0 | Yes | Verified with vllm==0.17.1 on Blackwell; MTP works |
| transformers >= 5.3.0 | Yes | Direct loading with device_map="auto" |
| SGLang | Unknown | Not verified |
Notes
- This export keeps self_attn.o_proj and DeltaNet linear_attn.out_proj in BF16 to preserve output projection fidelity.
- MTP weights are embedded in the main model.safetensors file (no separate model.mtp.safetensors).
- The model includes a vision encoder (loaded but unused for text-only inference). Use --skip-mm-profiling with vLLM to skip vision encoder profiling.
- Blackwell (SM120) note: If you encounter TMA-related crashes, apply the one-line vLLM patch to disable TMA on Blackwell: change >= 9 to 9 <= x < 12 in vllm/model_executor/layers/fla/ops/utils.py.
- KV cache: Do not use --kv-cache-dtype fp8_e4m3 with this model family: the checkpoint lacks calibrated KV scales and will produce degraded output. Use the default BF16 KV cache.
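The effect of the Blackwell patch described in the notes can be illustrated with a small sketch of the capability check before and after the change. The function names are hypothetical and this is not the actual vLLM source; it only shows what replacing >= 9 with 9 <= x < 12 does to the compute-capability major version.

```python
def tma_enabled_before(major):
    # Original gate as described in the note: any compute capability major >= 9.
    return major >= 9


def tma_enabled_after(major):
    # Patched gate: 9.x and 10.x stay enabled,
    # SM120 Blackwell parts (major 12) are excluded.
    return 9 <= major < 12


for major in (8, 9, 10, 12):
    print(major, tma_enabled_before(major), tma_enabled_after(major))
```

On a real machine the major version can be read with `torch.cuda.get_device_capability()[0]`.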