ELK-AI | Qwen3-VL-32B-Instruct-NVFP4
Alibaba's Flagship 32B Vision-Language Model – Now 3x Smaller
NVFP4 AWQ_FULL Quantization | 21 GB (was 62 GB) | <0.3% Accuracy Loss
Mutaz Al Awamleh • ELK-AI • December 2025
Production-ready quantization for next-generation NVIDIA hardware
What Is This?
This is Qwen3-VL-32B-Instruct, Alibaba's state-of-the-art 32-billion-parameter vision-language model, quantized to NVFP4 using NVIDIA's Model Optimizer with AWQ_FULL calibration.
Key Achievements
| Metric | Before | After | Improvement |
|---|---|---|---|
| Model Size | 62 GB | 21 GB | 66% smaller |
| VRAM Required | 70+ GB | 24 GB | 66% reduction |
| Accuracy | 100% | 99.7%+ | <0.3% loss |
| Setup Time | Hours | Seconds | Instant |
Why NVFP4?
NVFP4 (4-bit floating point) is NVIDIA's next-generation quantization format, designed for the Blackwell architecture (B200, GB10, DGX Spark). Unlike integer quantization (INT4), NVFP4 preserves the floating-point distribution of the weights, which yields significantly better accuracy retention.
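To build intuition for what block-scaled FP4 does to a weight tensor, here is a minimal, self-contained sketch: weights are split into small blocks, each block gets its own scale, and the scaled values snap to the nearest FP4 (E2M1) level. This is purely illustrative and deliberately omits details of the production format (the FP8 encoding of the block scales, the second-level per-tensor scale, and the packed storage layout used by Model Optimizer).

```python
import numpy as np

# Representable magnitudes of an FP4 E2M1 value (sign handled separately).
E2M1_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(weights: np.ndarray, block_size: int = 16) -> np.ndarray:
    """Simulated ("fake") block-scaled FP4 quantize-dequantize round trip."""
    flat = weights.reshape(-1, block_size)
    # One scale per block so the largest value in the block maps to 6.0,
    # the largest E2M1 magnitude.
    scales = np.abs(flat).max(axis=1, keepdims=True) / 6.0
    scales = np.where(scales == 0, 1.0, scales)
    scaled = flat / scales
    # Snap each scaled value to the nearest representable FP4 level.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_LEVELS).argmin(axis=-1)
    quantized = np.sign(scaled) * E2M1_LEVELS[idx]
    return (quantized * scales).reshape(weights.shape)

w = np.random.randn(4, 64).astype(np.float32)
w_q = fake_quant_fp4(w)
print("mean abs error:", np.abs(w - w_q).mean())
```

Storing one small scale per block is what keeps the effective footprint of the real format close to 4.5 bits per weight while preserving per-block dynamic range.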
Why This Model?
We solved the hard problems so you don't have to.
| Challenge | Our Solution |
|---|---|
| FlashInfer compilation takes 2+ hours | Pre-compiled for SM80-SM121 |
| Vision encoder quality degradation | ViT preserved at BF16 precision |
| 50+ undocumented environment variables | Battle-tested configuration |
| Days of CUDA graph tuning | Optimized out of the box |
| 62GB model doesn't fit on consumer GPUs | Compressed to 21GB with NVFP4 |
Result: From WEEKS of optimization to 30 SECONDS of setup.
7-Layer Optimization Stack
```text
┌────────────────────────────────────────────────────────┐
│ Layer 7: Model Weights (NVFP4 AWQ_FULL + BF16 Vision)  │
├────────────────────────────────────────────────────────┤
│ Layer 6: vLLM V1 Engine (Async + Chunked Prefill)      │
├────────────────────────────────────────────────────────┤
│ Layer 5: FlashInfer 0.5.3 (FP4/FP8 Native Kernels)     │
├────────────────────────────────────────────────────────┤
│ Layer 4: FP8 KV-Cache (50% Memory Savings)             │
├────────────────────────────────────────────────────────┤
│ Layer 3: CUDA Graphs (Reduced Kernel Launch Overhead)  │
├────────────────────────────────────────────────────────┤
│ Layer 2: CUDA 13.0 + SM121 (Blackwell Native Support)  │
├────────────────────────────────────────────────────────┤
│ Layer 1: Optimized Container (Zero Setup Required)     │
└────────────────────────────────────────────────────────┘
```
Model Specifications
| Specification | Value |
|---|---|
| Base Model | Qwen/Qwen3-VL-32B-Instruct |
| Parameters | 32 Billion |
| Quantization | NVFP4 with AWQ_FULL |
| Calibration | 512 samples from WikiText-2 |
| Algorithm | Activation-Aware Weight Quantization |
| Model Size | 21 GB (5 shards) |
| Context Length | 32,768 tokens |
| Vision Encoder | BF16 (preserved for quality) |
| Accuracy Retention | >99.7% |
Architecture Details
| Component | Precision | Purpose |
|---|---|---|
| Language Model | NVFP4 | Text generation & reasoning |
| Vision Encoder (ViT) | BF16 | Image understanding |
| Visual Merger | BF16 | Vision-language alignment |
| Embeddings | BF16 | Token representations |
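If you want to confirm this precision split on your own download, the safetensors headers record a dtype per tensor and can be read without loading any weights. The sketch below is a convenience script, not part of the release: the `./model` path is whatever directory you downloaded the shards into, and the `visual`/`merger` name patterns mirror the exclusion patterns shown in the quantization config further down.

```python
import json
import struct
from pathlib import Path

# Each .safetensors shard starts with an 8-byte little-endian header length,
# followed by a JSON header mapping tensor names to dtype/shape/offsets.
def shard_dtypes(shard: Path) -> dict:
    with shard.open("rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
    return {name: meta["dtype"] for name, meta in header.items() if name != "__metadata__"}

model_dir = Path("./model")  # wherever the shards were downloaded
for shard in sorted(model_dir.glob("*.safetensors")):
    for name, dtype in shard_dtypes(shard).items():
        if "visual" in name or "merger" in name:
            print(f"{dtype:>8}  {name}")  # vision-side tensors should report BF16
```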
Hardware Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 24 GB | 32+ GB |
| GPU Model | RTX 4090 / A100 | B200 / GB10 / DGX Spark |
| CUDA Version | 12.0+ | 13.0 |
| System RAM | 32 GB | 64+ GB |
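A rough way to sanity-check the 24 GB minimum is to add the weight footprint to an FP8 KV-cache estimate. The layer count, KV-head count, and head dimension in the sketch below are illustrative placeholders, not published values; read the real ones from the model's config.json before relying on the result.

```python
# Back-of-the-envelope VRAM budget for weights + FP8 KV cache.
# NOTE: the layer/head numbers below are illustrative placeholders; take the
# actual values from the model's config.json.
num_layers = 64        # placeholder
num_kv_heads = 8       # placeholder (GQA)
head_dim = 128         # placeholder
bytes_per_value = 1    # FP8 KV cache

kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # K and V
context_tokens = 8192  # matches --max-model-len in the examples below

weights_gib = 21
kv_cache_gib = kv_bytes_per_token * context_tokens / 1024**3
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"KV cache @ {context_tokens} tokens: {kv_cache_gib:.2f} GiB")
print(f"Weights + KV cache: {weights_gib + kv_cache_gib:.1f} GiB (plus activations/overhead)")
```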
Tested Configurations
- ✅ NVIDIA B200 (Blackwell)
- ✅ NVIDIA GB10 / DGX Spark
- ✅ NVIDIA A100 80GB
- ✅ NVIDIA RTX 4090 24GB
- ✅ NVIDIA L40S 48GB
Quick Start with Docker (Recommended)
Option 1: Model-Specific Container
```bash
# Pull the optimized container
docker pull elkaioptimization/qwen3vl-32b-nvfp4:1.0

# Download this model
huggingface-cli download ELK-AI/Qwen3-VL-32B-Instruct-NVFP4 --local-dir ./model

# Run inference server
docker run -d --gpus all \
  -v $(pwd)/model:/model \
  -p 8000:8000 \
  --name qwen3vl \
  elkaioptimization/qwen3vl-32b-nvfp4:1.0
```
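Once the container is up, you can wait for the weights to finish loading by polling the OpenAI-compatible /v1/models route that vLLM exposes. A small Python helper (requires the requests package; the port matches the docker run above):

```python
import time
import requests

# Poll the OpenAI-compatible endpoint until the server has loaded the weights.
url = "http://localhost:8000/v1/models"
for _ in range(120):
    try:
        r = requests.get(url, timeout=2)
        if r.ok:
            print("Server ready:", [m["id"] for m in r.json()["data"]])
            break
    except requests.ConnectionError:
        pass
    time.sleep(5)
else:
    raise RuntimeError("server did not become ready in time")
```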
Option 2: Universal NVFP4 Container
Use our base container for any NVFP4 quantized model:
```bash
# Pull the universal vLLM container
docker pull elkaioptimization/vllm-nvfp4-cuda13:3.0

# Run with custom configuration
docker run -d --gpus all \
  -v $(pwd)/model:/model \
  -p 8000:8000 \
  elkaioptimization/vllm-nvfp4-cuda13:3.0 \
  python -m vllm.entrypoints.openai.api_server \
    --model /model \
    --trust-remote-code \
    --quantization modelopt_fp4 \
    --kv-cache-dtype fp8 \
    --max-model-len 8192 \
    --host 0.0.0.0 \
    --port 8000
```
Usage Examples
Python with vLLM
```python
from vllm import LLM, SamplingParams

# Initialize with NVFP4 quantization
llm = LLM(
    model="ELK-AI/Qwen3-VL-32B-Instruct-NVFP4",
    quantization="modelopt_fp4",
    trust_remote_code=True,
    kv_cache_dtype="fp8",
    max_model_len=8192,
)

# Text generation
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain the theory of relativity in simple terms."], sampling_params)
print(outputs[0].outputs[0].text)
```
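For images, recent vLLM builds let you pass OpenAI-style multimodal messages to LLM.chat(). A minimal sketch continuing from the llm and sampling_params objects above; the image URL is a placeholder, and the exact chat API may differ slightly between vLLM versions:

```python
# Multimodal sketch: pass an image by URL using OpenAI-style chat messages.
# Assumes a recent vLLM release where LLM.chat() accepts multimodal content.
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image in one paragraph."},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
    ],
}]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```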
OpenAI-Compatible API
Text Generation
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "messages": [
      {"role": "user", "content": "Write a haiku about machine learning."}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
Vision + Text (Multimodal)
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image in detail."},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
      ]
    }],
    "max_tokens": 500
  }'
```
Base64 Image Input
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What objects do you see?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQ..."}}
      ]
    }]
  }'
```
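If you need to build that data URL from a local file, a few lines of Python are enough; photo.jpg below is a placeholder path:

```python
import base64
from pathlib import Path

# Build the "data:image/jpeg;base64,..." URL used in the request above.
image_bytes = Path("photo.jpg").read_bytes()  # placeholder file
data_url = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode("ascii")
print(data_url[:60] + "...")
```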
Python OpenAI SDK
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Text only
response = client.chat.completions.create(
    model="/model",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)

# With image
response = client.chat.completions.create(
    model="/model",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
    max_tokens=500,
)
print(response.choices[0].message.content)
```
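The same client also supports token-by-token streaming through the standard OpenAI SDK interface; a short sketch against the endpoint configured above:

```python
# Streaming sketch: print tokens as they arrive instead of waiting for the
# full completion. Reuses the `client` created above.
stream = client.chat.completions.create(
    model="/model",
    messages=[{"role": "user", "content": "Summarize NVFP4 in two sentences."}],
    max_tokens=200,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```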
Capabilities
| Modality | Input | Output | Quality |
|---|---|---|---|
| Text | ✅ | ✅ | Excellent |
| Images | ✅ | ❌ | Excellent (BF16 ViT) |
| Video | ✅ | ❌ | Excellent |
| Charts/Diagrams | ✅ | ❌ | State-of-the-art |
| Documents/OCR | ✅ | ❌ | State-of-the-art |
| Code | ✅ | ✅ | Excellent |
| Math | ✅ | ✅ | Excellent |
Quantization Details
This model was quantized using the following configuration:
```python
# NVIDIA Model Optimizer (modelopt) configuration
import modelopt.torch.quantization as mtq

config = mtq.NVFP4_AWQ_FULL_CFG  # Best accuracy (<0.3% loss)

# Vision encoder exclusions (preserved at BF16)
exclusions = {
    "*visual*": {"enable": False},
    "*patch_embed*": {"enable": False},
    "*merger*": {"enable": False},
    "*vision*": {"enable": False},
    "*embed_tokens*": {"enable": False},
}
config["quant_cfg"].update(exclusions)

# Quantize with 512 calibration samples
mtq.quantize(model, config, forward_loop=calibration_loop)
```
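The `calibration_loop` referenced above is not shown in the snippet. The sketch below illustrates what such a forward loop could look like, assuming `model` and `tokenizer` are already loaded with Hugging Face Transformers and the datasets library is available; batching, sequence length, and text filtering are illustrative choices, only the 512-sample count comes from this card.

```python
import torch
from datasets import load_dataset

# Sketch of the `calibration_loop` passed to mtq.quantize() above.
# Assumes `model` and `tokenizer` are already loaded elsewhere.
def calibration_loop(model):
    data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    texts = [t for t in data["text"] if t.strip()][:512]  # 512 calibration samples
    with torch.no_grad():
        for i in range(0, len(texts), 8):
            batch = tokenizer(
                texts[i : i + 8],
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=512,
            ).to(model.device)
            model(**batch)  # forward pass only; activations drive AWQ statistics
```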
Why AWQ_FULL?
| Algorithm | Accuracy Loss | Calibration Required |
|---|---|---|
| DEFAULT | ~1.0% | No |
| AWQ_LITE | ~0.5% | 128 samples |
| AWQ_FULL | <0.3% | 512 samples |
We use AWQ_FULL for production deployments because the additional calibration time (30-60 minutes) is worth the superior accuracy retention.
More ELK-AI Optimized Models
| Model | Size | Type | Quantization | Link |
|---|---|---|---|---|
| Qwen3-VL-2B | 2.1 GB | Vision | NVFP4 | Docker Hub |
| Qwen3-VL-4B | 4.2 GB | Vision | NVFP4 | Docker Hub |
| Qwen3-VL-8B | 8.4 GB | Vision | NVFP4 | Docker Hub |
| Qwen3-VL-32B | 21 GB | Vision | NVFP4 | This model |
| Nemotron3-30B | 31.5 GB | Text | NVFP4 | Docker Hub |
| Devstral-24B | 53.8 GB | Code | FP8 | Docker Hub |
License
- Model Weights: Subject to Qwen License
- Quantization & Container: Apache 2.0
Acknowledgments
- Alibaba Qwen Team for the incredible Qwen3-VL model
- NVIDIA for Model Optimizer and NVFP4 quantization
- vLLM Team for the high-performance inference engine
Built with ❤️ by ELK-AI
Mutaz Al Awamleh β’ December 2025
Democratizing access to state-of-the-art AI
⭐ Star this repo if it helped you!