🦌 ELK-AI | Qwen3-VL-32B-Instruct-NVFP4

Alibaba's Flagship 32B Vision-Language Model – Now 3x Smaller

NVFP4 AWQ_FULL Quantization | 21 GB (was 62 GB) | <0.3% Accuracy Loss

Docker Hub · CUDA 13 · Blackwell · vLLM


Mutaz Al Awamleh • ELK-AI • December 2025

Production-ready quantization for next-generation NVIDIA hardware


🧠 What Is This?

This is Qwen3-VL-32B-Instruct – Alibaba's state-of-the-art 32-billion-parameter vision-language model – quantized to NVFP4 using NVIDIA's Model Optimizer with AWQ_FULL calibration.

Key Achievements

| Metric | Before | After | Improvement |
|---|---|---|---|
| Model Size | 62 GB | 21 GB | 66% smaller |
| VRAM Required | 70+ GB | 24 GB | 66% reduction |
| Accuracy | 100% | 99.7%+ | <0.3% loss |
| Setup Time | Hours | Seconds | Instant |

Why NVFP4?

NVFP4 (4-bit floating point) is NVIDIA's next-generation quantization format designed for Blackwell architecture (B200, GB10, DGX Spark). Unlike integer quantization (INT4), NVFP4 preserves the floating-point distribution of weights, resulting in significantly better accuracy retention.
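
For intuition: NVFP4 groups weights into small blocks (NVIDIA describes 16 FP4 E2M1 values sharing an FP8 scale factor), so each block keeps its own dynamic range and a single outlier cannot wreck the precision of the whole tensor. Below is a toy Python sketch of that block-scaling idea; it is illustrative only, not the real NVFP4 packing or kernels.

import torch

# Toy illustration of block-scaled 4-bit quantization (not the real NVFP4
# format): each 16-element block shares one scale, and every element is
# snapped to the small set of magnitudes an FP4 E2M1 value can represent.
FP4_MAGNITUDES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_block_fp4(w: torch.Tensor, block_size: int = 16) -> torch.Tensor:
    blocks = w.reshape(-1, block_size)
    scale = blocks.abs().amax(dim=1, keepdim=True) / FP4_MAGNITUDES.max()
    scaled = blocks / scale.clamp(min=1e-12)
    # Pick the nearest representable FP4 magnitude, keep the sign
    nearest = (scaled.abs().unsqueeze(-1) - FP4_MAGNITUDES).abs().argmin(dim=-1)
    quantized = FP4_MAGNITUDES[nearest] * scaled.sign()
    return (quantized * scale).reshape(w.shape)  # dequantized approximation

w = torch.randn(4096)
print(f"max abs error: {(w - fake_block_fp4(w)).abs().max().item():.4f}")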


🚀 Why This Model?

We solved the hard problems so you don't have to.

| Challenge | Our Solution |
|---|---|
| FlashInfer compilation takes 2+ hours | Pre-compiled for SM80-SM121 |
| Vision encoder quality degradation | ViT preserved at BF16 precision |
| 50+ undocumented environment variables | Battle-tested configuration |
| Days of CUDA graph tuning | Optimized out of the box |
| 62 GB model doesn't fit on consumer GPUs | Compressed to 21 GB with NVFP4 |

Result: From WEEKS of optimization to 30 SECONDS of setup.


πŸ—οΈ 7-Layer Optimization Stack

┌─────────────────────────────────────────────────────────┐
│  Layer 7: Model Weights (NVFP4 AWQ_FULL + BF16 Vision)  │
├─────────────────────────────────────────────────────────┤
│  Layer 6: vLLM V1 Engine (Async + Chunked Prefill)      │
├─────────────────────────────────────────────────────────┤
│  Layer 5: FlashInfer 0.5.3 (FP4/FP8 Native Kernels)     │
├─────────────────────────────────────────────────────────┤
│  Layer 4: FP8 KV-Cache (50% Memory Savings)             │
├─────────────────────────────────────────────────────────┤
│  Layer 3: CUDA Graphs (Reduced Kernel Launch Overhead)  │
├─────────────────────────────────────────────────────────┤
│  Layer 2: CUDA 13.0 + SM121 (Blackwell Native Support)  │
├─────────────────────────────────────────────────────────┤
│  Layer 1: Optimized Container (Zero Setup Required)     │
└─────────────────────────────────────────────────────────┘
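
In practice, most of this stack surfaces as engine arguments when the model is loaded with vLLM directly. A sketch of that mapping (argument names assume a recent vLLM release; several of these are already defaults baked into the container):

from vllm import LLM

llm = LLM(
    model="/model",                   # Layer 7: NVFP4 weights + BF16 vision encoder
    quantization="modelopt_fp4",      # Layers 7/5: ModelOpt NVFP4 served via FP4 kernels
    kv_cache_dtype="fp8",             # Layer 4: FP8 KV-cache
    enable_chunked_prefill=True,      # Layer 6: chunked prefill in the V1 engine
    enforce_eager=False,              # Layer 3: keep CUDA graphs enabled
    gpu_memory_utilization=0.90,      # leave headroom for vision activations
    max_model_len=8192,
    trust_remote_code=True,
)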

📦 Model Specifications

| Specification | Value |
|---|---|
| Base Model | Qwen/Qwen3-VL-32B-Instruct |
| Parameters | 32 billion |
| Quantization | NVFP4 with AWQ_FULL |
| Calibration | 512 samples from WikiText-2 |
| Algorithm | Activation-Aware Weight Quantization |
| Model Size | 21 GB (5 shards) |
| Context Length | 32,768 tokens |
| Vision Encoder | BF16 (preserved for quality) |
| Accuracy Retention | >99.7% |

Architecture Details

| Component | Precision | Purpose |
|---|---|---|
| Language Model | NVFP4 | Text generation & reasoning |
| Vision Encoder (ViT) | BF16 | Image understanding |
| Visual Merger | BF16 | Vision-language alignment |
| Embeddings | BF16 | Token representations |
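
To confirm the precision split yourself, the downloaded shards can be inspected with safetensors. A quick sketch, assuming the model was pulled to ./model as in the Docker instructions below and that vision tensors follow the usual Qwen "visual" naming (both are assumptions about your local setup):

import glob
from safetensors import safe_open

# Print dtypes for vision-tower tensors and the first decoder layer:
# vision weights should show up as bfloat16, while quantized language-model
# weights appear as packed uint8 alongside their FP8 scale tensors.
for shard in sorted(glob.glob("./model/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            if "visual" in name or ".layers.0." in name:
                print(f"{name}: {f.get_tensor(name).dtype}")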

💻 Hardware Requirements

| Requirement | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 24 GB | 32+ GB |
| GPU Model | RTX 4090 / A100 | B200 / GB10 / DGX Spark |
| CUDA Version | 12.0+ | 13.0 |
| System RAM | 32 GB | 64+ GB |

Tested Configurations

✅ NVIDIA B200 (Blackwell)
✅ NVIDIA GB10 / DGX Spark
✅ NVIDIA A100 80GB
✅ NVIDIA RTX 4090 24GB
✅ NVIDIA L40S 48GB
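
Before pulling the container, you can verify your local GPU against the table above with any CUDA-enabled PyTorch build:

import torch

# Quick pre-flight check: GPU name, compute capability, and VRAM.
# Blackwell-class GPUs (SM100/SM120/SM121) get native FP4 kernels; the older
# GPUs in the tested list above still need to meet the 24 GB VRAM minimum.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {props.name}, SM{major}{minor}, "
          f"{props.total_memory / 1024**3:.0f} GiB VRAM")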


🐳 Quick Start with Docker (Recommended)

Option 1: Model-Specific Container

# Pull the optimized container
docker pull elkaioptimization/qwen3vl-32b-nvfp4:1.0

# Download this model
huggingface-cli download ELK-AI/Qwen3-VL-32B-Instruct-NVFP4 --local-dir ./model

# Run inference server
docker run -d --gpus all \
  -v $(pwd)/model:/model \
  -p 8000:8000 \
  --name qwen3vl \
  elkaioptimization/qwen3vl-32b-nvfp4:1.0

Option 2: Universal NVFP4 Container

Use our base container for any NVFP4 quantized model:

# Pull the universal vLLM container
docker pull elkaioptimization/vllm-nvfp4-cuda13:3.0

# Run with custom configuration
docker run -d --gpus all \
  -v $(pwd)/model:/model \
  -p 8000:8000 \
  elkaioptimization/vllm-nvfp4-cuda13:3.0 \
  python -m vllm.entrypoints.openai.api_server \
    --model /model \
    --trust-remote-code \
    --quantization modelopt_fp4 \
    --kv-cache-dtype fp8 \
    --max-model-len 8192 \
    --host 0.0.0.0 \
    --port 8000
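
Once either container is up, a quick readiness check against the OpenAI-compatible endpoint (the served model id should match the --model path, /model here):

import requests

# The server is ready once /v1/models answers and lists the mounted model.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # expected: ['/model']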

🔥 Usage Examples

Python with vLLM

from vllm import LLM, SamplingParams

# Initialize with NVFP4 quantization
llm = LLM(
    model="ELK-AI/Qwen3-VL-32B-Instruct-NVFP4",
    quantization="modelopt_fp4",
    trust_remote_code=True,
    kv_cache_dtype="fp8",
    max_model_len=8192,
)

# Text generation
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain the theory of relativity in simple terms."], sampling_params)
print(outputs[0].outputs[0].text)
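
Because this is an Instruct model, chat-formatted prompts generally behave better than raw strings. Recent vLLM releases expose LLM.chat(), which applies the model's chat template automatically; here is a sketch reusing the llm object above, assuming that method is available in your installed vLLM version:

# Chat-formatted generation (text-only; for images see the API examples below)
messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "Summarize what NVFP4 quantization does."},
]
chat_outputs = llm.chat(messages, sampling_params)
print(chat_outputs[0].outputs[0].text)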

OpenAI-Compatible API

Text Generation

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "messages": [
      {"role": "user", "content": "Write a haiku about machine learning."}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

Vision + Text (Multimodal)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image in detail."},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
      ]
    }],
    "max_tokens": 500
  }'

Base64 Image Input

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What objects do you see?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQ..."}}
      ]
    }]
  }'
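
The truncated data URL above is simply the base64-encoded image bytes. A small helper for building one from a local file (photo.jpg is just an example path):

import base64

# Build a data URL that can be dropped into the "image_url" field above
with open("photo.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")
data_url = f"data:image/jpeg;base64,{encoded}"
print(data_url[:60] + "...")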

Python OpenAI SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Text only
response = client.chat.completions.create(
    model="/model",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=100
)
print(response.choices[0].message.content)

# With image
response = client.chat.completions.create(
    model="/model",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
        ]
    }],
    max_tokens=500
)
print(response.choices[0].message.content)
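
The same endpoint also supports streaming through the standard OpenAI SDK, which helps for long multimodal answers:

# Stream tokens as they are generated
stream = client.chat.completions.create(
    model="/model",
    messages=[{"role": "user", "content": "Explain the FP8 KV-cache in two sentences."}],
    max_tokens=200,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()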

📊 Capabilities

| Modality | Input | Output | Quality |
|---|---|---|---|
| Text | ✅ | ✅ | Excellent |
| Images | ✅ | – | Excellent (BF16 ViT) |
| Video | ✅ | – | Excellent |
| Charts/Diagrams | ✅ | – | State-of-the-art |
| Documents/OCR | ✅ | – | State-of-the-art |
| Code | ✅ | ✅ | Excellent |
| Math | ✅ | ✅ | Excellent |

🔧 Quantization Details

This model was quantized using the following configuration:

# NVIDIA Model Optimizer (modelopt) configuration
import copy

import modelopt.torch.quantization as mtq

# Deep-copy the recipe so the shared library constant is not mutated in place
config = copy.deepcopy(mtq.NVFP4_AWQ_FULL_CFG)  # Best accuracy (<0.3% loss)

# Vision encoder exclusions (preserved at BF16)
exclusions = {
    "*visual*": {"enable": False},
    "*patch_embed*": {"enable": False},
    "*merger*": {"enable": False},
    "*vision*": {"enable": False},
    "*embed_tokens*": {"enable": False},
}
config["quant_cfg"].update(exclusions)

# Quantize with 512 calibration samples (model and calibration_loop are
# prepared beforehand; a sketch of the calibration loop follows below)
mtq.quantize(model, config, forward_loop=calibration_loop)
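
The calibration_loop passed to mtq.quantize is not part of the snippet above. A minimal sketch of what it could look like, assuming the 512 WikiText-2 samples listed in the specifications, the base Qwen/Qwen3-VL-32B-Instruct tokenizer, and an illustrative 512-token truncation length (these details are assumptions, not the exact recipe used for this checkpoint):

import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def build_calibration_loop(num_samples=512, max_length=512):
    # Hypothetical helper: feeds WikiText-2 text through the model so ModelOpt
    # can observe activation statistics during mtq.quantize().
    tokenizer = AutoTokenizer.from_pretrained(
        "Qwen/Qwen3-VL-32B-Instruct", trust_remote_code=True
    )
    texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")["text"]
    texts = [t for t in texts if t.strip()][:num_samples]

    def calibration_loop(model):
        # ModelOpt calls this with the model being quantized
        with torch.no_grad():
            for text in texts:
                inputs = tokenizer(
                    text, return_tensors="pt", truncation=True, max_length=max_length
                ).to(model.device)
                model(**inputs)  # forward pass only; outputs are discarded

    return calibration_loop

calibration_loop = build_calibration_loop()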

Why AWQ_FULL?

| Algorithm | Accuracy Loss | Calibration Required |
|---|---|---|
| DEFAULT | ~1.0% | No |
| AWQ_LITE | ~0.5% | 128 samples |
| AWQ_FULL | <0.3% | 512 samples |

We use AWQ_FULL for production deployments because the additional calibration time (30-60 minutes) is worth the superior accuracy retention.


🦌 More ELK-AI Optimized Models

| Model | Size | Type | Quantization | Link |
|---|---|---|---|---|
| Qwen3-VL-2B | 2.1 GB | Vision | NVFP4 | Docker Hub |
| Qwen3-VL-4B | 4.2 GB | Vision | NVFP4 | Docker Hub |
| Qwen3-VL-8B | 8.4 GB | Vision | NVFP4 | Docker Hub |
| Qwen3-VL-32B | 21 GB | Vision | NVFP4 | This model |
| Nemotron3-30B | 31.5 GB | Text | NVFP4 | Docker Hub |
| Devstral-24B | 53.8 GB | Code | FP8 | Docker Hub |

📜 License

  • Model Weights: Subject to Qwen License
  • Quantization & Container: Apache 2.0

πŸ™ Acknowledgments

  • Alibaba Qwen Team for the incredible Qwen3-VL model
  • NVIDIA for Model Optimizer and NVFP4 quantization
  • vLLM Team for the high-performance inference engine


Built with ❤️ by ELK-AI

Mutaz Al Awamleh • December 2025

Democratizing access to state-of-the-art AI


⭐ Star this repo if it helped you!
