🦌 ELK-AI | Qwen3-VL-32B-Instruct-NVFP4

Alibaba's Flagship 32B Vision-Language Model – Now 3x Smaller

NVFP4 AWQ_FULL Quantization | 21 GB (was 62 GB) | <0.3% Accuracy Loss

Docker Hub · CUDA 13 · Blackwell · vLLM


Mutaz Al Awamleh • ELK-AI • December 2025

Production-ready quantization for next-generation NVIDIA hardware


🧠 What Is This?

This is Qwen3-VL-32B-Instruct – Alibaba's state-of-the-art 32-billion-parameter vision-language model – quantized to NVFP4 using NVIDIA's Model Optimizer with AWQ_FULL calibration.

Key Achievements

| Metric | Before | After | Improvement |
|---|---|---|---|
| Model Size | 62 GB | 21 GB | 66% smaller |
| VRAM Required | 70+ GB | 24 GB | 66% reduction |
| Accuracy | 100% | 99.7%+ | <0.3% loss |
| Setup Time | Hours | Seconds | Instant |

Why NVFP4?

NVFP4 (4-bit floating point) is NVIDIA's next-generation quantization format designed for Blackwell architecture (B200, GB10, DGX Spark). Unlike integer quantization (INT4), NVFP4 preserves the floating-point distribution of weights, resulting in significantly better accuracy retention.
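
For intuition: NVFP4 groups weights into small blocks (NVIDIA describes 16 FP4 E2M1 values sharing an FP8 scale factor), so each block keeps its own dynamic range and a single outlier cannot wreck the precision of the whole tensor. Below is a toy Python sketch of that block-scaling idea; it is illustrative only, not the real NVFP4 packing or kernels.

import torch

# Toy illustration of block-scaled 4-bit quantization (not the real NVFP4
# format): each 16-element block shares one scale, and every element is
# snapped to the small set of magnitudes an FP4 E2M1 value can represent.
FP4_MAGNITUDES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_block_fp4(w: torch.Tensor, block_size: int = 16) -> torch.Tensor:
    blocks = w.reshape(-1, block_size)
    scale = blocks.abs().amax(dim=1, keepdim=True) / FP4_MAGNITUDES.max()
    scaled = blocks / scale.clamp(min=1e-12)
    # Pick the nearest representable FP4 magnitude, keep the sign
    nearest = (scaled.abs().unsqueeze(-1) - FP4_MAGNITUDES).abs().argmin(dim=-1)
    quantized = FP4_MAGNITUDES[nearest] * scaled.sign()
    return (quantized * scale).reshape(w.shape)  # dequantized approximation

w = torch.randn(4096)
print(f"max abs error: {(w - fake_block_fp4(w)).abs().max().item():.4f}")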


🚀 Why This Model?

We solved the hard problems so you don't have to.

| Challenge | Our Solution |
|---|---|
| FlashInfer compilation takes 2+ hours | Pre-compiled for SM80-SM121 |
| Vision encoder quality degradation | ViT preserved at BF16 precision |
| 50+ undocumented environment variables | Battle-tested configuration |
| Days of CUDA graph tuning | Optimized out of the box |
| 62 GB model doesn't fit on consumer GPUs | Compressed to 21 GB with NVFP4 |

Result: From WEEKS of optimization to 30 SECONDS of setup.


πŸ—οΈ 7-Layer Optimization Stack

┌─────────────────────────────────────────────────────────┐
│  Layer 7: Model Weights (NVFP4 AWQ_FULL + BF16 Vision)  │
├─────────────────────────────────────────────────────────┤
│  Layer 6: vLLM V1 Engine (Async + Chunked Prefill)      │
├─────────────────────────────────────────────────────────┤
│  Layer 5: FlashInfer 0.5.3 (FP4/FP8 Native Kernels)     │
├─────────────────────────────────────────────────────────┤
│  Layer 4: FP8 KV-Cache (50% Memory Savings)             │
├─────────────────────────────────────────────────────────┤
│  Layer 3: CUDA Graphs (Reduced Kernel Launch Overhead)  │
├─────────────────────────────────────────────────────────┤
│  Layer 2: CUDA 13.0 + SM121 (Blackwell Native Support)  │
├─────────────────────────────────────────────────────────┤
│  Layer 1: Optimized Container (Zero Setup Required)     │
└─────────────────────────────────────────────────────────┘
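
In practice, most of this stack surfaces as engine arguments when the model is loaded with vLLM directly. A sketch of that mapping (argument names assume a recent vLLM release; several of these are already defaults baked into the container):

from vllm import LLM

llm = LLM(
    model="/model",                   # Layer 7: NVFP4 weights + BF16 vision encoder
    quantization="modelopt_fp4",      # Layers 7/5: ModelOpt NVFP4 served via FP4 kernels
    kv_cache_dtype="fp8",             # Layer 4: FP8 KV-cache
    enable_chunked_prefill=True,      # Layer 6: chunked prefill in the V1 engine
    enforce_eager=False,              # Layer 3: keep CUDA graphs enabled
    gpu_memory_utilization=0.90,      # leave headroom for vision activations
    max_model_len=8192,
    trust_remote_code=True,
)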

📦 Model Specifications

| Specification | Value |
|---|---|
| Base Model | Qwen/Qwen3-VL-32B-Instruct |
| Parameters | 32 billion |
| Quantization | NVFP4 with AWQ_FULL |
| Calibration | 512 samples from WikiText-2 |
| Algorithm | Activation-Aware Weight Quantization |
| Model Size | 21 GB (5 shards) |
| Context Length | 32,768 tokens |
| Vision Encoder | BF16 (preserved for quality) |
| Accuracy Retention | >99.7% |

Architecture Details

| Component | Precision | Purpose |
|---|---|---|
| Language Model | NVFP4 | Text generation & reasoning |
| Vision Encoder (ViT) | BF16 | Image understanding |
| Visual Merger | BF16 | Vision-language alignment |
| Embeddings | BF16 | Token representations |
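
To confirm the precision split yourself, the downloaded shards can be inspected with safetensors. A quick sketch, assuming the model was pulled to ./model as in the Docker instructions below and that vision tensors follow the usual Qwen "visual" naming (both are assumptions about your local setup):

import glob
from safetensors import safe_open

# Print dtypes for vision-tower tensors and the first decoder layer:
# vision weights should show up as bfloat16, while quantized language-model
# weights appear as packed uint8 alongside their FP8 scale tensors.
for shard in sorted(glob.glob("./model/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            if "visual" in name or ".layers.0." in name:
                print(f"{name}: {f.get_tensor(name).dtype}")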

💻 Hardware Requirements

| Requirement | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 24 GB | 32+ GB |
| GPU Model | RTX 4090 / A100 | B200 / GB10 / DGX Spark |
| CUDA Version | 12.0+ | 13.0 |
| System RAM | 32 GB | 64+ GB |

Tested Configurations

✅ NVIDIA B200 (Blackwell)
✅ NVIDIA GB10 / DGX Spark
✅ NVIDIA A100 80GB
✅ NVIDIA RTX 4090 24GB
✅ NVIDIA L40S 48GB
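
Before pulling the container, you can verify your local GPU against the table above with any CUDA-enabled PyTorch build:

import torch

# Quick pre-flight check: GPU name, compute capability, and VRAM.
# Blackwell-class GPUs (SM100/SM120/SM121) get native FP4 kernels; the older
# GPUs in the tested list above still need to meet the 24 GB VRAM minimum.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {props.name}, SM{major}{minor}, "
          f"{props.total_memory / 1024**3:.0f} GiB VRAM")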


🐳 Quick Start with Docker (Recommended)

Option 1: Model-Specific Container

# Pull the optimized container
docker pull elkaioptimization/qwen3vl-32b-nvfp4:1.0

# Download this model
huggingface-cli download ELK-AI/Qwen3-VL-32B-Instruct-NVFP4 --local-dir ./model

# Run inference server
docker run -d --gpus all \
  -v $(pwd)/model:/model \
  -p 8000:8000 \
  --name qwen3vl \
  elkaioptimization/qwen3vl-32b-nvfp4:1.0

Option 2: Universal NVFP4 Container

Use our base container for any NVFP4 quantized model:

# Pull the universal vLLM container
docker pull elkaioptimization/vllm-nvfp4-cuda13:3.0

# Run with custom configuration
docker run -d --gpus all \
  -v $(pwd)/model:/model \
  -p 8000:8000 \
  elkaioptimization/vllm-nvfp4-cuda13:3.0 \
  python -m vllm.entrypoints.openai.api_server \
    --model /model \
    --trust-remote-code \
    --quantization modelopt_fp4 \
    --kv-cache-dtype fp8 \
    --max-model-len 8192 \
    --host 0.0.0.0 \
    --port 8000
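
Once either container is up, a quick readiness check against the OpenAI-compatible endpoint (the served model id should match the --model path, /model here):

import requests

# The server is ready once /v1/models answers and lists the mounted model.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # expected: ['/model']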

🔥 Usage Examples

Python with vLLM

from vllm import LLM, SamplingParams

# Initialize with NVFP4 quantization
llm = LLM(
    model="ELK-AI/Qwen3-VL-32B-Instruct-NVFP4",
    quantization="modelopt_fp4",
    trust_remote_code=True,
    kv_cache_dtype="fp8",
    max_model_len=8192,
)

# Text generation
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain the theory of relativity in simple terms."], sampling_params)
print(outputs[0].outputs[0].text)
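
Because this is an Instruct model, chat-formatted prompts generally behave better than raw strings. Recent vLLM releases expose LLM.chat(), which applies the model's chat template automatically; here is a sketch reusing the llm object above, assuming that method is available in your installed vLLM version:

# Chat-formatted generation (text-only; for images see the API examples below)
messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "Summarize what NVFP4 quantization does."},
]
chat_outputs = llm.chat(messages, sampling_params)
print(chat_outputs[0].outputs[0].text)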

OpenAI-Compatible API

Text Generation

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "messages": [
      {"role": "user", "content": "Write a haiku about machine learning."}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

Vision + Text (Multimodal)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image in detail."},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
      ]
    }],
    "max_tokens": 500
  }'

Base64 Image Input

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What objects do you see?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQ..."}}
      ]
    }]
  }'
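
The truncated data URL above is simply the base64-encoded image bytes. A small helper for building one from a local file (photo.jpg is just an example path):

import base64

# Build a data URL that can be dropped into the "image_url" field above
with open("photo.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")
data_url = f"data:image/jpeg;base64,{encoded}"
print(data_url[:60] + "...")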

Python OpenAI SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Text only
response = client.chat.completions.create(
    model="/model",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=100
)
print(response.choices[0].message.content)

# With image
response = client.chat.completions.create(
    model="/model",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
        ]
    }],
    max_tokens=500
)
print(response.choices[0].message.content)
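
The same endpoint also supports streaming through the standard OpenAI SDK, which helps for long multimodal answers:

# Stream tokens as they are generated
stream = client.chat.completions.create(
    model="/model",
    messages=[{"role": "user", "content": "Explain the FP8 KV-cache in two sentences."}],
    max_tokens=200,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()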

📊 Capabilities

| Modality | Input | Output | Quality |
|---|---|---|---|
| Text | ✅ | ✅ | Excellent |
| Images | ✅ | – | Excellent (BF16 ViT) |
| Video | ✅ | – | Excellent |
| Charts/Diagrams | ✅ | – | State-of-the-art |
| Documents/OCR | ✅ | – | State-of-the-art |
| Code | ✅ | ✅ | Excellent |
| Math | ✅ | ✅ | Excellent |

🔧 Quantization Details

This model was quantized using the following configuration:

# NVIDIA Model Optimizer (modelopt) configuration
import copy

import modelopt.torch.quantization as mtq

# Deep-copy the recipe so the shared library constant is not mutated in place
config = copy.deepcopy(mtq.NVFP4_AWQ_FULL_CFG)  # Best accuracy (<0.3% loss)

# Vision encoder exclusions (preserved at BF16)
exclusions = {
    "*visual*": {"enable": False},
    "*patch_embed*": {"enable": False},
    "*merger*": {"enable": False},
    "*vision*": {"enable": False},
    "*embed_tokens*": {"enable": False},
}
config["quant_cfg"].update(exclusions)

# Quantize with 512 calibration samples (model and calibration_loop are
# prepared beforehand; a sketch of the calibration loop follows below)
mtq.quantize(model, config, forward_loop=calibration_loop)
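
The calibration_loop passed to mtq.quantize is not part of the snippet above. A minimal sketch of what it could look like, assuming the 512 WikiText-2 samples listed in the specifications, the base Qwen/Qwen3-VL-32B-Instruct tokenizer, and an illustrative 512-token truncation length (these details are assumptions, not the exact recipe used for this checkpoint):

import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def build_calibration_loop(num_samples=512, max_length=512):
    # Hypothetical helper: feeds WikiText-2 text through the model so ModelOpt
    # can observe activation statistics during mtq.quantize().
    tokenizer = AutoTokenizer.from_pretrained(
        "Qwen/Qwen3-VL-32B-Instruct", trust_remote_code=True
    )
    texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")["text"]
    texts = [t for t in texts if t.strip()][:num_samples]

    def calibration_loop(model):
        # ModelOpt calls this with the model being quantized
        with torch.no_grad():
            for text in texts:
                inputs = tokenizer(
                    text, return_tensors="pt", truncation=True, max_length=max_length
                ).to(model.device)
                model(**inputs)  # forward pass only; outputs are discarded

    return calibration_loop

calibration_loop = build_calibration_loop()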

Why AWQ_FULL?

| Algorithm | Accuracy Loss | Calibration Required |
|---|---|---|
| DEFAULT | ~1.0% | No |
| AWQ_LITE | ~0.5% | 128 samples |
| AWQ_FULL | <0.3% | 512 samples |

We use AWQ_FULL for production deployments because the additional calibration time (30-60 minutes) is worth the superior accuracy retention.


🦌 More ELK-AI Optimized Models

| Model | Size | Type | Quantization | Link |
|---|---|---|---|---|
| Qwen3-VL-2B | 2.1 GB | Vision | NVFP4 | Docker Hub |
| Qwen3-VL-4B | 4.2 GB | Vision | NVFP4 | Docker Hub |
| Qwen3-VL-8B | 8.4 GB | Vision | NVFP4 | Docker Hub |
| Qwen3-VL-32B | 21 GB | Vision | NVFP4 | This model |
| Nemotron3-30B | 31.5 GB | Text | NVFP4 | Docker Hub |
| Devstral-24B | 53.8 GB | Code | FP8 | Docker Hub |

📜 License

  • Model Weights: Subject to Qwen License
  • Quantization & Container: Apache 2.0

πŸ™ Acknowledgments

  • Alibaba Qwen Team for the incredible Qwen3-VL model
  • NVIDIA for Model Optimizer and NVFP4 quantization
  • vLLM Team for the high-performance inference engine


Built with ❤️ by ELK-AI

Mutaz Al Awamleh • December 2025

Democratizing access to state-of-the-art AI


⭐ Star this repo if it helped you!
