Copyright (C) [2026] Advanced Micro Devices, Inc. All rights reserved. Portions of this file consist of AI generated content.
# Qwen3-VL ONNX (ONNX Runtime GenAI)

Convert Qwen3-VL checkpoints to ONNX Runtime GenAI format with dynamic image-size support, then run local multimodal inference.

This README is written in a model-card style so it can be moved into a Hugging Face repo with minimal changes.
## Overview

- Exports three ONNX components:
  - `qwen3vl-vision.onnx` (vision encoder, FP32)
  - `qwen3vl-embedding.onnx` (image-token embedding injector, FP32)
  - `model.onnx` (text decoder, fp32/fp16/int4)
- Wires `genai_config.json` for the ONNX Runtime GenAI multimodal `processor(...)` flow.
- Supports dynamic image grids through the runtime `image_grid_thw` input.
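The `image_grid_thw` input describes each image as a (temporal, height, width) patch grid. As a rough illustration only (the patch size of 16 and spatial merge factor of 2 are assumed values for Qwen3-VL, not taken from this README — check the checkpoint's `preprocessor_config.json`):

```python
# Illustrative sketch: derive image_grid_thw and the resulting number of
# image tokens from an input resolution. PATCH_SIZE and MERGE_SIZE are
# assumptions about the Qwen3-VL vision config.
PATCH_SIZE = 16   # pixels per vision patch (assumed)
MERGE_SIZE = 2    # spatial merge factor in the vision encoder (assumed)

def image_grid_thw(height: int, width: int, frames: int = 1):
    """Return the (t, h, w) patch grid for one image or clip."""
    return (frames, height // PATCH_SIZE, width // PATCH_SIZE)

def num_image_tokens(grid):
    """Tokens after spatial merging: t * h * w / MERGE_SIZE**2."""
    t, h, w = grid
    return t * h * w // (MERGE_SIZE * MERGE_SIZE)

grid = image_grid_thw(448, 448)
print(grid, num_image_tokens(grid))  # (1, 28, 28) 196
```

Because the grid is a runtime input, the exported graphs stay shape-dynamic rather than being fixed to one resolution.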
## Supported Checkpoints (validated locally with CPU EP)

- `Qwen/Qwen3-VL-2B-Instruct`
- `Qwen/Qwen3-VL-4B-Instruct`
- `Qwen/Qwen3-VL-8B-Instruct`
## Requirements

- Python environment with:
  - `onnxruntime-genai` (built from source or installed)
  - `transformers`
  - `huggingface_hub`
  - `torch`
- Local copy of `pytorch_reference/modeling_qwen3_vl.py` (already in this folder).
## Quickstart

Run commands from the onnxruntime-genai repo root:

```sh
cd examples/python/qwen3-vl
```

Download the patched `modeling_qwen3_vl.py` from Hugging Face into `examples/python/qwen3-vl/pytorch_reference`, and download `builder.py`, the inference script, and the test images into `examples/python/qwen3-vl`:

```sh
mkdir ./pytorch_reference
hf download onnx-community/Qwen3-4B-VL-ONNX --include "modeling_qwen3_vl.py" --local-dir "./pytorch_reference"
hf download onnx-community/Qwen3-4B-VL-ONNX --include "builder.py" --local-dir "."
hf download onnx-community/Qwen3-4B-VL-ONNX --include "qwen3vl-oga-inference.py" --local-dir "."
hf download onnx-community/Qwen3-4B-VL-ONNX --include "test_images/*" --local-dir "./test_images"
```
### 1) Download a model from Hugging Face

Use either `hf download` (recommended) or `huggingface-cli download`.

Qwen3-VL-2B-Instruct:

```sh
hf download Qwen/Qwen3-VL-2B-Instruct --local-dir "./pytorch_2b"
```

Qwen3-VL-4B-Instruct:

```sh
hf download Qwen/Qwen3-VL-4B-Instruct --local-dir "./pytorch_4b"
```

Qwen3-VL-8B-Instruct:

```sh
hf download Qwen/Qwen3-VL-8B-Instruct --local-dir "./pytorch_8b"
```
### 2) Export ONNX package

FP32 vision + FP32 text:

```powershell
& "python.exe" `
  "builder.py" `
  --input "./pytorch_4b" `
  --reference "./pytorch_reference" `
  --output "./qwen3-vl-4b-instruct-onnx-vision-fp32-text-fp32-cpu" `
  --precision fp32
```

FP32 vision + INT4 text:

```powershell
# 2B
& "python.exe" `
  "builder.py" `
  --input "./pytorch_2b" `
  --reference "./pytorch_reference" `
  --output "./qwen3-vl-2b-instruct-onnx-vision-fp32-text-int4-cpu" `
  --precision int4

# 4B
& "python.exe" `
  "builder.py" `
  --input "./pytorch_4b" `
  --reference "./pytorch_reference" `
  --output "./qwen3-vl-4b-instruct-onnx-vision-fp32-text-int4-cpu" `
  --precision int4

# 8B
& "python.exe" `
  "builder.py" `
  --input "./pytorch_8b" `
  --reference "./pytorch_reference" `
  --output "./qwen3-vl-8b-instruct-onnx-vision-fp32-text-int4-cpu" `
  --precision int4
```
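After an export completes, the output folder should contain the three ONNX components listed in the Overview plus `genai_config.json`. A minimal sanity-check sketch (the file list below is an assumption derived from the component names above; the folder will also hold tokenizer/processor files and external weight data):

```python
import os

# Files the export is expected to produce, per the component list in the
# Overview. This list is illustrative, not exhaustive.
EXPECTED = [
    "qwen3vl-vision.onnx",
    "qwen3vl-embedding.onnx",
    "model.onnx",
    "genai_config.json",
]

def missing_files(export_dir: str):
    """Return the expected files that are absent from export_dir."""
    present = set(os.listdir(export_dir))
    return [f for f in EXPECTED if f not in present]

if __name__ == "__main__":
    import sys
    gone = missing_files(sys.argv[1] if len(sys.argv) > 1 else ".")
    print("missing:", gone or "none")
```

Run it with the export output directory as the first argument; an empty "missing" list suggests the export finished writing all components.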
### 3) Sanity test: text-only

Run from the same folder:

```powershell
& "python.exe" `
  "qwen3vl-oga-inference.py" `
  -m "./qwen3-vl-8b-instruct-onnx-vision-fp32-text-int4-cpu" `
  -e follow_config `
  --non-interactive `
  -pr "Say hello in one short sentence."
```

Expected behavior: the model loads and returns a short greeting (for example, `Hello!`).
### 4) Sanity test: image + text

```powershell
& "python.exe" `
  "qwen3vl-oga-inference.py" `
  -m "./qwen3-vl-8b-instruct-onnx-vision-fp32-text-int4-cpu" `
  -e follow_config `
  --non-interactive `
  --image_paths "./test_images/img_50.jpg" `
  -pr "Describe this image in one sentence."
```

Expected behavior: the model returns a one-sentence description of the image.
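Under the hood, the multimodal processor expands each image into placeholder tokens inside the chat template before the embedding injector replaces them with vision features. As an illustrative sketch only (the special-token names follow the Qwen2-VL convention and are an assumption for Qwen3-VL; the bundled inference script handles this for you):

```python
# Illustrative only: typical single-image chat prompt layout for
# Qwen-style VL models. The special tokens (<|vision_start|>,
# <|image_pad|>, <|vision_end|>) are assumed from the Qwen2-VL
# convention, not confirmed against the Qwen3-VL tokenizer.
def build_prompt(user_text: str, num_images: int = 1) -> str:
    image_part = "<|vision_start|><|image_pad|><|vision_end|>" * num_images
    return (
        "<|im_start|>user\n"
        f"{image_part}{user_text}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(build_prompt("Describe this image in one sentence."))
```

At runtime the processor replaces each `<|image_pad|>` span with as many image tokens as the image's grid produces, which is why `image_grid_thw` must match the pixel inputs.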
## Notes

- Current script usage is validated for single-image inference per call.
- If you pass multiple images in one call, you may hit:
  ```
  RuntimeError: Expected pixel_values in CHW format [C, H, W], got rank 4
  ```
- Practical workaround: run one image per invocation.
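The one-image-per-invocation workaround is easy to script. A minimal sketch (the model folder and image glob are placeholders; each command mirrors the flags used in step 4 and would be handed to `subprocess.run`):

```python
import glob
import subprocess  # used when the run line below is uncommented

MODEL_DIR = "./qwen3-vl-8b-instruct-onnx-vision-fp32-text-int4-cpu"  # placeholder

def build_cmd(image_path: str, prompt: str):
    """Assemble one single-image inference invocation."""
    return [
        "python", "qwen3vl-oga-inference.py",
        "-m", MODEL_DIR,
        "-e", "follow_config",
        "--non-interactive",
        "--image_paths", image_path,
        "-pr", prompt,
    ]

for img in sorted(glob.glob("./test_images/*.jpg")):
    cmd = build_cmd(img, "Describe this image in one sentence.")
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually run
```

This keeps every call within the validated single-image path while still covering a whole folder of images.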
## Citation

If you use Qwen3-VL models, please cite the Qwen technical reports and model cards from the Qwen team.
## Appendix: export patch log (precise)

This workflow uses a local reference model file at `examples/python/qwen3-vl/pytorch_reference/modeling_qwen3_vl.py`.

The exporter entrypoint is `examples/python/qwen3-vl/builder.py` (not `export_for_oga.py`), with `--reference ./pytorch_reference` to load the patched class dynamically.
### A) modeling_qwen3_vl.py patches for dynamic ONNX export

- Vision attention tracing path: `Qwen3VLVisionAttention.forward` adds a tracing branch that bypasses the Python split/loop path (`lengths.tolist()`, `torch.split(...)`, per-chunk iteration) and runs a tensor-only attention flow.
- RoPE tracing path: `Qwen3VLVisionModel.rot_pos_emb` adds a tracing branch that computes rotary positions with tensor math only and requires `token_count`.
- Position embedding tracing fallback: `Qwen3VLVisionModel.forward` adds a tracing branch that uses a tensor-index fallback for positional embeddings instead of interpolation-heavy Python control flow.
### B) Additional 4 points (explicit)

- `cu_seqlens` dtype split for export/runtime: `Qwen3VLVisionModel.forward` uses `grid_thw.dtype` for `cu_seqlens` during tracing, `torch.int32` otherwise.
- Tracing-safe image feature output: `Qwen3VLModel.get_image_features` returns raw tensors while tracing and skips `torch.split(..., split_sizes.tolist())`.
- Tracing-safe cache behavior: `Qwen3VLTextModel.forward` does not create a `DynamicCache` while tracing.
- Model-size portability in exporter: the `builder.py` embedding export no longer hardcodes the hidden width (2560); it uses `embed_tokens.embedding_dim`, enabling 2B/4B/8B export.

These changes keep eager/runtime behavior intact while adding an export-safe, tensor-only tracing path for dynamic `image_grid_thw`.
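The patches above share one pattern: data-dependent Python control flow is guarded behind a tracing check so the exported graph stays shape-dynamic. A minimal sketch of that pattern (the function and shapes are hypothetical, not the actual patched code):

```python
import torch

def split_per_image(hidden: torch.Tensor, lengths: torch.Tensor):
    """Split a packed [total_tokens, dim] tensor into per-image chunks.

    Eager path: uses .tolist(), which is fine at runtime but would bake
    concrete sizes into an ONNX trace. Tracing path: returns the packed
    tensor untouched so the exported graph stays shape-dynamic.
    """
    if torch.jit.is_tracing():
        # Export-safe branch: tensor-only, no Python ints derived
        # from tensor data.
        return hidden
    # Runtime branch: ordinary per-image split.
    return list(torch.split(hidden, lengths.tolist(), dim=0))

# Eager usage: two images contributing 3 and 5 tokens each.
chunks = split_per_image(torch.randn(8, 4), torch.tensor([3, 5]))
print([tuple(c.shape) for c in chunks])  # [(3, 4), (5, 4)]
```

Because `torch.jit.is_tracing()` is False in normal execution, eager and deployed-runtime behavior are unchanged; only the ONNX trace takes the tensor-only branch.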