Copyright (C) [2026] Advanced Micro Devices, Inc. All rights reserved. Portions of this file consist of AI generated content.
# Qwen3-VL ONNX (ONNX Runtime GenAI)

Convert Qwen3-VL checkpoints to ONNX Runtime GenAI format with dynamic image-size support, then run local multimodal inference.

This README is written in a model-card style so it can be moved into a Hugging Face repo with minimal changes.
## Overview

- Exports three ONNX components:
  - `qwen3vl-vision.onnx` (vision encoder, FP32)
  - `qwen3vl-embedding.onnx` (image-token embedding injector, FP32)
  - `model.onnx` (text decoder, fp32/fp16/int4)
- Wires `genai_config.json` for the ONNX Runtime GenAI multimodal `processor(...)` flow.
- Supports dynamic image grids through the runtime `image_grid_thw` input.
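The `image_grid_thw` input describes each image as a (temporal, height, width) patch grid. As a rough illustration only (the patch size of 16 and spatial merge factor of 2 are assumed values for Qwen3-VL, not taken from this README — check the checkpoint's `preprocessor_config.json`):

```python
# Illustrative sketch: derive image_grid_thw and the resulting number of
# image tokens from an input resolution. PATCH_SIZE and MERGE_SIZE are
# assumptions about the Qwen3-VL vision config.
PATCH_SIZE = 16   # pixels per vision patch (assumed)
MERGE_SIZE = 2    # spatial merge factor in the vision encoder (assumed)

def image_grid_thw(height: int, width: int, frames: int = 1):
    """Return the (t, h, w) patch grid for one image or clip."""
    return (frames, height // PATCH_SIZE, width // PATCH_SIZE)

def num_image_tokens(grid):
    """Tokens after spatial merging: t * h * w / MERGE_SIZE**2."""
    t, h, w = grid
    return t * h * w // (MERGE_SIZE * MERGE_SIZE)

grid = image_grid_thw(448, 448)
print(grid, num_image_tokens(grid))  # (1, 28, 28) 196
```

Because the grid is a runtime input, the exported graphs stay shape-dynamic rather than being fixed to one resolution.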
## Supported Checkpoints (validated locally with CPU EP)

- `Qwen/Qwen3-VL-2B-Instruct`
- `Qwen/Qwen3-VL-4B-Instruct`
- `Qwen/Qwen3-VL-8B-Instruct`
## Requirements

- Python environment with:
  - `onnxruntime-genai` (built from source or installed)
  - `transformers`
  - `huggingface_hub`
  - `torch`
- Local copy of `pytorch_reference/modeling_qwen3_vl.py` (already in this folder).
## Quickstart

Run commands from the onnxruntime-genai repo root:

```sh
cd examples/python/qwen3-vl
```

Download the patched `modeling_qwen3_vl.py` from Hugging Face into `examples/python/qwen3-vl/pytorch_reference`, and download `builder.py`, the inference script, and the test images into `examples/python/qwen3-vl`:

```sh
mkdir ./pytorch_reference
hf download onnx-community/Qwen3-4B-VL-ONNX --include "modeling_qwen3_vl.py" --local-dir "./pytorch_reference"
hf download onnx-community/Qwen3-4B-VL-ONNX --include "builder.py" --local-dir "."
hf download onnx-community/Qwen3-4B-VL-ONNX --include "qwen3vl-oga-inference.py" --local-dir "."
hf download onnx-community/Qwen3-4B-VL-ONNX --include "test_images/*" --local-dir "./test_images"
```
### 1) Download a model from Hugging Face

Use either `hf download` (recommended) or `huggingface-cli download`.

Qwen3-VL-2B-Instruct:

```sh
hf download Qwen/Qwen3-VL-2B-Instruct --local-dir "./pytorch_2b"
```

Qwen3-VL-4B-Instruct:

```sh
hf download Qwen/Qwen3-VL-4B-Instruct --local-dir "./pytorch_4b"
```

Qwen3-VL-8B-Instruct:

```sh
hf download Qwen/Qwen3-VL-8B-Instruct --local-dir "./pytorch_8b"
```
### 2) Export ONNX package

FP32 vision + FP32 text:

```powershell
& "python.exe" `
  "builder.py" `
  --input "./pytorch_4b" `
  --reference "./pytorch_reference" `
  --output "./qwen3-vl-4b-instruct-onnx-vision-fp32-text-fp32-cpu" `
  --precision fp32
```

FP32 vision + INT4 text:

```powershell
# 2B
& "python.exe" `
  "builder.py" `
  --input "./pytorch_2b" `
  --reference "./pytorch_reference" `
  --output "./qwen3-vl-2b-instruct-onnx-vision-fp32-text-int4-cpu" `
  --precision int4

# 4B
& "python.exe" `
  "builder.py" `
  --input "./pytorch_4b" `
  --reference "./pytorch_reference" `
  --output "./qwen3-vl-4b-instruct-onnx-vision-fp32-text-int4-cpu" `
  --precision int4

# 8B
& "python.exe" `
  "builder.py" `
  --input "./pytorch_8b" `
  --reference "./pytorch_reference" `
  --output "./qwen3-vl-8b-instruct-onnx-vision-fp32-text-int4-cpu" `
  --precision int4
```
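After an export completes, the output folder should contain the three ONNX components listed in the Overview plus `genai_config.json`. A minimal sanity-check sketch (the file list below is an assumption derived from the component names above; the folder will also hold tokenizer/processor files and external weight data):

```python
import os

# Files the export is expected to produce, per the component list in the
# Overview. This list is illustrative, not exhaustive.
EXPECTED = [
    "qwen3vl-vision.onnx",
    "qwen3vl-embedding.onnx",
    "model.onnx",
    "genai_config.json",
]

def missing_files(export_dir: str):
    """Return the expected files that are absent from export_dir."""
    present = set(os.listdir(export_dir))
    return [f for f in EXPECTED if f not in present]

if __name__ == "__main__":
    import sys
    gone = missing_files(sys.argv[1] if len(sys.argv) > 1 else ".")
    print("missing:", gone or "none")
```

Run it with the export output directory as the first argument; an empty "missing" list suggests the export finished writing all components.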
### 3) Sanity test: text-only

Run from the same folder:

```powershell
& "python.exe" `
  "qwen3vl-oga-inference.py" `
  -m "./qwen3-vl-8b-instruct-onnx-vision-fp32-text-int4-cpu" `
  -e follow_config `
  --non-interactive `
  -pr "Say hello in one short sentence."
```

Expected behavior: the model loads and returns a short greeting (for example, `Hello!`).
### 4) Sanity test: image + text

```powershell
& "python.exe" `
  "qwen3vl-oga-inference.py" `
  -m "./qwen3-vl-8b-instruct-onnx-vision-fp32-text-int4-cpu" `
  -e follow_config `
  --non-interactive `
  --image_paths "./test_images/img_50.jpg" `
  -pr "Describe this image in one sentence."
```

Expected behavior: the model returns a one-sentence description of the image.
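Under the hood, the multimodal processor expands each image into placeholder tokens inside the chat template before the embedding injector replaces them with vision features. As an illustrative sketch only (the special-token names follow the Qwen2-VL convention and are an assumption for Qwen3-VL; the bundled inference script handles this for you):

```python
# Illustrative only: typical single-image chat prompt layout for
# Qwen-style VL models. The special tokens (<|vision_start|>,
# <|image_pad|>, <|vision_end|>) are assumed from the Qwen2-VL
# convention, not confirmed against the Qwen3-VL tokenizer.
def build_prompt(user_text: str, num_images: int = 1) -> str:
    image_part = "<|vision_start|><|image_pad|><|vision_end|>" * num_images
    return (
        "<|im_start|>user\n"
        f"{image_part}{user_text}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(build_prompt("Describe this image in one sentence."))
```

At runtime the processor replaces each `<|image_pad|>` span with as many image tokens as the image's grid produces, which is why `image_grid_thw` must match the pixel inputs.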
## Notes

- Current script usage is validated for single-image inference per call.
- If you pass multiple images in one call, you may hit:
  ```
  RuntimeError: Expected pixel_values in CHW format [C, H, W], got rank 4
  ```
- Practical workaround: run one image per invocation.
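The one-image-per-invocation workaround is easy to script. A minimal sketch (the model folder and image glob are placeholders; each command mirrors the flags used in step 4 and would be handed to `subprocess.run`):

```python
import glob
import subprocess  # used when the run line below is uncommented

MODEL_DIR = "./qwen3-vl-8b-instruct-onnx-vision-fp32-text-int4-cpu"  # placeholder

def build_cmd(image_path: str, prompt: str):
    """Assemble one single-image inference invocation."""
    return [
        "python", "qwen3vl-oga-inference.py",
        "-m", MODEL_DIR,
        "-e", "follow_config",
        "--non-interactive",
        "--image_paths", image_path,
        "-pr", prompt,
    ]

for img in sorted(glob.glob("./test_images/*.jpg")):
    cmd = build_cmd(img, "Describe this image in one sentence.")
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually run
```

This keeps every call within the validated single-image path while still covering a whole folder of images.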
## Citation

If you use Qwen3-VL models, please cite the Qwen technical reports and model cards from the Qwen team.
## Appendix: export patch log (precise)

This workflow uses a local reference model file at `examples/python/qwen3-vl/pytorch_reference/modeling_qwen3_vl.py`.

The exporter entrypoint is `examples/python/qwen3-vl/builder.py` (not `export_for_oga.py`), with `--reference ./pytorch_reference` to load the patched class dynamically.
### A) modeling_qwen3_vl.py patches for dynamic ONNX export

- Vision attention tracing path: `Qwen3VLVisionAttention.forward` adds a tracing branch that bypasses the Python split/loop path (`lengths.tolist()`, `torch.split(...)`, per-chunk iteration) and runs a tensor-only attention flow.
- RoPE tracing path: `Qwen3VLVisionModel.rot_pos_emb` adds a tracing branch that computes rotary positions with tensor math only and requires `token_count`.
- Position embedding tracing fallback: `Qwen3VLVisionModel.forward` adds a tracing branch that uses a tensor-index fallback for positional embeddings instead of interpolation-heavy Python control flow.
### B) Additional 4 points (explicit)

- `cu_seqlens` dtype split for export/runtime: `Qwen3VLVisionModel.forward` uses `grid_thw.dtype` for `cu_seqlens` during tracing, `torch.int32` otherwise.
- Tracing-safe image feature output: `Qwen3VLModel.get_image_features` returns raw tensors while tracing and skips `torch.split(..., split_sizes.tolist())`.
- Tracing-safe cache behavior: `Qwen3VLTextModel.forward` does not create a `DynamicCache` while tracing.
- Model-size portability in exporter: the `builder.py` embedding export no longer hardcodes the hidden width (2560); it uses `embed_tokens.embedding_dim`, enabling 2B/4B/8B export.

These changes keep eager/runtime behavior intact while adding an export-safe, tensor-only tracing path for dynamic `image_grid_thw`.
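The patches above share one pattern: data-dependent Python control flow is guarded behind a tracing check so the exported graph stays shape-dynamic. A minimal sketch of that pattern (the function and shapes are hypothetical, not the actual patched code):

```python
import torch

def split_per_image(hidden: torch.Tensor, lengths: torch.Tensor):
    """Split a packed [total_tokens, dim] tensor into per-image chunks.

    Eager path: uses .tolist(), which is fine at runtime but would bake
    concrete sizes into an ONNX trace. Tracing path: returns the packed
    tensor untouched so the exported graph stays shape-dynamic.
    """
    if torch.jit.is_tracing():
        # Export-safe branch: tensor-only, no Python ints derived
        # from tensor data.
        return hidden
    # Runtime branch: ordinary per-image split.
    return list(torch.split(hidden, lengths.tolist(), dim=0))

# Eager usage: two images contributing 3 and 5 tokens each.
chunks = split_per_image(torch.randn(8, 4), torch.tensor([3, 5]))
print([tuple(c.shape) for c in chunks])  # [(3, 4), (5, 4)]
```

Because `torch.jit.is_tracing()` is False in normal execution, eager and deployed-runtime behavior are unchanged; only the ONNX trace takes the tensor-only branch.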