---
license: mit
language:
- en
- zh
base_model:
- Qwen/Qwen3-VL-2B-Instruct
- Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- Qwen3-VL
- Qwen3-VL-2B-Instruct
- Qwen3-VL-4B-Instruct
- Int4
- VLM
- GPTQ
---

# Qwen3-VL-4B-Instruct-GPTQ-Int4

This version of Qwen3-VL-4B-Instruct have been converted to run on the Axera NPU using **w4a16** quantization. 

Compatible with Pulsar2 version: 5.0

## Convert tools links:

For those who are interested in model conversion, you can try to export axmodel through the original repo : 

- https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct
- https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct

[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html) 

[AXera NPU HOST LLM Runtime](https://github.com/AXERA-TECH/Qwen3-VL.AXERA) 


## Support Platform

- AX650
  - AX650N DEMO Board
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://docs.m5stack.com/zh_CN/ai_hardware/LLM-8850_Card)

**Image Process**
|Chips| input size | image num | image encoder | ttft(168 tokens) | w4a16 | CMM | Flash |
|--|--|--|--|--|--|--|--|
|AX650| 384*384 | 1 | 222 ms | 678 ms | 7.0 tokens/sec| 5.6GiB | 5.6GiB |

**Video Process**
|Chips| input size | image num | image encoder |ttft(600 tokens) | w4a16 | CMM | Flash |
|--|--|--|--|--|--|--|--|
|AX650| 384*384 | 8  | 773 ms | 1887 ms | 7.1 tokens/sec| 5.6GiB | 5.6GiB |

**Image Process (Image Encoder U8+U16 Quantization)**
|Chips| input size | image num | image encoder | ttft(168 tokens) | w4a16 | CMM | Flash |
|--|--|--|--|--|--|--|--|
|AX650| 384*384 | 1 | 143 ms | 678 ms | 7.0 tokens/sec| 5.6GiB | 5.6GiB |

**Video Process (Image Encoder U8+U16 Quantization)**
|Chips| input size | image num | image encoder |ttft(600 tokens) | w4a16 | CMM | Flash |
|--|--|--|--|--|--|--|--|
|AX650| 384*384 | 8  | 498 ms | 1887 ms | 7.1 tokens/sec| 5.6GiB | 5.6GiB |

The DDR capacity refers to the CMM memory that needs to be consumed. Ensure that the CMM memory allocation on the development board is greater than this value.

## How to use

Download all files from this repository to the device

**If you using AX650 Board**

### Demo Run

#### Image understand demo

- input text

```
描述这张图片
```

- input image

![](./images/recoAll_attractions_1.jpg)

```
root@ax650 ~/Qwen3-VL-4B-Instruct-GPTQ-Int4 # bash run_image_ax650.sh 
[I][                            Init][ 156]: LLM init start
[I][                            Init][ 158]: Total CMM:7884 MB
[I][                            Init][  34]: connect http://127.0.0.1:8080 ok
bos_id: -1, eos_id: 151645
img_start_token: 151652
img_context_token: 151655
  2% | █                                 |   1 /  39 [0.01s<0.58s, 66.67 count/s] tokenizer init ok[I][                            Init][  26]: LLaMaEmbedSelector use mmap
  5% | ██                                |   2 /  39 [0.02s<0.37s, 105.26 count/s] embed_selector init ok[I][                            Init][ 201]: attr.axmodel_num:36
102% | █████████████████████████████████ |  40 /  39 [11.33s<11.05s, 3.53 count/s] init vpm axmodel ok,remain_cmm(2199 MB)[I][                            Init][ 266]: IMAGE_CONTEXT_TOKEN: 151655, IMAGE_START_TOKEN: 151652
[I][                            Init][ 309]: image encoder output float32

[I][                            Init][ 339]: max_token_len : 2047
[I][                            Init][ 344]: kv_cache_size : 1024, kv_cache_num: 2047
[I][                            Init][ 352]: prefill_token_num : 128
[I][                            Init][ 356]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 356]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 356]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 356]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 356]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 356]: grp: 6, prefill_max_token_num : 640
[I][                            Init][ 356]: grp: 7, prefill_max_token_num : 768
[I][                            Init][ 356]: grp: 8, prefill_max_token_num : 896
[I][                            Init][ 356]: grp: 9, prefill_max_token_num : 1024
[I][                            Init][ 356]: grp: 10, prefill_max_token_num : 1152
[I][                            Init][ 360]: prefill_max_token_num : 1152
[I][                            Init][ 372]: LLM init ok
[I][                            Init][ 374]: Left CMM:2199 MB
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这张图片
image >> images/recoAll_attractions_1.jpg
[I][                     EncodeImage][ 440]: pixel_values size 1
[I][                     EncodeImage][ 441]: grid_h 24 grid_w 24
[I][                     EncodeImage][ 489]: image encode time : 222.440994 ms, size : 1
[I][                          Encode][ 532]: input_ids size:168
[I][                          Encode][ 540]: offset 15
[I][                          Encode][ 569]: img_embed.size:1, 368640
[I][                          Encode][ 583]: out_embed size:430080
[I][                          Encode][ 584]: input_ids size 168
[I][                          Encode][ 586]: position_ids size:168
[I][                             Run][ 607]: input token num : 168, prefill_split_num : 2
[I][                             Run][ 641]: input_num_token:128
[I][                             Run][ 641]: input_num_token:40
[I][                             Run][ 865]: ttft: 676.16 ms
这张图片展示了埃及吉萨的金字塔群，背景是晴朗的蓝天，前景是广阔的沙漠。

画面中主要可见三座金字塔：
- 最大的一座是著名的**胡夫金字塔**，它位于画面中央偏左，是三座金字塔中最高、最显眼的。
- 在其右侧，是稍小一些的**卡纳克金字塔**（或称“卡纳克金字塔”）。
- 在画面最左侧，可以看到一座更小的金字塔，可能是**门卡乌金字塔**或**哈夫拉金字塔**。

这三座金字塔都是古埃及法老的陵墓，是古代世界七大奇迹中唯一现存的。它们的结构和规模令人惊叹，体现了古埃及人在建筑、数学和天文学方面的卓越成就。

整个场景在阳光下显得庄严而神秘，是埃及最具代表性的历史遗迹之一。

[N][                             Run][ 992]: hit eos,avg 7.12 token/s
```

#### Video understand demo

- input text  

```
描述这个视频
```

- input video  

./video  

```
root@ax650 ~/Qwen3-VL-4B-Instruct-GPTQ-Int4 # bash run_video_ax650.sh 
[I][                            Init][ 156]: LLM init start
[I][                            Init][ 158]: Total CMM:7884 MB
[I][                            Init][  34]: connect http://127.0.0.1:8080 ok
bos_id: -1, eos_id: 151645
img_start_token: 151652
img_context_token: 151656
  2% | █                                 |   1 /  39 [0.02s<0.62s, 62.50 count/s] tokenizer init ok[I][                            Init][  26]: LLaMaEmbedSelector use mmap
  5% | ██                                |   2 /  39 [0.02s<0.39s, 100.00 count/s] embed_selector init ok[I][                            Init][ 201]: attr.axmodel_num:36
102% | █████████████████████████████████ |  40 /  39 [44.70s<43.58s, 0.89 count/s] init vpm axmodel ok,remain_cmm(2199 MB)[I][                            Init][ 266]: IMAGE_CONTEXT_TOKEN: 151656, IMAGE_START_TOKEN: 151652
[I][                            Init][ 309]: image encoder output float32

[I][                            Init][ 339]: max_token_len : 2047
[I][                            Init][ 344]: kv_cache_size : 1024, kv_cache_num: 2047
[I][                            Init][ 352]: prefill_token_num : 128
[I][                            Init][ 356]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 356]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 356]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 356]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 356]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 356]: grp: 6, prefill_max_token_num : 640
[I][                            Init][ 356]: grp: 7, prefill_max_token_num : 768
[I][                            Init][ 356]: grp: 8, prefill_max_token_num : 896
[I][                            Init][ 356]: grp: 9, prefill_max_token_num : 1024
[I][                            Init][ 356]: grp: 10, prefill_max_token_num : 1152
[I][                            Init][ 360]: prefill_max_token_num : 1152
[I][                            Init][ 372]: LLM init ok
[I][                            Init][ 374]: Left CMM:2199 MB
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这个视频
video >> video
video/frame_0000.jpg
video/frame_0008.jpg
video/frame_0016.jpg
video/frame_0024.jpg
video/frame_0032.jpg
video/frame_0040.jpg
video/frame_0048.jpg
video/frame_0056.jpg
[I][                     EncodeImage][ 440]: pixel_values size 4
[I][                     EncodeImage][ 441]: grid_h 24 grid_w 24
[I][                     EncodeImage][ 489]: image encode time : 773.406006 ms, size : 4
[I][                          Encode][ 532]: input_ids size:600
[I][                          Encode][ 540]: offset 15
[I][                          Encode][ 569]: img_embed.size:4, 368640
[I][                          Encode][ 574]: offset:159
[I][                          Encode][ 574]: offset:303
[I][                          Encode][ 574]: offset:447
[I][                          Encode][ 583]: out_embed size:1536000
[I][                          Encode][ 584]: input_ids size 600
[I][                          Encode][ 586]: position_ids size:600
[I][                             Run][ 607]: input token num : 600, prefill_split_num : 5
[I][                             Run][ 641]: input_num_token:128
[I][                             Run][ 641]: input_num_token:128
[I][                             Run][ 641]: input_num_token:128
[I][                             Run][ 641]: input_num_token:128
[I][                             Run][ 641]: input_num_token:88

[I][                             Run][ 865]: ttft: 1886.83 ms
这个视频展示了一群**土拨鼠**（或称“旱獭”）在山间草地上嬉戏打斗的场景。

**画面细节：**

- **主体动物**：画面中有多只土拨鼠，它们毛色以灰、棕、白相间，腹部和四肢颜色较浅，背部较深。它们体型圆润，耳朵短小，表情生动。
- **动作**：这些土拨鼠似乎在进行一场“打斗”或“嬉戏”。它们互相扑腾、跳跃、用前爪拍打、甚至互相“拥抱”或“推搡”。动作非常活跃，充满动感，有些画面甚至有轻微的运动模糊，增强了动态感。
- **背景**：背景是连绵起伏的山峦，山坡上覆盖着绿色植被，远处可见裸露的岩石和山体，天空湛蓝，阳光明媚，说明是白天晴朗的天气。
- **前景**：它们站在一片布满小石子和草的地面，看起来像是山间小径或开阔地。
- **构图**：画面采用近景特写，聚焦于土拨鼠的互动，背景虚化，突出了主体的动态和表情。整体构图充满活力和趣味性。

**风格与氛围：**

- ��张图片/视频具有**拟人化和趣味性**，土拨鼠的动作被夸张化，仿佛在“打斗”或“跳舞”，非常可爱。
- 画面色彩明亮，阳光充足，给人一种**自然、活泼、欢乐**的感觉。

**总结：**

这是一段充满趣味和活力的野生动物短片，展现了土拨鼠在自然环境中的社交行为，它们的“打斗”其实可能是玩耍、争夺领地或建立社交关系的自然行为。整体画面生动、可爱，极具观赏性。

---

**注意**：虽然土拨鼠（旱獭）在野外确实会互相打斗，但这种“打斗”通常是**玩耍或社交行为**，并非真正的攻击。视频中的“打斗”更像是它们的社交互动，非常可爱。

[N][                             Run][ 992]: hit eos,avg 7.10 token/s

prompt >> q
```

### Gradio demo

#### start openai style api server
if the tokenizer server is not run in the same machine,please modify the tokenizer server ip in shell file.
```shell
pip3 install -r requirements.txt
# for axcl x86
./run_axcl_x86_api.sh
# for axcl aarch64
./run_axcl_aarch64_api.sh
# for ax650
./run_ax650_api.sh
```

#### start gradio demo
if the api server is not run in the same machine,please modify the api url in gradio web ui.
```shell
python gradio_demo.py
```

![image](/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F64b7837c17570fdff9b906b9%2FOg9fPNi0chg768gicse7M.png)