FastVLM-1.5B

This version of FastVLM-1.5B has been converted to run on the Axera NPU using w8a16 quantization.


Compatible with Pulsar2 version: 5.1-patch1.

Note that the model's context length is 1k tokens and the maximum prefill length is 640 tokens.
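As a rough illustration of these limits, the sketch below checks whether a prompt fits the prefill window and how many tokens would remain for generation. It assumes that "1k" means 1024 tokens and that prompt and generated tokens share the same context. The constant and function names are hypothetical and are not taken from infer_axmodel.py.

```python
CONTEXT_LEN = 1024      # assumption: "1k" context means 1024 tokens
MAX_PREFILL_LEN = 640   # maximum prefill length stated in this README


def generation_budget(prompt_tokens: int) -> int:
    """Return how many tokens may still be generated after the prompt.

    Raises ValueError if the prompt does not fit the prefill limit.
    Assumes prompt and generated tokens share one context window.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    if prompt_tokens > MAX_PREFILL_LEN:
        raise ValueError(
            f"prompt has {prompt_tokens} tokens, prefill limit is {MAX_PREFILL_LEN}"
        )
    return CONTEXT_LEN - prompt_tokens


if __name__ == "__main__":
    # 291 tokens is the prompt length reported for a 1024x1024 image.
    print(generation_budget(291))
```

Under these assumptions, a prompt that uses the full 640-token prefill would still leave 384 tokens for the answer.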

Conversion tool links:

If you are interested in model conversion, you can try exporting the axmodel from the original repository:

https://huggingface.co/apple/FastVLM-1.5B

How to convert an LLM from Hugging Face to axmodel [TODO]

Supported Platforms

| Chip | Image encoder | TTFT | w8a16 |
|---|---|---|---|
| AX650 | 216.257 ms (1024x1024) | 861.213 ms (291 tokens) | 11.90 tokens/sec |
| AX650 | 44.747 ms (512x512) | 221.357 ms (99 tokens) | 11.90 tokens/sec |
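
To get a feel for what these numbers mean end to end, the following sketch adds them up for a single request. It assumes that the reported TTFT does not already include the image encoder time and that decoding runs at the constant rate listed in the table. Both are assumptions, and the function name is hypothetical.

```python
def estimate_response_seconds(encoder_ms: float, ttft_ms: float,
                              new_tokens: int, tokens_per_sec: float) -> float:
    """Estimate wall-clock seconds as encoder time + TTFT + decode time."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    if new_tokens < 0:
        raise ValueError("new_tokens must be non-negative")
    decode_s = new_tokens / tokens_per_sec
    return (encoder_ms + ttft_ms) / 1000.0 + decode_s


if __name__ == "__main__":
    # AX650 figures copied from the table above.
    rows = [
        ("1024x1024", 216.257, 861.213),
        ("512x512", 44.747, 221.357),
    ]
    for label, encoder_ms, ttft_ms in rows:
        total = estimate_response_seconds(encoder_ms, ttft_ms,
                                          new_tokens=100, tokens_per_sec=11.90)
        print(f"{label}: about {total:.2f} s for 100 generated tokens")
```

With these inputs the decode phase dominates: the smaller image size mainly shortens the wait before the first token rather than the total time of a long answer.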

How to use

Download all files from this repository to the device

$ tree -L 1
.
├── fastvlm_ax650_context_1k_prefill_640
├── fastvlm_tokenizer
├── images
├── infer_axmodel.py
├── README.md
└── utils

5 directories, 2 files
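
Before running inference, it can help to confirm that the download is complete. The following is a minimal sketch that checks for the top-level entries shown in the listing above. The function name is hypothetical, and the check does not look inside the directories.

```python
from __future__ import annotations

from pathlib import Path

# Entries taken from the `tree -L 1` listing above.
EXPECTED_DIRS = [
    "fastvlm_ax650_context_1k_prefill_640",
    "fastvlm_tokenizer",
    "images",
    "utils",
]
EXPECTED_FILES = ["infer_axmodel.py", "README.md"]


def missing_entries(root: str = ".") -> list[str]:
    """Return the expected top-level entries that are absent under root."""
    base = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (base / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (base / f).is_file()]
    return missing


if __name__ == "__main__":
    missing = missing_entries()
    if missing:
        print(f"Missing entries: {missing}")
    else:
        print("All expected entries found.")
```

Run it from the directory that contains the downloaded files.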

Install transformers and the other Python dependencies

pip install -r requirements.txt

Inference on an AX650 host, such as the M4N-Dock (AXera-Pi Pro) or the AX650 DEMO Board

Run the following command on the Axera board to start a chat conversation:

$ python infer_axmodel.py -v ./fastvlm_ax650_context_1k_prefill_640/image_encoder_1024x1024.axmodel -m ./fastvlm_ax650_context_1k_prefill_640 -t ./fastvlm_tokenizer/ -i 1024

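Output (the Chinese log lines prompt you to enter text to chat, enter an image path for image understanding, or enter q to quit, and then confirm that the conversation has ended):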

[INFO] Available providers:  ['AXCLRTExecutionProvider']
Loading config, tokenizer and init model.
Detected prefixes: ['llava_qwen2'], chosen: llava_qwen2, layers: 28
Init InferenceSession:   0%|                                                                                                                          | 0/28 [00:00<?, ?it/s][INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 5.1-patch1-dirty 140e8d4a-dirty
Init InferenceSession:   4%|████                                                                                                              | 1/28 [00:01<00:28,  1.05s/it][INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 5.1-patch1-dirty 140e8d4a-dirty
Init InferenceSession:   7%|████████▏                                                                                                         | 2/28 [00:01<00:21,  1.20it/s][INFO] Using provider: AXCLRTExecutionProvider
...
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 5.1-patch1-dirty 140e8d4a-dirty
Init InferenceSession: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:19<00:00,  1.43it/s]
[INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 5.1-patch1-dirty 140e8d4a-dirty
Model loaded successfully!
[INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 5.1-patch1-dirty 140e8d4a-dirty
[INFO]: 输入文本进行对话,或者输入图片路径进行图片理解, 或者输入q退出对话。
prompt<<who are you
slice_indices: [0]
Slice prefill done: 0
answer >> I am an artificial intelligence designed and developed by Apple Inc. I am a natural language processing model that can understand and respond to user input in a conversational manner. I can answer questions, provide information, and engage in discussions on a wide range of topics. I am designed to be helpful, informative, and friendly, and I am constantly learning and improving to provide the best possible experience for users.

prompt<<./images/ssd_horse.jpg
slice_indices: [0, 1, 2]
Slice prefill done: 0
Slice prefill done: 1
Slice prefill done: 2
answer >> The image depicts a serene outdoor scene featuring a person riding a brown horse with a white blaze on its face. The rider, who has short brown hair, is wearing a blue hoodie, blue jeans, and black boots. The horse is equipped with a saddle and a bridle, and it stands on a dirt ground.

In the foreground, a brown dog with a pink collar is sitting on the ground, looking up at the rider with its mouth open, possibly in anticipation or excitement.

In the background, there is a silver pickup truck parked near a fence, and beyond the fence, there are trees and a few people sitting on a bench. The sky is overcast, suggesting a cloudy day. The overall atmosphere of the image is calm and peaceful, capturing a moment of connection between the rider, the horse, and the dog.

prompt<<./images/image_1.jpg
slice_indices: [0, 1, 2]
Slice prefill done: 0
Slice prefill done: 1
Slice prefill done: 2
answer >> The image depicts a panda bear in a natural setting. The panda is sitting on the ground, surrounded by green bamboo leaves and plants. The panda has a distinctive black and white fur pattern, with black patches around its eyes, ears, and limbs, and a white face and body. The panda appears to be holding a bamboo leaf in its mouth, which is a common food source for pandas. The background includes a wooden structure, possibly a part of a bamboo enclosure, and some rocks. The overall scene suggests that the panda is in a zoo or a wildlife sanctuary.

prompt<<q
[INFO]: 对话结束,再见。
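
The `slice_indices` lines in the log suggest that the prompt is prefilled in fixed-size slices. Below is a minimal sketch of that bookkeeping, assuming a slice length of 128 tokens. This value is a guess that is not confirmed by this repository, although it would be consistent with a 291-token image prompt producing three slices. The function name mirrors the log label and is otherwise hypothetical.

```python
def slice_indices(prompt_tokens: int, slice_len: int = 128) -> list:
    """Return the indices of the prefill slices needed for a prompt.

    slice_len=128 is an assumption, not a value documented in this README.
    """
    if prompt_tokens <= 0:
        raise ValueError("prompt_tokens must be positive")
    if slice_len <= 0:
        raise ValueError("slice_len must be positive")
    num_slices = -(-prompt_tokens // slice_len)  # ceiling division
    return list(range(num_slices))


if __name__ == "__main__":
    print(slice_indices(291))  # image prompt length from the platform table
    print(slice_indices(20))   # a short text-only prompt
```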