FastVLM-1.5B
This version of FastVLM-1.5B has been converted to run on the Axera NPU using w8a16 quantization.
This model has been optimized with the following LoRA:
Compatible with Pulsar2 version: 5.1-patch1.
Note that the model's context length is 1k tokens and the maximum prefill length is 640 tokens.
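The two limits above can be turned into a quick budget check before sending a prompt. The following is a minimal sketch, not part of this repository: it assumes that "1k" means 1024 tokens and that prompt tokens count against the context. The helper names `max_new_tokens` and `prefill_slices` are hypothetical, and the slice length of 128 in the usage line is an illustrative guess, not a value documented here.

```python
import math

CONTEXT_LEN = 1024      # assumption: "1k" context means 1024 tokens
MAX_PREFILL_LEN = 640   # maximum prefill length stated in this README


def max_new_tokens(prompt_tokens):
    """Return how many tokens can still be generated after the prompt.

    Assumes the prompt shares the context window with the generated text.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    if prompt_tokens > MAX_PREFILL_LEN:
        raise ValueError(
            f"prompt has {prompt_tokens} tokens, "
            f"but the prefill limit is {MAX_PREFILL_LEN}"
        )
    return CONTEXT_LEN - prompt_tokens


def prefill_slices(prompt_tokens, slice_len):
    """Return the indices of the fixed-size slices needed to prefill a prompt."""
    if slice_len <= 0:
        raise ValueError("slice_len must be positive")
    return list(range(math.ceil(prompt_tokens / slice_len)))


if __name__ == "__main__":
    print(max_new_tokens(291))
    # slice_len=128 is a hypothetical value used only for illustration
    print(prefill_slices(291, 128))
```

A prompt longer than 640 tokens is rejected up front here; whether the real runtime truncates, rejects, or fails on such a prompt is not stated in this README.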
Conversion tool links:
If you are interested in model conversion, you can try exporting the axmodel from the original repo:
https://huggingface.co/apple/FastVLM-1.5B
How to Convert LLM from Huggingface to axmodel[TODO]
Supported Platforms
- AX650
- AX650N DEMO Board
- M4N-Dock(爱芯派Pro)
- M.2 Accelerator card
| Chip | Image encoder latency (input size) | TTFT (prompt tokens) | Decode speed (w8a16) |
|---|---|---|---|
| AX650 | 216.257 ms (1024x1024) | 861.213 ms (291tokens) | 11.90 tokens/sec |
| AX650 | 44.747 ms (512x512) | 221.357 ms (99tokens) | 11.90 tokens/sec |
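The figures in the table can be combined into a rough prefill throughput and an end-to-end latency estimate. The sketch below only does arithmetic on the numbers listed above. It assumes that image encoding, prefill (TTFT), and decoding run one after another and that the TTFT column does not already include the image encoder time. Neither assumption is confirmed by this README, and the function names are hypothetical.

```python
DECODE_TPS = 11.90  # decode speed from the table, tokens per second

# (input size, image encoder ms, TTFT ms, prompt tokens), copied from the table
ROWS = [
    ("1024x1024", 216.257, 861.213, 291),
    ("512x512", 44.747, 221.357, 99),
]


def prefill_tokens_per_sec(prompt_tokens, ttft_ms):
    """Prompt tokens processed per second during prefill."""
    return prompt_tokens / (ttft_ms / 1000.0)


def estimated_latency_ms(encoder_ms, ttft_ms, new_tokens, decode_tps=DECODE_TPS):
    """Rough total time: encode image, prefill, then decode new_tokens tokens.

    Assumes the three stages are strictly sequential and non-overlapping.
    """
    return encoder_ms + ttft_ms + new_tokens / decode_tps * 1000.0


if __name__ == "__main__":
    for size, encoder_ms, ttft_ms, tokens in ROWS:
        rate = prefill_tokens_per_sec(tokens, ttft_ms)
        total = estimated_latency_ms(encoder_ms, ttft_ms, new_tokens=100)
        print(f"{size}: prefill {rate:.1f} tokens/s, "
              f"about {total / 1000.0:.2f} s for a 100-token answer")
```

Under these assumptions, decoding dominates the total time for answers of more than a few dozen tokens, so the smaller input size mainly helps time to first token.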
How to use
Download all files from this repository to the device
$ tree -L 1
.
├── fastvlm_ax650_context_1k_prefill_640
├── fastvlm_tokenizer
├── images
├── infer_axmodel.py
├── README.md
└── utils
5 directories, 2 files
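Before running inference, it can help to confirm that the download is complete. The following is a small sketch that checks for the entries the inference command relies on. The list of expected names is taken from the `tree` output above; the helper `missing_entries` is hypothetical and not part of this repository.

```python
from pathlib import Path

# Entries used by the inference command, as listed in the tree output above
EXPECTED_ENTRIES = (
    "fastvlm_ax650_context_1k_prefill_640",
    "fastvlm_tokenizer",
    "images",
    "infer_axmodel.py",
    "utils",
)


def missing_entries(root="."):
    """Return the expected files or directories that are absent under root."""
    root = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_entries(".")
    if missing:
        print("Missing entries:", ", ".join(missing))
    else:
        print("All expected entries are present.")
```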
Install transformers
pip install -r requirements.txt
Inference on an AX650 host, such as the M4N-Dock (爱芯派Pro) or the AX650N DEMO Board
Run the following command on the Axera board to start a chat conversation:
$ python infer_axmodel.py -v ./fastvlm_ax650_context_1k_prefill_640/image_encoder_1024x1024.axmodel -m ./fastvlm_ax650_context_1k_prefill_640 -t ./fastvlm_tokenizer/ -i 1024
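The command above can also be assembled programmatically, which makes it easier to switch the image size. This is a sketch under stated assumptions: the flags `-v`, `-m`, `-t`, and `-i` are copied from the command above, while the helper `build_command` is hypothetical. In particular, the idea that a 512x512 encoder file follows the same `image_encoder_<size>x<size>.axmodel` naming pattern is a guess based on the benchmark table, so check the model directory before relying on it.

```python
import shlex
from pathlib import Path


def build_command(root=".", image_size=1024):
    """Build the infer_axmodel.py argument list for a square input size.

    The encoder file name pattern is confirmed here only for 1024; other
    sizes are an assumption.
    """
    root = Path(root)
    model_dir = root / "fastvlm_ax650_context_1k_prefill_640"
    encoder = model_dir / f"image_encoder_{image_size}x{image_size}.axmodel"
    return [
        "python", "infer_axmodel.py",
        "-v", str(encoder),
        "-m", str(model_dir),
        "-t", str(root / "fastvlm_tokenizer"),
        "-i", str(image_size),
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first
    print(shlex.join(build_command(".", 1024)))
```

To actually launch the script from Python, the returned list can be passed to `subprocess.run(cmd, check=True)` on the board.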
output:
[INFO] Available providers: ['AXCLRTExecutionProvider']
Loading config, tokenizer and init model.
Detected prefixes: ['llava_qwen2'], chosen: llava_qwen2, layers: 28
Init InferenceSession: 0%| | 0/28 [00:00<?, ?it/s][INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 5.1-patch1-dirty 140e8d4a-dirty
Init InferenceSession: 4%|████ | 1/28 [00:01<00:28, 1.05s/it][INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 5.1-patch1-dirty 140e8d4a-dirty
Init InferenceSession: 7%|████████▏ | 2/28 [00:01<00:21, 1.20it/s][INFO] Using provider: AXCLRTExecutionProvider
...
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 5.1-patch1-dirty 140e8d4a-dirty
Init InferenceSession: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:19<00:00, 1.43it/s]
[INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 5.1-patch1-dirty 140e8d4a-dirty
Model loaded successfully!
[INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 5.1-patch1-dirty 140e8d4a-dirty
[INFO]: Enter text to chat, enter an image path for image understanding, or enter q to quit.
prompt<<who are you
slice_indices: [0]
Slice prefill done: 0
answer >> I am an artificial intelligence designed and developed by Apple Inc. I am a natural language processing model that can understand and respond to user input in a conversational manner. I can answer questions, provide information, and engage in discussions on a wide range of topics. I am designed to be helpful, informative, and friendly, and I am constantly learning and improving to provide the best possible experience for users.
prompt<<./images/ssd_horse.jpg
slice_indices: [0, 1, 2]
Slice prefill done: 0
Slice prefill done: 1
Slice prefill done: 2
answer >> The image depicts a serene outdoor scene featuring a person riding a brown horse with a white blaze on its face. The rider, who has short brown hair, is wearing a blue hoodie, blue jeans, and black boots. The horse is equipped with a saddle and a bridle, and it stands on a dirt ground.
In the foreground, a brown dog with a pink collar is sitting on the ground, looking up at the rider with its mouth open, possibly in anticipation or excitement.
In the background, there is a silver pickup truck parked near a fence, and beyond the fence, there are trees and a few people sitting on a bench. The sky is overcast, suggesting a cloudy day. The overall atmosphere of the image is calm and peaceful, capturing a moment of connection between the rider, the horse, and the dog.
prompt<<./images/image_1.jpg
slice_indices: [0, 1, 2]
Slice prefill done: 0
Slice prefill done: 1
Slice prefill done: 2
answer >> The image depicts a panda bear in a natural setting. The panda is sitting on the ground, surrounded by green bamboo leaves and plants. The panda has a distinctive black and white fur pattern, with black patches around its eyes, ears, and limbs, and a white face and body. The panda appears to be holding a bamboo leaf in its mouth, which is a common food source for pandas. The background includes a wooden structure, possibly a part of a bamboo enclosure, and some rocks. The overall scene suggests that the panda is in a zoo or a wildlife sanctuary.
prompt<<q
[INFO]: Conversation ended, goodbye.
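The session above shows three kinds of input at the prompt: plain text, an image path, and `q` to quit. One way such a dispatch could work is sketched below. This is an illustration only and not the logic of `infer_axmodel.py`, which is not shown in this README; the function `classify_input` and the list of image extensions are hypothetical.

```python
import os

# Hypothetical set of extensions treated as image paths
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp")


def classify_input(line, is_file=os.path.isfile):
    """Classify one prompt line as "quit", "image", or "text".

    A line counts as an image only if it ends with a known image extension
    and points to an existing file; anything else is treated as chat text.
    """
    text = line.strip()
    if text == "q":
        return "quit"
    if text.lower().endswith(IMAGE_EXTENSIONS) and is_file(text):
        return "image"
    return "text"


if __name__ == "__main__":
    for sample in ("who are you", "./images/ssd_horse.jpg", "q"):
        print(repr(sample), "->", classify_input(sample))
```

Checking that the file exists avoids sending a mistyped path to the image encoder; in this sketch a mistyped path falls back to being treated as ordinary text.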