This version of SmolVLM-256M-Instruct has been converted to run on the Axera NPU using w8a16 quantization.
This model has been optimized with the following LoRA:
Compatible with Pulsar2 version: 3.3
If you are interested in model conversion, you can export the axmodel yourself, starting from the original repo: https://huggingface.co./HuggingFaceTB/SmolVLM-256M-Instruct
Pulsar2 Link, How to Convert LLM from Huggingface to axmodel
Chips | Image encoder (512x512) | TTFT | Decode speed (w8a16) |
---|---|---|---|
AX650 | 105 ms | 57 ms | 80 tokens/sec |
AX630C | 800 ms | 182 ms | 31 tokens/sec |
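The figures in the table can be combined into a rough end-to-end estimate for a single image query. The sketch below is only back-of-the-envelope arithmetic: it assumes (the source does not say so) that TTFT covers the first generated token and that every later token arrives at the steady decode rate. The function name `estimate_latency_s` is made up for illustration.

```python
# Figures copied from the benchmark table above (w8a16 build).
BENCH = {
    "AX650": {"encode_ms": 105.0, "ttft_ms": 57.0, "tokens_per_s": 80.0},
    "AX630C": {"encode_ms": 800.0, "ttft_ms": 182.0, "tokens_per_s": 31.0},
}


def estimate_latency_s(chip, new_tokens):
    """Rough total time: image encode + TTFT + remaining decode steps.

    Assumption: TTFT already includes the first token, so only
    (new_tokens - 1) tokens are charged at the decode rate.
    """
    b = BENCH[chip]
    decode_s = max(new_tokens - 1, 0) / b["tokens_per_s"]
    return (b["encode_ms"] + b["ttft_ms"]) / 1000.0 + decode_s


for chip in BENCH:
    print(f"{chip}: ~{estimate_latency_s(chip, 200):.2f} s for 200 tokens")
```

Real runs will differ with prompt length and answer length, so treat the result as an order-of-magnitude guide rather than a measurement.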
Download all files from this repository to the device
root@ax650:/mnt/qtang/llm-test/smolvlm-256m # tree -L 1
.
├── main
├── post_config.json
├── run_smolvlm_ax630c.sh
├── run_smolvlm_ax650.sh
├── smolvlm-256m-ax630c
├── smolvlm-256m-ax650
├── smolvlm_tokenizer
├── smolvlm_tokenizer_512.py
└── ssd_car.jpg
pip install transformers==4.41.1
root@ax650:/mnt/qtang/llm-test/smolvlm-256m# python smolvlm_tokenizer_512.py --port 12345
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
1 <|im_start|> 49279 <end_of_utterance>
[1, 11126, 42, 49189, 49152, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190,
49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190,
49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190,
49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190,
49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49189, 7306, 346, 5125, 451, 2443, 47, 49279,
198, 9519, 9531, 42]
81
[1, 11126, 42, 28120, 905, 49279, 198, 9519, 9531, 42]
10
http://localhost:12345
Describe the picture
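The first id list printed above is the prompt with an image, and the second is the same prompt without one. The sketch below rebuilds the image prompt from the printed ids to show where the 81 tokens come from. The role names given to ids 49189 and 49190 are inferred from their position in the list, and the hidden width of 576 is an assumption about the SmolLM2-135M base model; neither is stated in the source.

```python
from typing import List, Tuple

# Ids copied from the tokenizer output above; the role names are inferred.
IMAGE_MARKER_ID = 49189  # appears right before and right after the image span
IMAGE_PAD_ID = 49190     # repeated placeholder, presumably one per image embedding
HIDDEN_SIZE = 576        # assumption: hidden width of the SmolLM2-135M base model

PREFIX = [1, 11126, 42, IMAGE_MARKER_ID, 49152]
SUFFIX = [IMAGE_MARKER_ID, 7306, 346, 5125, 451, 2443, 47, 49279, 198, 9519, 9531, 42]


def build_image_prompt(num_image_tokens: int) -> List[int]:
    """Prefix + N placeholder ids + suffix, mirroring the printed list."""
    return PREFIX + [IMAGE_PAD_ID] * num_image_tokens + SUFFIX


def image_span(ids: List[int]) -> Tuple[int, int]:
    """Return (start index, length) of the first run of placeholder ids."""
    start = ids.index(IMAGE_PAD_ID)
    length = 0
    while start + length < len(ids) and ids[start + length] == IMAGE_PAD_ID:
        length += 1
    return start, length


ids = build_image_prompt(64)
start, count = image_span(ids)
print(len(ids), start, count, count * HIDDEN_SIZE)  # prints: 81 5 64 36864
```

Under these assumptions, 64 placeholders times 576 values gives 36864, the same number the runtime later reports as the image encode `size`. That agreement is an observation about this one run, not a documented guarantee.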
Open another terminal and run ./run_smolvlm_ax650.sh
root@ax650:/mnt/qtang/llm-test/smolvlm-256m# ./run_smolvlm_ax650.sh
[I][ Init][ 106]: LLM init start
bos_id: 1, eos_id: 49279
2% | █ | 1 / 34 [0.00s<0.14s, 250.00 count/s] tokenizer init ok
[I][ Init][ 26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 34 / 34 [0.67s<0.67s, 50.90 count/s] init vpm axmodel ok,remain_cmm(11698 MB)
[I][ Init][ 254]: max_token_len : 1023
[I][ Init][ 259]: kv_cache_size : 192, kv_cache_num: 1023
[I][ Init][ 267]: prefill_token_num : 128
[I][ Init][ 269]: vpm_height : 512,vpm_width : 512
[I][ Init][ 279]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> Describe the picture
image >> ./ssd_car.jpg
[I][ Encode][ 338]: image encode time : 104.691002 ms, size : 36864
[I][ Run][ 549]: ttft: 58.01 ms
The image depicts a double decker bus, which is prominently displayed in the center of the image. The bus is red and has a large, bold sign on its roof that reads
"Things Get More Exciting When You Say So." The sign is in white text, and the bus is designed to be eye-catching and visually appealing.
The bus is parked on a city street, with a few other vehicles visible in the background. The street is lined with buildings, including a few shops and restaurants,
which are partially visible. The buildings are well-lit, and the street is clean and well-maintained.
In the foreground, there is a person standing in front of the bus. The person is wearing a dark jacket and appears to be waiting for the bus. The person is facing the bus,
and they seem to be waiting for the bus to arrive.
The bus is parked on the street, and it is not moving. The bus is not moving, and there are no other vehicles visible in the image. The street is well-maintained,
and the buildings are well-lit, indicating that it is a sunny day.
The image is taken from a slightly elevated perspective, which gives a clear view of the bus and the surrounding area. The lighting in the image is bright,
and the shadows are well-defined, indicating that the sun is shining brightly.
To summarize, the image depicts:
1. A double-decker bus with a large sign on its roof that reads "Things Get More Exciting When You Say So."
2. The bus is parked on a city street with a few other vehicles visible in the background.
3. The bus is not moving, and there are no other vehicles visible in the image.
4. The street is well-maintained, and the buildings are well-lit, indicating a sunny day.
This description provides a comprehensive overview of the image, allowing a text model to answer any questions related to the image based on the description.
[N][ Run][ 688]: hit eos,avg 80.54 token/s
prompt >> q
root@ax650:/mnt/qtang/llm-test/smolvlm-256m#
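To compare runs across boards, it can help to pull the three timing figures out of a saved log. The following is a minimal sketch that matches only the line formats shown in the session above. The helper name `parse_metrics` is hypothetical, and other runtime builds may print these lines differently.

```python
import re

# Patterns follow the log lines shown above; treat them as assumptions
# about the format rather than a stable interface.
PATTERNS = {
    "image_encode_ms": re.compile(r"image encode time\s*:\s*([\d.]+)\s*ms"),
    "ttft_ms": re.compile(r"ttft:\s*([\d.]+)\s*ms"),
    "tokens_per_s": re.compile(r"avg\s+([\d.]+)\s*token/s"),
}


def parse_metrics(log_text):
    """Return the first match of each known metric as a float."""
    metrics = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(log_text)
        if match:
            metrics[name] = float(match.group(1))
    return metrics


sample = """\
[I][ Encode][ 338]: image encode time : 104.691002 ms, size : 36864
[I][ Run][ 549]: ttft: 58.01 ms
[N][ Run][ 688]: hit eos,avg 80.54 token/s
"""
print(parse_metrics(sample))
```

Metrics that are missing from the log are simply left out of the returned dictionary, so a partial run does not raise an error.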
What is the M.2 Accelerator card? This demo is shown running on a Raspberry Pi 5.
TODO
Base model
HuggingFaceTB/SmolLM2-135M