SmolVLM-256M-Instruct-Int8

This version of SmolVLM-256M-Instruct has been converted to run on the Axera NPU using w8a16 quantization.

This model has been optimized with the following LoRA:

Compatible with Pulsar2 version: 3.3

Conversion tool links:

If you are interested in model conversion, you can try exporting the axmodel from the original repo: https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct

Pulsar2 Link, How to Convert LLM from Huggingface to axmodel

AXera NPU HOST LLM Runtime

Supported Platforms

AX650
- AX650N DEMO Board
- M4N-Dock(爱芯派Pro)
- M.2 Accelerator card
AX630C

Chip	Image encoder (512x512 input)	TTFT (time to first token)	Decode speed (w8a16)
AX650	105 ms	57 ms	80 tokens/sec
AX630C	800 ms	182 ms	31 tokens/sec
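The figures in the table can be combined into a rough end-to-end latency estimate. The sketch below assumes that image encoding, TTFT, and decoding simply add up, and that TTFT already covers the first generated token; neither assumption is confirmed by the source, and `estimate_response_ms` is a hypothetical helper name.

```python
# Figures copied from the table above; nothing here is measured independently.
BENCHMARKS = {
    "AX650": {"image_encode_ms": 105, "ttft_ms": 57, "tokens_per_sec": 80},
    "AX630C": {"image_encode_ms": 800, "ttft_ms": 182, "tokens_per_sec": 31},
}


def estimate_response_ms(chip, new_tokens):
    """Rough latency: image encode + TTFT + decode of the remaining tokens.

    Assumes the three stages are additive and that TTFT includes the
    first generated token (both are assumptions, not documented facts).
    """
    bench = BENCHMARKS[chip]
    decode_ms = max(new_tokens - 1, 0) * 1000.0 / bench["tokens_per_sec"]
    return bench["image_encode_ms"] + bench["ttft_ms"] + decode_ms


for chip in BENCHMARKS:
    print(chip, round(estimate_response_ms(chip, 100)), "ms for 100 tokens (estimate)")
```

Treat the result as a planning number only; real latency depends on prompt length and system load.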

How to use

Download all files from this repository to the device

root@ax650:/mnt/qtang/llm-test/smolvlm-256m # tree -L 1
.
├── main
├── post_config.json
├── run_smolvlm_ax630c.sh
├── run_smolvlm_ax650.sh
├── smolvlm-256m-ax630c
├── smolvlm-256m-ax650
├── smolvlm_tokenizer
├── smolvlm_tokenizer_512.py
└── ssd_car.jpg
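After copying the repository to the device, a quick check that every entry from the listing above is present can save a failed launch. This is a minimal sketch; `missing_entries` is a hypothetical helper, and the expected names are taken directly from the `tree` output.

```python
from pathlib import Path
from typing import List

# Entry names copied from the `tree -L 1` listing above.
EXPECTED = [
    "main",
    "post_config.json",
    "run_smolvlm_ax630c.sh",
    "run_smolvlm_ax650.sh",
    "smolvlm-256m-ax630c",
    "smolvlm-256m-ax650",
    "smolvlm_tokenizer",
    "smolvlm_tokenizer_512.py",
    "ssd_car.jpg",
]


def missing_entries(root: str) -> List[str]:
    """Return the expected files or directories that are absent under root."""
    base = Path(root)
    return [name for name in EXPECTED if not (base / name).exists()]


if __name__ == "__main__":
    missing = missing_entries(".")
    print("all files present" if not missing else "missing: {}".format(missing))
```

Run it from the directory that holds the downloaded files.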

Install transformers

pip install transformers==4.41.1

Start the Tokenizer service

root@ax650:/mnt/qtang/llm-test/smolvlm-256m# python smolvlm_tokenizer_512.py --port 12345
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
1 <|im_start|> 49279 <end_of_utterance>
[1, 11126, 42, 49189, 49152, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190,
49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190,
49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190,
49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190,
49190, 49190, 49190, 49190, 49190, 49190, 49190, 49190, 49189, 7306, 346, 5125, 451, 2443, 47, 49279,
198, 9519, 9531, 42]
81
[1, 11126, 42, 28120, 905, 49279, 198, 9519, 9531, 42]
10
http://localhost:12345
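The two token lists printed at startup appear to be a prompt with an image and a text-only prompt. The sketch below rebuilds the first list compactly and counts its parts. Reading ID 49190 as an image placeholder and 49189 as a marker around the placeholder run is an interpretation of the output, not something the source states.

```python
from collections import Counter

# Interpretation of the log above (assumption): 49190 marks one image slot,
# and 49189 brackets the run of image slots.
IMAGE_PLACEHOLDER_ID = 49190
IMAGE_MARKER_ID = 49189

# First token list from the log, rebuilt as prefix + 64 placeholders + suffix.
prefix = [1, 11126, 42, 49189, 49152]
suffix = [49189, 7306, 346, 5125, 451, 2443, 47, 49279, 198, 9519, 9531, 42]
image_prompt_ids = prefix + [IMAGE_PLACEHOLDER_ID] * 64 + suffix

# Second token list from the log (no placeholder IDs).
text_prompt_ids = [1, 11126, 42, 28120, 905, 49279, 198, 9519, 9531, 42]

counts = Counter(image_prompt_ids)
print("image prompt length:", len(image_prompt_ids))
print("image placeholders:", counts[IMAGE_PLACEHOLDER_ID])
print("other tokens:", len(image_prompt_ids) - counts[IMAGE_PLACEHOLDER_ID])
print("text prompt length:", len(text_prompt_ids))
```

The lengths agree with the `81` and `10` printed by the service. The runtime log later reports an image encode size of 36864, which equals 64 x 576; that would fit 576 values per placeholder, though the source does not say so.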

Inference with AX650 Host, such as M4N-Dock(爱芯派Pro) or AX650N DEMO Board

input text

Describe the picture

input image: ssd_car.jpg (included in this repository)

Open another terminal and run ./run_smolvlm_ax650.sh
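Before launching the script, it can help to confirm that the tokenizer service is accepting connections. The sketch below only tests that the TCP port is open; it makes no assumption about the service's HTTP routes, which the source does not document. `tokenizer_port_open` is a hypothetical helper, and the host and port defaults come from the tokenizer output above.

```python
import socket


def tokenizer_port_open(host="localhost", port=12345, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    print("tokenizer service reachable:", tokenizer_port_open())
```

If this prints `False`, check that `smolvlm_tokenizer_512.py` is still running in the first terminal.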

root@ax650:/mnt/qtang/llm-test/smolvlm-256m# ./run_smolvlm_ax650.sh
[I][                            Init][ 106]: LLM init start
bos_id: 1, eos_id: 49279
  2% | █                                 |   1 /  34 [0.00s<0.14s, 250.00 count/s] tokenizer init ok
[I][                            Init][  26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  34 /  34 [0.67s<0.67s, 50.90 count/s] init vpm axmodel ok,remain_cmm(11698 MB)
[I][                            Init][ 254]: max_token_len : 1023
[I][                            Init][ 259]: kv_cache_size : 192, kv_cache_num: 1023
[I][                            Init][ 267]: prefill_token_num : 128
[I][                            Init][ 269]: vpm_height : 512,vpm_width : 512
[I][                            Init][ 279]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> Describe the picture
image >> ./ssd_car.jpg
[I][                          Encode][ 338]: image encode time : 104.691002 ms, size : 36864
[I][                             Run][ 549]: ttft: 58.01 ms
 The image depicts a double decker bus, which is prominently displayed in the center of the image. The bus is red and has a large, bold sign on its roof that reads
"Things Get More Exciting When You Say So." The sign is in white text, and the bus is designed to be eye-catching and visually appealing.

The bus is parked on a city street, with a few other vehicles visible in the background. The street is lined with buildings, including a few shops and restaurants,
which are partially visible. The buildings are well-lit, and the street is clean and well-maintained.

In the foreground, there is a person standing in front of the bus. The person is wearing a dark jacket and appears to be waiting for the bus. The person is facing the bus,
and they seem to be waiting for the bus to arrive.

The bus is parked on the street, and it is not moving. The bus is not moving, and there are no other vehicles visible in the image. The street is well-maintained,
and the buildings are well-lit, indicating that it is a sunny day.

The image is taken from a slightly elevated perspective, which gives a clear view of the bus and the surrounding area. The lighting in the image is bright,
and the shadows are well-defined, indicating that the sun is shining brightly.

To summarize, the image depicts:
1. A double-decker bus with a large sign on its roof that reads "Things Get More Exciting When You Say So."
2. The bus is parked on a city street with a few other vehicles visible in the background.
3. The bus is not moving, and there are no other vehicles visible in the image.
4. The street is well-maintained, and the buildings are well-lit, indicating a sunny day.

This description provides a comprehensive overview of the image, allowing a text model to answer any questions related to the image based on the description.

[N][                             Run][ 688]: hit eos,avg 80.54 token/s

prompt >> q
root@ax650:/mnt/qtang/llm-test/smolvlm-256m#
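To compare runs, the timing lines in the session above can be pulled out with a few regular expressions. This is a minimal sketch: the patterns are written against the three log lines shown here, and other runtime versions may format them differently. `parse_metrics` is a hypothetical helper.

```python
import re

# Timing lines copied from the session above.
SAMPLE_LOG = """\
[I][                          Encode][ 338]: image encode time : 104.691002 ms, size : 36864
[I][                             Run][ 549]: ttft: 58.01 ms
[N][                             Run][ 688]: hit eos,avg 80.54 token/s
"""

# One pattern per metric; each captures a single decimal number.
PATTERNS = {
    "image_encode_ms": re.compile(r"image encode time\s*:\s*([\d.]+)\s*ms"),
    "ttft_ms": re.compile(r"ttft:\s*([\d.]+)\s*ms"),
    "tokens_per_sec": re.compile(r"avg\s+([\d.]+)\s*token/s"),
}


def parse_metrics(log_text):
    """Extract the timing figures that appear in log_text, skipping absent ones."""
    metrics = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(log_text)
        if match:
            metrics[name] = float(match.group(1))
    return metrics


print(parse_metrics(SAMPLE_LOG))
```

To use it on a real run, capture the script's output to a file and pass the file contents to `parse_metrics`.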

Inference with M.2 Accelerator card

What is the M.2 Accelerator card? This demo is based on Raspberry Pi 5.

TODO
