---
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
- OpenGVLab/InternVL2_5-8B
- OpenGVLab/InternViT-300M-448px-V2_5
- internlm/internlm2_5-7b-chat
base_model_relation: merge
language:
- multilingual
tags:
- Sa2VA
- custom_code
---

# Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

[\[πŸ“‚ GitHub\]](https://github.com/lxtGH/Sa2VA_opensource)
[\[πŸ“œ Sa2VA paper\]]()
[\[πŸš€ Quick Start\]](#quick-start)

## Introduction

Sa2VA is an MLLM capable of question answering, visual prompt understanding, and dense object segmentation at both the image and video level. On question-answering benchmarks it performs comparably to SOTA MLLMs such as Qwen2-VL and InternVL2.5, while also providing the visual prompt understanding and dense object segmentation capabilities that those models lack. Sa2VA achieves SOTA performance on both image and video grounding and segmentation benchmarks.

## Sa2VA Family

We built the Sa2VA series on Qwen2-VL and InternVL2/2.5. The table below lists the Sa2VA models built on InternVL2.5; other Sa2VA models will be open-sourced soon.

| Model Name | Base MLLM | Language Part | HF Link |
|:----------:|:-----------------------------------------------------------------:|:-----------------------------------------------------------------------------:|:----------------------------------------------------:|
| Sa2VA-1B | [InternVL2.5-1B](https://huggingface.co/OpenGVLab/InternVL2_5-1B) | [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) | [πŸ€— link](https://huggingface.co/ByteDance/Sa2VA-1B) |
| Sa2VA-4B | [InternVL2.5-4B](https://huggingface.co/OpenGVLab/InternVL2_5-4B) | [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) | [πŸ€— link](https://huggingface.co/ByteDance/Sa2VA-4B) |
| Sa2VA-8B | [InternVL2.5-8B](https://huggingface.co/OpenGVLab/InternVL2_5-8B) | [internlm2_5-7b-chat](https://huggingface.co/internlm/internlm2_5-7b-chat) | [πŸ€— link](https://huggingface.co/ByteDance/Sa2VA-8B) |

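Any checkpoint in the table can also be downloaded ahead of time. The snippet below is an optional sketch using `huggingface_hub.snapshot_download`; the `local_dir` value is a placeholder, and letting `from_pretrained` fetch the repo on first use (as in the Quick Start below) works just as well.

```python
# Optional: pre-download a Sa2VA checkpoint from the Hugging Face Hub.
# The repo_id can be any HF link from the table above; local_dir is a placeholder.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="ByteDance/Sa2VA-8B",
    local_dir="./checkpoints/Sa2VA-8B",
)
print(f"Checkpoint downloaded to {local_path}")
```
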
## Quick Start

We provide example code for running `Sa2VA` with `transformers`.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from PIL import Image
import numpy as np
import os

# load the model and tokenizer
path = "ByteDance/Sa2VA-4B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# for image chat
image_path = "/PATH/TO/IMAGE"
text_prompts = "Please describe the image."
image = Image.open(image_path).convert('RGB')
input_dict = {
    'image': image,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"]  # the text-format answer

# for image chat with segmentation output
image_path = "/PATH/TO/IMAGE"
text_prompts = "Could you please give me a brief description of the image? Please respond with interleaved segmentation masks for the corresponding parts of the answer."
image = Image.open(image_path).convert('RGB')
input_dict = {
    'image': image,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"]  # the text-format answer
masks = return_dict['prediction_masks']  # segmentation masks, list(np.array(1, h, w), ...)

# for chat with a visual prompt (mask format) as input
mask_prompts = np.load('/PATH/TO/pred_masks.npy')  # np.array(n_prompts, h, w)
image_path = "/PATH/TO/IMAGE"
text_prompts = "Can you provide me with a detailed description of the region in the picture marked by region1."
image = Image.open(image_path).convert('RGB')
input_dict = {
    'image': image,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': mask_prompts,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"]  # the text-format answer

# for video chat
video_folder = "/PATH/TO/VIDEO_FOLDER"
images_paths = sorted(os.listdir(video_folder))  # sort to keep frames in temporal order
images_paths = [os.path.join(video_folder, image_name) for image_name in images_paths]
if len(images_paths) > 5:  # uniformly sample 5 frames
    step = (len(images_paths) - 1) // (5 - 1)
    images_paths = [images_paths[0]] + images_paths[1:-1][::step][1:] + [images_paths[-1]]
text_prompts = "Please describe the video."
input_dict = {
    'video': images_paths,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"]  # the text-format answer

# for video chat with segmentation mask output
video_folder = "/PATH/TO/VIDEO_FOLDER"
images_paths = sorted(os.listdir(video_folder))  # sort to keep frames in temporal order
images_paths = [os.path.join(video_folder, image_name) for image_name in images_paths]
text_prompts = "Please segment the person."
input_dict = {
    'video': images_paths,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"]  # the text-format answer
masks = return_dict['prediction_masks']  # segmentation masks, list(np.array(n_frames, h, w), ...)
```
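
The `prediction_masks` returned above are binary NumPy arrays, one entry per segmented object. As an illustrative post-processing sketch (not part of the official API), the snippet below overlays the first predicted image mask on the input image; it assumes `image` and `masks` come from the "image chat with segmentation output" example, and the output file name is a placeholder.

```python
# Hedged sketch: blend the first predicted mask into the RGB image as a red overlay.
# Assumes `image` (PIL.Image) and `masks` (list of np.array with shape (1, h, w))
# were produced by the segmentation example above.
import numpy as np
from PIL import Image

mask = masks[0][0].astype(bool)                                  # (h, w) mask of the first object
canvas = np.array(image.resize((mask.shape[1], mask.shape[0])))  # match the mask resolution
canvas[mask] = (0.5 * canvas[mask] + 0.5 * np.array([255, 0, 0])).astype(np.uint8)
Image.fromarray(canvas).save("mask_overlay.png")                 # placeholder output path
```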

## Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@article{sa2va,
  title={Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos},
  author={Yuan, Haobo and Li, Xiangtai and Zhang, Tao and Huang, Zilong and Xu, Shilin and Ji, Shunping and Tong, Yunhai and Qi, Lu and Feng, Jiashi and Yang, Ming-Hsuan},
  journal={arXiv preprint},
  year={2025}
}
```