---
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
- OpenGVLab/InternVL2_5-8B
- OpenGVLab/InternViT-300M-448px-V2_5
- internlm/internlm2_5-7b-chat
base_model_relation: merge
language:
- multilingual
tags:
- Sa2VA
- custom_code
---

# Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

[\[πŸ“‚ GitHub\]](https://github.com/lxtGH/Sa2VA_opensource)
[\[πŸ“œ Sa2VA paper\]]()
[\[πŸš€ Quick Start\]](#quick-start)

## Introduction

Sa2VA is an MLLM capable of question answering, visual prompt understanding, and dense object segmentation at both the image and video level. On question-answering benchmarks it performs comparably to SOTA MLLMs such as Qwen2-VL and InternVL2.5, while also providing the visual prompt understanding and dense object segmentation capabilities that those models lack. Sa2VA achieves SOTA performance on both image and video grounding and segmentation benchmarks.

## Sa2VA Family

We built the Sa2VA series on Qwen2-VL and InternVL2/2.5. The table below lists the Sa2VA models built on InternVL2.5; other Sa2VA models will be open-sourced soon.

| Model Name | Base MLLM | Language Part | HF Link |
|:----------:|:-----------------------------------------------------------------:|:-----------------------------------------------------------------------------:|:----------------------------------------------------:|
| Sa2VA-1B | [InternVL2.5-1B](https://huggingface.co/OpenGVLab/InternVL2_5-1B) | [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) | [πŸ€— link](https://huggingface.co/ByteDance/Sa2VA-1B) |
| Sa2VA-4B | [InternVL2.5-4B](https://huggingface.co/OpenGVLab/InternVL2_5-4B) | [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) | [πŸ€— link](https://huggingface.co/ByteDance/Sa2VA-4B) |
| Sa2VA-8B | [InternVL2.5-8B](https://huggingface.co/OpenGVLab/InternVL2_5-8B) | [internlm2_5-7b-chat](https://huggingface.co/internlm/internlm2_5-7b-chat) | [πŸ€— link](https://huggingface.co/ByteDance/Sa2VA-8B) |

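Any checkpoint in the table can also be downloaded ahead of time. The snippet below is an optional sketch using `huggingface_hub.snapshot_download`; the `local_dir` value is a placeholder, and letting `from_pretrained` fetch the repo on first use (as in the Quick Start below) works just as well.

```python
# Optional: pre-download a Sa2VA checkpoint from the Hugging Face Hub.
# The repo_id can be any HF link from the table above; local_dir is a placeholder.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="ByteDance/Sa2VA-8B",
    local_dir="./checkpoints/Sa2VA-8B",
)
print(f"Checkpoint downloaded to {local_path}")
```
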
## Quick Start

We provide example code for running `Sa2VA` with `transformers`.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from PIL import Image
import numpy as np
import os

# load the model and tokenizer
path = "ByteDance/Sa2VA-4B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# for image chat
image_path = "/PATH/TO/IMAGE"
text_prompts = "Please describe the image."
image = Image.open(image_path).convert('RGB')
input_dict = {
    'image': image,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"]  # the text-format answer

# for image chat with segmentation output
image_path = "/PATH/TO/IMAGE"
text_prompts = "Could you please give me a brief description of the image? Please respond with interleaved segmentation masks for the corresponding parts of the answer."
image = Image.open(image_path).convert('RGB')
input_dict = {
    'image': image,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"]  # the text-format answer
masks = return_dict['prediction_masks']  # segmentation masks, list(np.array(1, h, w), ...)

# for chat with a visual prompt (mask format) as input
mask_prompts = np.load('/PATH/TO/pred_masks.npy')  # np.array(n_prompts, h, w)
image_path = "/PATH/TO/IMAGE"
text_prompts = "Can you provide me with a detailed description of the region in the picture marked by region1."
image = Image.open(image_path).convert('RGB')
input_dict = {
    'image': image,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': mask_prompts,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"]  # the text-format answer

# for video chat
video_folder = "/PATH/TO/VIDEO_FOLDER"
images_paths = sorted(os.listdir(video_folder))  # sort to keep frames in temporal order
images_paths = [os.path.join(video_folder, image_name) for image_name in images_paths]
if len(images_paths) > 5:  # uniformly sample 5 frames
    step = (len(images_paths) - 1) // (5 - 1)
    images_paths = [images_paths[0]] + images_paths[1:-1][::step][1:] + [images_paths[-1]]
text_prompts = "Please describe the video."
input_dict = {
    'video': images_paths,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"]  # the text-format answer

# for video chat with segmentation mask output
video_folder = "/PATH/TO/VIDEO_FOLDER"
images_paths = sorted(os.listdir(video_folder))  # sort to keep frames in temporal order
images_paths = [os.path.join(video_folder, image_name) for image_name in images_paths]
text_prompts = "Please segment the person."
input_dict = {
    'video': images_paths,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"]  # the text-format answer
masks = return_dict['prediction_masks']  # segmentation masks, list(np.array(n_frames, h, w), ...)
```
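
The `prediction_masks` returned above are binary NumPy arrays, one entry per segmented object. As an illustrative post-processing sketch (not part of the official API), the snippet below overlays the first predicted image mask on the input image; it assumes `image` and `masks` come from the "image chat with segmentation output" example, and the output file name is a placeholder.

```python
# Hedged sketch: blend the first predicted mask into the RGB image as a red overlay.
# Assumes `image` (PIL.Image) and `masks` (list of np.array with shape (1, h, w))
# were produced by the segmentation example above.
import numpy as np
from PIL import Image

mask = masks[0][0].astype(bool)                                  # (h, w) mask of the first object
canvas = np.array(image.resize((mask.shape[1], mask.shape[0])))  # match the mask resolution
canvas[mask] = (0.5 * canvas[mask] + 0.5 * np.array([255, 0, 0])).astype(np.uint8)
Image.fromarray(canvas).save("mask_overlay.png")                 # placeholder output path
```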

## Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@article{sa2va,
  title={Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos},
  author={Yuan, Haobo and Li, Xiangtai and Zhang, Tao and Huang, Zilong and Xu, Shilin and Ji, Shunping and Tong, Yunhai and Qi, Lu and Feng, Jiashi and Yang, Ming-Hsuan},
  journal={arXiv preprint},
  year={2025}
}
```