metadata

license: apache-2.0
base_model:
  - liuhaotian/llava-v1.5-7b

LISA++ (LISA_Plus_7b): An Improved Baseline for Reasoning Segmentation with Large Language Model

🤗Data | 📄Paper

Model Card for LISA++ (LISA_Plus_7b)

Model Details

Developed by: Senqiao Yang, The Chinese University of Hong Kong & SmartMore
Model Type: Large Vision-Language Model (VLM) for reasoning segmentation
Language(s): Supports natural language queries in English
License: Apache 2.0
Base Model: Finetuned from liuhaotian/llava-v1.5-7b

Model Description

LISA++ (LISA_Plus_7b) is an improved baseline for reasoning segmentation with large language models. It extends its predecessor, LISA, with instance segmentation and with more natural, multi-turn dialogue through Segmentation in Dialogue (SiD). These improvements require no changes to the model architecture and no additional data sources; instead, they rely on curated samples from existing segmentation datasets.
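To make the instance/semantic distinction concrete, here is a toy sketch in plain Python. The masks and the `merge_instances` helper are hypothetical illustrations, not part of LISA++. Instance segmentation returns one binary mask per object, while semantic segmentation collapses all objects of a category into a single mask and so loses the per-object split.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 = pixel belongs to the object


def merge_instances(instance_masks: List[Mask]) -> Mask:
    """Collapse per-instance binary masks into one semantic (per-category) mask."""
    height, width = len(instance_masks[0]), len(instance_masks[0][0])
    semantic = [[0] * width for _ in range(height)]
    for mask in instance_masks:
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    semantic[y][x] = 1
    return semantic


# Two hypothetical "cat" instances on a 4x4 image.
cat_1 = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
cat_2 = [[0, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]

instances = [cat_1, cat_2]             # instance view: each cat separately
semantic = merge_instances(instances)  # semantic view: just "cat" pixels

print(len(instances))           # 2 separate objects
print(sum(map(sum, semantic)))  # 8 "cat" pixels, but no per-object split
```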

Key Enhancements:

Instance Segmentation: Distinguishes individual instances of the same category, enabling more detailed scene analysis alongside the existing multi-region semantic segmentation.
Segmentation in Dialogue (SiD): Improves multi-turn dialogue by letting the model incorporate segmentation results directly into its text responses, making conversations more natural and flexible.
Refined Data Curation: Uses datasets like COCO and ADE20K to improve segmentation and dialogue integration.
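This card does not document the exact response format of LISA++. As a hedged sketch, assume each mask is referenced in the text by a placeholder token (written `[SEG]` here, following the segmentation-token idea of the original LISA) and that masks are returned in the same order as the tokens. A client could then interleave text and masks as below. `attach_masks` and the mask ids are hypothetical.

```python
from typing import Dict, List

SEG_TOKEN = "[SEG]"  # assumption: one placeholder token per returned mask


def attach_masks(response: str, mask_ids: List[str]) -> List[Dict[str, str]]:
    """Split a response at each [SEG] token and pair every token with a mask id.

    Returns an ordered list of text and mask segments, so a UI could render
    each mask inline at the point where the model refers to it.
    """
    pieces = response.split(SEG_TOKEN)
    if len(pieces) - 1 != len(mask_ids):
        raise ValueError("number of [SEG] tokens and masks must match")
    segments: List[Dict[str, str]] = []
    for i, text in enumerate(pieces):
        if text:
            segments.append({"type": "text", "value": text})
        if i < len(mask_ids):
            segments.append({"type": "mask", "value": mask_ids[i]})
    return segments


reply = "There are two cats: one on the sofa [SEG] and one on the floor [SEG]."
for seg in attach_masks(reply, ["mask_0", "mask_1"]):
    print(seg)
```

Keeping masks as ordered segments, rather than a separate list, preserves which phrase each mask belongs to across dialogue turns.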

Intended Uses & Limitations

Direct Use

Interactive image understanding and segmentation
Multi-turn reasoning about segmented objects in images
Visual question-answering with spatial awareness

Out-of-Scope Use

Real-time medical or security applications without further validation
Applications requiring precise 3D object segmentation

How to Use

As of now, the model is not available via the Hugging Face Inference API. To use locally:

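from huggingface_hub import snapshot_download

# LISA++ uses a custom LLaVA-based architecture (finetuned from
# liuhaotian/llava-v1.5-7b), not a standard transformers pipeline model, so
# pipeline("image-segmentation", ...) cannot load it or accept a text query.
# Download the checkpoint instead. The repo id below is an assumption;
# replace it with the model's actual Hugging Face path.
local_dir = snapshot_download(repo_id="Senqiao/LISA_Plus_7b")
print("Checkpoint downloaded to:", local_dir)

# Example inputs to pass to the inference code from the model repository:
image_path = "example.jpg"
query = "Highlight all the cats in the image."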

For further details, refer to the model repository.

Training Data

LISA++ is trained on curated samples from:

COCO Dataset: Common Objects in Context
ADE20K Dataset: Scene parsing dataset
Extended ReasonSeg Dataset: Enhanced for multi-target instance segmentation

The training data is structured to improve segmentation and dialogue capabilities.

Training Procedure

Base Model: Finetuned from liuhaotian/llava-v1.5-7b
Optimizer: [Specify optimizer, e.g., AdamW]
Training Data: Trained on the ReasonSeg-Inst and ReasonSeg-Sem datasets
Hardware: Trained on GPUs [Specify model, e.g., NVIDIA A100]
Loss Functions: Combination of segmentation and language modeling losses
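The card only states that the objective combines segmentation and language modeling losses. The sketch below shows one common way to build such a combination, assuming token-level cross-entropy for text plus a per-pixel binary cross-entropy and Dice loss for masks (the mask-loss pairing described for the original LISA). The weights `w_text`, `w_bce`, and `w_dice` are placeholders, not the values used to train LISA++.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Cross-entropy of one next-token prediction (language modeling loss)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


def bce(probs: Sequence[float], targets: Sequence[int], eps: float = 1e-7) -> float:
    """Mean per-pixel binary cross-entropy between predicted and true mask."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)


def dice(probs: Sequence[float], targets: Sequence[int], smooth: float = 1.0) -> float:
    """Dice loss (1 - soft overlap score); less sensitive to class imbalance."""
    inter = sum(p * t for p, t in zip(probs, targets))
    return 1 - (2 * inter + smooth) / (sum(probs) + sum(targets) + smooth)


def total_loss(token_logits, token_targets, mask_probs, mask_targets,
               w_text=1.0, w_bce=1.0, w_dice=1.0) -> float:
    """Weighted sum of text loss and mask loss (weights are placeholders)."""
    text = sum(cross_entropy(l, t) for l, t in zip(token_logits, token_targets))
    text /= len(token_targets)
    mask = w_bce * bce(mask_probs, mask_targets) + w_dice * dice(mask_probs, mask_targets)
    return w_text * text + mask


# Toy batch: two tokens over a 3-word vocabulary, one 4-pixel mask.
loss = total_loss(
    token_logits=[[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]],
    token_targets=[0, 1],
    mask_probs=[0.9, 0.8, 0.1, 0.2],
    mask_targets=[1, 1, 0, 0],
)
print(round(loss, 3))
```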

Evaluation Results

LISA++ significantly improves segmentation accuracy compared to its predecessor:

ReasonSeg-Inst (Instance Segmentation Performance):
- AP50: 34.1% (vs. 13.7% in LISA-7B)
- AP75: 22.1% (vs. 6.6% in LISA-7B)
- mAP: 21.5% (vs. 7.2% in LISA-7B)
ReasonSeg-Sem (Semantic Segmentation Performance):
- gIoU: 64.2% (vs. 53.6% in LISA)
- cIoU: 68.1% (vs. 52.3% in LISA)

These results highlight LISA++'s enhanced capabilities in both instance and semantic segmentation tasks.
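gIoU and cIoU aggregate the same per-image quantity differently. Following the definitions used for the original LISA ReasonSeg benchmark (assumed here to carry over), gIoU averages the per-image IoUs, while cIoU divides the total intersection by the total union across the dataset, so large objects weigh more. AP50 and AP75 count a predicted instance as correct when its IoU with a ground-truth instance reaches 0.5 or 0.75. A minimal pure-Python sketch with hypothetical flattened binary masks:

```python
from typing import List, Sequence, Tuple

Mask = Sequence[int]  # flattened binary mask


def intersection_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Pixel counts of the intersection and union of two binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def iou(pred: Mask, gt: Mask) -> float:
    inter, union = intersection_union(pred, gt)
    return inter / union if union else 1.0  # convention: two empty masks agree


def giou(preds: List[Mask], gts: List[Mask]) -> float:
    """Mean of per-image IoUs: every image counts equally."""
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(preds)


def ciou(preds: List[Mask], gts: List[Mask]) -> float:
    """Cumulative intersection over cumulative union: big objects dominate."""
    total_inter = total_union = 0
    for p, g in zip(preds, gts):
        inter, union = intersection_union(p, g)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


def is_true_positive(pred: Mask, gt: Mask, threshold: float) -> bool:
    """Match rule behind AP50 (threshold=0.5) and AP75 (threshold=0.75)."""
    return iou(pred, gt) >= threshold


# Image A: small object, perfect mask. Image B: large object, half covered.
preds = [[1, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 0]]
gts = [[1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1]]

print(giou(preds, gts))  # 0.75 = (1.0 + 0.5) / 2
print(ciou(preds, gts))  # 0.555... = (1 + 4) / (1 + 8)
print(is_true_positive(preds[1], gts[1], 0.5),
      is_true_positive(preds[1], gts[1], 0.75))  # True False
```

The same half-covered mask counts toward AP50 but not AP75, which is why AP75 drops more sharply for coarse masks.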

Bias, Risks, and Limitations

Bias: The model's behavior may reflect biases present in its training datasets (COCO, ADE20K).
Limitations: May struggle with unseen object categories or highly cluttered scenes.
Ethical Considerations: Users should verify outputs before deploying in critical applications.

Environmental Impact

Hardware Used: NVIDIA A100 GPUs (or equivalent)
Training Duration: [Specify training time, if available]
Estimated Carbon Emissions: [Estimate, if available]

Citation

If you use LISA_Plus_7b in your research, please cite:

@article{yang2024lisa++,
  title={LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model},
  author={Senqiao Yang},
  journal={arXiv preprint arXiv:2312.17240},
  year={2024}
}

Contact Information

For questions or feedback, contact:

Author: Senqiao Yang

This AI-generated model card provides an overview of LISA_Plus_7b's capabilities, training methodology, and evaluation metrics, reflecting the latest updates from the Hugging Face model repository and arXiv paper.