arXiv:2412.09875

Selective State Space Memory for Large Vision-Language Models

Published on Dec 13, 2024

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across a wide range of multimodal tasks. However, fine-tuning these models for domain-specific applications remains a computationally intensive challenge. This paper introduces State Space Memory Integration (SSMI), a novel approach for efficient fine-tuning of LVLMs. By integrating lightweight Mamba-based state space modules into the LVLM architecture, SSMI captures long-range dependencies and injects task-specific visual and sequential patterns effectively. Unlike traditional fine-tuning methods, SSMI requires only a fraction of the model's parameters to be updated, making it computationally efficient and scalable. Experiments on benchmark datasets, including COCO Captioning, VQA, and Flickr30k, demonstrate that SSMI achieves state-of-the-art performance while maintaining robustness and generalization capabilities. Comprehensive analysis further validates the advantages of SSMI in terms of efficiency, adaptability, and interpretability, positioning it as a compelling solution for fine-tuning large-scale vision-language models.
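The abstract only summarizes the method, but the core idea it describes, inserting small trainable state-space modules while leaving the base model frozen, can be illustrated with a minimal sketch. The code below is not the authors' implementation: the module and parameter names (`SSMAdapter`, `AdaptedBlock`, `d_state`) are hypothetical, and the selective scan is heavily simplified relative to a full Mamba block.

```python
# Illustrative sketch (assumption, not the paper's code): a lightweight
# Mamba-style selective state-space adapter added residually to a frozen
# LVLM block, so that only the adapter parameters are fine-tuned.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SSMAdapter(nn.Module):
    """Simplified selective state-space module (Mamba-style, illustrative)."""

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_model)
        # Input-dependent ("selective") step size and state projections.
        self.dt_proj = nn.Linear(d_model, 1)
        self.B_proj = nn.Linear(d_model, d_state)
        self.C_proj = nn.Linear(d_model, d_state)
        # Learnable state decay, log-parameterized for stability.
        self.A_log = nn.Parameter(torch.zeros(d_state))
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) mixed visual/text token sequence.
        u = self.in_proj(x)
        dt = F.softplus(self.dt_proj(x))            # (B, L, 1) step sizes
        A = -torch.exp(self.A_log)                  # (d_state,) negative decay rates
        decay = torch.exp(dt * A)                   # (B, L, d_state)
        Bu = self.B_proj(x) * dt                    # (B, L, d_state) inputs to the state

        # Sequential O(L) recurrence over the token sequence.
        h = torch.zeros(x.size(0), A.size(0), device=x.device, dtype=x.dtype)
        ys = []
        for t in range(x.size(1)):
            h = decay[:, t] * h + Bu[:, t]
            # Input-dependent readout of the hidden state.
            ys.append((self.C_proj(x[:, t]) * h).sum(-1, keepdim=True))
        y = torch.cat(ys, dim=1).unsqueeze(-1) * u  # gate the projected input
        return self.out_proj(y)


class AdaptedBlock(nn.Module):
    """Wraps a frozen LVLM block and adds the trainable adapter residually."""

    def __init__(self, frozen_block: nn.Module, d_model: int):
        super().__init__()
        self.block = frozen_block
        for p in self.block.parameters():
            p.requires_grad = False           # base model weights stay frozen
        self.adapter = SSMAdapter(d_model)    # only these parameters are updated

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumes the wrapped block maps (B, L, d_model) -> (B, L, d_model).
        return self.block(x) + self.adapter(x)
```

Under this reading, only the adapter's projection and state parameters receive gradients during fine-tuning, which is what the abstract means by updating "only a fraction of the model's parameters."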
