Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure
Abstract
Vision-Language Models (VLMs) can process visual and textual information in multiple formats: texts, images, interleaved texts and images, or even hour-long videos. In this work, we conduct fine-grained quantitative and qualitative analyses of automatic summarization of multimodal presentations using VLMs with various representations as input. From these experiments, we suggest cost-effective strategies for generating summaries from text-heavy multimodal documents under different input-length budgets using VLMs. We show that slides extracted from the video stream can be beneficially used as input in place of the raw video, and that a structured representation built from interleaved slides and transcript provides the best performance. Finally, we reflect on the nature of cross-modal interactions in multimodal presentations and share suggestions to improve the capabilities of VLMs to understand documents of this nature.
Community
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization? (2025)
- A Cascaded Architecture for Extractive Summarization of Multimedia Content via Audio-to-Text Alignment (2025)
- VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering (2025)
- Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection (2025)
- SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering (2025)
- Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs (2025)
- DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering (2025)