Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
Abstract
We introduce Presto, a novel video diffusion model designed to generate 15-second videos with long-range coherence and rich content. Extending video generation methods to maintain scenario diversity over long durations presents significant challenges. To address this, we propose a Segmented Cross-Attention (SCA) strategy, which splits hidden states into segments along the temporal dimension, allowing each segment to cross-attend to a corresponding sub-caption. SCA requires no additional parameters, enabling seamless incorporation into current DiT-based architectures. To facilitate high-quality long video generation, we build LongTake-HD, a dataset of 261k content-rich videos with scenario coherence, each annotated with an overall video caption and five progressive sub-captions. Experiments show that Presto achieves 78.5% on the VBench Semantic Score and 100% on the Dynamic Degree, outperforming existing state-of-the-art video generation methods. These results demonstrate that Presto significantly enhances content richness, maintains long-range coherence, and captures intricate textual details. More details are available on our project page: https://presto-video.github.io/.
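To make the SCA mechanism concrete, below is a minimal PyTorch sketch based only on the abstract's description. It is an illustration under assumptions, not the authors' implementation: the tensor shapes, the use of `nn.MultiheadAttention`, and the equal five-way frame split are hypothetical choices for demonstration.

```python
import torch
import torch.nn as nn

# Minimal sketch of Segmented Cross-Attention (SCA), assuming a DiT-style
# block where video hidden states have shape (batch, frames, tokens, dim)
# and each of the K sub-captions is encoded to (batch, caption_len, dim).
# Illustrative only; not the authors' code.

class SegmentedCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # One shared attention module reused for every segment: SCA adds no
        # parameters beyond the cross-attention layer already in the block.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, hidden: torch.Tensor, sub_captions: torch.Tensor) -> torch.Tensor:
        # hidden:       (B, T, N, D) video tokens over T frames
        # sub_captions: (B, K, L, D) K encoded sub-captions of length L
        B, T, N, D = hidden.shape
        K = sub_captions.shape[1]
        # Split frames into K temporal segments (the last segment may be
        # shorter when T is not divisible by K).
        segments = torch.chunk(hidden, K, dim=1)
        outputs = []
        for k, seg in enumerate(segments):
            t = seg.shape[1]
            q = seg.reshape(B, t * N, D)   # queries: this segment's tokens
            kv = sub_captions[:, k]        # keys/values: k-th sub-caption
            out, _ = self.attn(q, kv, kv)  # segment attends to its caption
            outputs.append(out.reshape(B, t, N, D))
        return torch.cat(outputs, dim=1)

# Hypothetical usage: five segments cross-attend to five progressive
# sub-captions, matching the LongTake-HD annotation scheme.
if __name__ == "__main__":
    sca = SegmentedCrossAttention(dim=64)
    video = torch.randn(2, 40, 16, 64)   # 40 frames, 16 tokens per frame
    caps = torch.randn(2, 5, 12, 64)     # 5 sub-captions, 12 tokens each
    print(sca(video, caps).shape)        # torch.Size([2, 40, 16, 64])
```

Reusing a single attention module across all segments is what makes the approach parameter-free relative to a standard cross-attention layer, consistent with the abstract's claim that SCA can be dropped into existing DiT-based architectures.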
Community
A recent, advanced video diffusion model.
The following similar papers were recommended by the Semantic Scholar API:
- SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis (2024)
- LVD-2M: A Long-take Video Dataset with Temporally Dense Captions (2024)
- Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content (2024)
- ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos (2024)
- StereoCrafter-Zero: Zero-Shot Stereo Video Generation with Noisy Restart (2024)
- SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context (2024)
- Motion Control for Enhanced Complex Action Video Generation (2024)