wren93 committed on
Commit
4a06f31
β€’
1 Parent(s): df23366

Update README.md

Files changed (1)
  1. README.md +29 -1
README.md CHANGED
@@ -3,4 +3,32 @@ license: mit
  pipeline_tag: video-text-to-text
  ---
 
- This repository contains the model described in [VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation](https://huggingface.co/papers/2412.00927).
+ # VISTA-LongVA
+
+ This repo contains model checkpoints for **VISTA-LongVA**. [VISTA](https://huggingface.co/papers/2412.00927) is a video spatiotemporal augmentation method that generates long-duration and high-resolution video instruction-following data to enhance the video understanding capabilities of video LMMs.
+
+ ### This repo is under construction. Please stay tuned.
+ [**🌐 Homepage**](https://tiger-ai-lab.github.io/VISTA/) | [**📖 arXiv**](https://arxiv.org/abs/2412.00927) | [**💻 GitHub**](https://github.com/TIGER-AI-Lab/VISTA) | [**🤗 VISTA-400K**](https://huggingface.co/datasets/TIGER-Lab/VISTA-400K) | [**🤗 Models**](https://huggingface.co/collections/TIGER-Lab/vista-674a2f0fab81be728a673193) | [**🤗 HRVideoBench**](https://huggingface.co/datasets/TIGER-Lab/HRVideoBench)
+
+ ## Video Instruction Data Synthesis Pipeline
+ <p align="center">
+ <img src="https://tiger-ai-lab.github.io/VISTA/static/images/vista_main.png" width="900">
+ </p>
+
+ VISTA leverages insights from image and video classification data augmentation techniques such as CutMix, MixUp and VideoMix, which demonstrate that training on synthetic data created by overlaying or mixing multiple images or videos yields more robust classifiers. Similarly, our method combines videos spatially and temporally to create augmented video samples with longer durations and higher resolutions, then synthesizes instruction data based on these new videos. Our data synthesis pipeline uses existing public video-caption datasets, making it fully open-source and scalable. This allows us to construct VISTA-400K, a high-quality video instruction-following dataset aimed at improving the long-duration and high-resolution video understanding capabilities of video LMMs.
+
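The spatial and temporal combination idea above can be sketched in a few lines. This is a toy illustration only, not the VISTA pipeline: the helper names `spatial_combine` and `temporal_combine` are hypothetical, frames are plain nested lists standing in for real video tensors, and the actual method additionally synthesizes instruction-following data from the combined clips.

```python
# Toy sketch of spatiotemporal video combination (hypothetical helpers,
# not the authors' implementation). A "video" here is a list of frames,
# and each frame is a 2D list of pixel values.

def spatial_combine(video_a, video_b):
    """Place two equally sized clips side by side -> higher-resolution frames."""
    return [
        [row_a + row_b for row_a, row_b in zip(frame_a, frame_b)]
        for frame_a, frame_b in zip(video_a, video_b)
    ]

def temporal_combine(video_a, video_b):
    """Play one clip after the other -> a longer-duration video."""
    return video_a + video_b

# Two dummy 4-frame, 2x2 single-channel clips.
clip_a = [[[0, 0], [0, 0]] for _ in range(4)]
clip_b = [[[1, 1], [1, 1]] for _ in range(4)]

wide_clip = spatial_combine(clip_a, clip_b)   # 4 frames, each 2x4 pixels
long_clip = temporal_combine(clip_a, clip_b)  # 8 frames, each 2x2 pixels
```

In the paper's setting the same two operations are applied to real videos (tiling clips into higher-resolution frames and concatenating clips in time), and captions for the source videos are used to generate question-answer pairs about the augmented result.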
+
+
+ ## Citation
+ If you find our paper useful, please cite us with:
+ ```bibtex
+ @misc{ren2024vistaenhancinglongdurationhighresolution,
+   title={VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation},
+   author={Weiming Ren and Huan Yang and Jie Min and Cong Wei and Wenhu Chen},
+   year={2024},
+   eprint={2412.00927},
+   archivePrefix={arXiv},
+   primaryClass={cs.CV},
+   url={https://arxiv.org/abs/2412.00927},
+ }
+ ```