MM-VID: Advancing Video Understanding with GPT-4V(ision) Paper • 2310.19773 • Published Oct 30, 2023 • 19
Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models Paper • 2310.05863 • Published Oct 9, 2023 • 1
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks Paper • 2311.06242 • Published Nov 10, 2023 • 86
I&S-ViT: An Inclusive & Stable Method for Pushing the Limit of Post-Training ViTs Quantization Paper • 2311.10126 • Published Nov 16, 2023 • 7
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection Paper • 2311.10122 • Published Nov 16, 2023 • 26
Text-Conditioned Resampler For Long Form Video Understanding Paper • 2312.11897 • Published Dec 19, 2023 • 5
Vamos: Versatile Action Models for Video Understanding Paper • 2311.13627 • Published Nov 22, 2023 • 2