Deliberation in Latent Space via Differentiable Cache Augmentation Paper • 2412.17747 • Published 2 days ago • 24
PaliGemma 2: A Family of Versatile VLMs for Transfer Paper • 2412.03555 • Published 21 days ago • 118
Mimir: Improving Video Diffusion Models for Precise Text Understanding Paper • 2412.03085 • Published 22 days ago • 12
ShowUI: One Vision-Language-Action Model for GUI Visual Agent Paper • 2411.17465 • Published 30 days ago • 76
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models Paper • 2411.13503 • Published Nov 20 • 30
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory Paper • 2411.11922 • Published Nov 18 • 18
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding Paper • 2410.17434 • Published Oct 22 • 25
Aurora Series: AuroraCap Collection Efficient, Performant Video Detailed Captioning and a New Benchmark • 8 items • Updated Oct 26 • 3
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark Paper • 2410.03051 • Published Oct 4 • 4
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second Paper • 2410.02073 • Published Oct 2 • 41
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning Paper • 2409.20566 • Published Sep 30 • 53
LLaVA-Onevision Collection LLaVa_Onevision models for single-image, multi-image, and video scenarios • 9 items • Updated Sep 18 • 12
Prithvi WxC: Foundation Model for Weather and Climate Paper • 2409.13598 • Published Sep 20 • 39
CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets Paper • 2406.13897 • Published May 30 • 12
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning Paper • 2409.12568 • Published Sep 19 • 47
See and Think: Embodied Agent in Virtual Environment Paper • 2311.15209 • Published Nov 26, 2023 • 2