LINGOLY-TOO: Disentangling Memorisation from Reasoning with Linguistic Templatisation and Orthographic Obfuscation Paper • 2503.02972 • Published 5 days ago • 23
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface Paper • 2503.01342 • Published 7 days ago • 7
AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence Paper • 2502.13943 • Published 18 days ago • 7
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding Paper • 2501.13106 • Published Jan 22 • 84
Next Block Prediction: Video Generation via Semi-Autoregressive Modeling Paper • 2502.07737 • Published 26 days ago • 9
Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving Paper • 2502.07640 • Published 26 days ago • 8
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step Paper • 2501.13926 • Published Jan 23 • 37
Temporal Preference Optimization for Long-Form Video Understanding Paper • 2501.13919 • Published Jan 23 • 22
Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding Paper • 2501.07888 • Published Jan 14 • 15
VideoRAG: Retrieval-Augmented Generation over Video Corpus Paper • 2501.05874 • Published Jan 10 • 68
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs Paper • 2501.06186 • Published Jan 10 • 61
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks Paper • 2501.08326 • Published Jan 14 • 32
VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation Paper • 2412.21059 • Published Dec 30, 2024 • 18
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey Paper • 2412.18619 • Published Dec 16, 2024 • 55
Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization Paper • 2412.18525 • Published Dec 24, 2024 • 75