UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface • Paper • 2503.01342 • Published 7 days ago • 7
TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters • Paper • 2410.23168 • Published Oct 30, 2024 • 24
LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference • Paper • 2406.18139 • Published Jun 26, 2024 • 2
Distilling an End-to-End Voice Assistant Without Instruction Training Data • Paper • 2410.02678 • Published Oct 3, 2024 • 23
Real-time Holistic Robot Pose Estimation with Unknown States • Paper • 2402.05655 • Published Feb 8, 2024
MIBench: Evaluating Multimodal Large Language Models over Multiple Images • Paper • 2407.15272 • Published Jul 21, 2024 • 10