Deliberation in Latent Space via Differentiable Cache Augmentation Paper • 2412.17747 • Published 2 days ago • 24
Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? Paper • 2307.14023 • Published Jul 26, 2023 • 1
A Touch, Vision, and Language Dataset for Multimodal Alignment Paper • 2402.13232 • Published Feb 20 • 14
Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features? Paper • 2402.00340 • Published Feb 1 • 1
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs Paper • 2404.05719 • Published Apr 8 • 82
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second Paper • 2410.02073 • Published Oct 2 • 41
Computational Bottlenecks of Training Small-scale Large Language Models Paper • 2410.19456 • Published Oct 25 • 1
Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling Paper • 2405.21048 • Published May 31 • 13
Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum Paper • 2405.13226 • Published May 21 • 1
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities Paper • 2406.09406 • Published Jun 13 • 14
Multimodal Autoregressive Pre-training of Large Vision Encoders Paper • 2411.14402 • Published Nov 21 • 43
Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody? Paper • 2410.24019 • Published Oct 31 • 1
Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning Paper • 2401.06805 • Published Jan 10 • 2