Token-Efficient Long Video Understanding for Multimodal LLMs Paper • 2503.04130 • Published 4 days ago • 65
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs Paper • 2503.01743 • Published 6 days ago • 65
VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing Paper • 2502.17258 • Published 13 days ago • 72
Diffusion-Sharpening: Fine-tuning Diffusion Models with Denoising Trajectory Sharpening Paper • 2502.12146 • Published 20 days ago • 16
Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation Paper • 2502.08690 • Published 26 days ago • 41
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment Paper • 2502.10391 • Published 23 days ago • 31
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation Paper • 2502.08639 • Published 25 days ago • 37
Scaling Pre-training to One Hundred Billion Data for Vision Language Models Paper • 2502.07617 • Published 27 days ago • 29
Dual Caption Preference Optimization for Diffusion Models Paper • 2502.06023 • Published 28 days ago • 9
FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation Paper • 2502.05179 • Published about 1 month ago • 24
VideoRoPE: What Makes for Good Video Rotary Position Embedding? Paper • 2502.05173 • Published about 1 month ago • 64
Goku: Flow Based Video Generative Foundation Models Paper • 2502.04896 • Published about 1 month ago • 96
AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting Paper • 2502.05176 • Published about 1 month ago • 32
HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial Network for High-Fidelity Speech Super-Resolution Paper • 2501.10045 • Published Jan 17 • 9
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper • 2501.12380 • Published Jan 21 • 83
Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass Paper • 2501.13928 • Published Jan 23 • 17