PUMA: Empowering Unified MLLM with Multi-granular Visual Generation Paper • 2410.13861 • Published Oct 17 • 52
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation Paper • 2411.07975 • Published Nov 12 • 27
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization Paper • 2411.10442 • Published Nov 15 • 68
Multimodal Autoregressive Pre-training of Large Vision Encoders Paper • 2411.14402 • Published Nov 21 • 43
DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding Paper • 2411.14347 • Published Nov 21 • 13
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models Paper • 2411.14982 • Published Nov 22 • 16
Efficient Long Video Tokenization via Coordinated-based Patch Reconstruction Paper • 2411.14762 • Published Nov 22 • 11
TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives Paper • 2411.02545 • Published Nov 4 • 1
Hymba: A Hybrid-head Architecture for Small Language Models Paper • 2411.13676 • Published Nov 20 • 39
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory Paper • 2411.11922 • Published Nov 18 • 18
ShowUI: One Vision-Language-Action Model for GUI Visual Agent Paper • 2411.17465 • Published 30 days ago • 76
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration Paper • 2411.17686 • Published 29 days ago • 18
DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting Paper • 2411.17223 • Published 30 days ago • 5
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity Paper • 2411.15411 • Published Nov 23 • 7
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI Paper • 2411.14522 • Published Nov 21 • 31
Knowledge Transfer Across Modalities with Natural Language Supervision Paper • 2411.15611 • Published Nov 23 • 15
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding Paper • 2411.18363 • Published 29 days ago • 9
EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality Paper • 2411.15241 • Published Nov 22 • 5
Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient Paper • 2411.17787 • Published 30 days ago • 11
On Domain-Specific Post-Training for Multimodal Large Language Models Paper • 2411.19930 • Published 26 days ago • 24
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos Paper • 2409.19603 • Published Sep 29 • 18
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding Paper • 2406.19389 • Published Jun 27 • 52
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning Paper • 2412.03248 • Published 22 days ago • 25
CompCap: Improving Multimodal Large Language Models with Composite Captions Paper • 2412.05243 • Published 19 days ago • 18
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion Paper • 2412.04424 • Published 20 days ago • 55
POINTS1.5: Building a Vision-Language Model towards Real World Applications Paper • 2412.08443 • Published 15 days ago • 38
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions Paper • 2412.08737 • Published 14 days ago • 51
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding Paper • 2412.09604 • Published 13 days ago • 35
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer Paper • 2412.13871 • Published 8 days ago • 17
AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities Paper • 2412.14123 • Published 7 days ago • 11
FastVLM: Efficient Vision Encoding for Vision Language Models Paper • 2412.13303 • Published 8 days ago • 13
Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models Paper • 2412.05939 • Published 18 days ago • 12
Grounding Descriptions in Images informs Zero-Shot Visual Recognition Paper • 2412.04429 • Published 20 days ago