MoVA: Adapting Mixture of Vision Experts to Multimodal Context Paper • 2404.13046 • Published Apr 19 • 1
EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM Paper • 2412.09618 • Published Dec 12, 2024 • 21
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines Paper • 2409.12959 • Published Sep 19 • 36
Temporal Enhanced Training of Multi-view 3D Object Detector via Historical Object Prediction Paper • 2304.00967 • Published Apr 3, 2023
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models Paper • 2407.07895 • Published Jul 10 • 40
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding Paper • 2406.09411 • Published Jun 13 • 18
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching Paper • 2404.03653 • Published Apr 4 • 33
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? Paper • 2403.14624 • Published Mar 21 • 51
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention Paper • 2303.16199 • Published Mar 28, 2023 • 4
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model Paper • 2304.15010 • Published Apr 28, 2023 • 4
JourneyDB: A Benchmark for Generative Image Understanding Paper • 2307.00716 • Published Jul 3, 2023 • 19
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models Paper • 2306.11732 • Published Jun 15, 2023
Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement Paper • 2304.01195 • Published Apr 3, 2023
MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection Paper • 2203.13310 • Published Mar 24, 2022