cooleel's Collections

Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
Paper • 2410.16153 • Published • 44

AutoTrain: No-code training for state-of-the-art models
Paper • 2410.15735 • Published • 59

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
Paper • 2410.12787 • Published • 31

LEOPARD: A Vision Language Model For Text-Rich Multi-Image Tasks
Paper • 2410.01744 • Published • 26

UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models
Paper • 2410.14059 • Published • 59

NVLM: Open Frontier-Class Multimodal LLMs
Paper • 2409.11402 • Published • 73

MIO: A Foundation Model on Multimodal Tokens
Paper • 2409.17692 • Published • 53

Emu3: Next-Token Prediction is All You Need
Paper • 2409.18869 • Published • 94

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Paper • 2409.17146 • Published • 108

Analyzing The Language of Visual Tokens
Paper • 2411.05001 • Published • 24

Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination
Paper • 2411.03823 • Published • 45

SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
Paper • 2410.02367 • Published • 48

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
Paper • 2410.05160 • Published • 4

Paper • 2410.07073 • Published • 64

LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models
Paper • 2410.09732 • Published • 55

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
Paper • 2410.17247 • Published • 46

SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration
Paper • 2411.10958 • Published • 53

Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts
Paper • 2411.10669 • Published • 10

Autoregressive Models in Vision: A Survey
Paper • 2411.05902 • Published • 18

LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
Paper • 2411.04997 • Published • 37

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
Paper • 2411.04996 • Published • 51

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
Paper • 2412.05271 • Published • 136

Progressive Multimodal Reasoning via Active Retrieval
Paper • 2412.14835 • Published • 73

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
Paper • 2412.14475 • Published • 55

POINTS1.5: Building a Vision-Language Model towards Real World Applications
Paper • 2412.08443 • Published • 38

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Paper • 2412.18619 • Published • 55

FastVLM: Efficient Vision Encoding for Vision Language Models
Paper • 2412.13303 • Published • 13

MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
Paper • 2501.08828 • Published • 30

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
Paper • 2412.07626 • Published • 22

OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation
Paper • 2412.09585 • Published • 11

Smaller Language Models Are Better Instruction Evolvers
Paper • 2412.11231 • Published • 27

ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models
Paper • 2502.09696 • Published • 38

MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
Paper • 2502.10391 • Published • 31

MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
Paper • 2502.17422 • Published • 7

Introducing Visual Perception Token into Multimodal Large Language Model
Paper • 2502.17425 • Published • 14

KV-Edit: Training-Free Image Editing for Precise Background Preservation
Paper • 2502.17363 • Published • 32

OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
Paper • 2502.18411 • Published • 69

MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge
Paper • 2502.19870 • Published • 3

Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think
Paper • 2502.20172 • Published • 26

UniTok: A Unified Tokenizer for Visual Generation and Understanding
Paper • 2502.20321 • Published • 27

Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models
Paper • 2502.16033 • Published • 16

VLM^2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues
Paper • 2502.12084 • Published • 29

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Paper • 2502.14786 • Published • 128

LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models
Paper • 2502.14834 • Published • 24

Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models
Paper • 2502.14191 • Published • 7

Scaling Pre-training to One Hundred Billion Data for Vision Language Models
Paper • 2502.07617 • Published • 29

Qwen2.5-VL Technical Report
Paper • 2502.13923 • Published • 157

Magma: A Foundation Model for Multimodal AI Agents
Paper • 2502.13130 • Published • 55

YOLOv12: Attention-Centric Real-Time Object Detectors
Paper • 2502.12524 • Published • 10

Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
Paper • 2502.08826 • Published • 17

R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts
Paper • 2502.20395 • Published • 43

MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning
Paper • 2502.19634 • Published • 56

MIGE: A Unified Framework for Multimodal Instruction-Based Image Generation and Editing
Paper • 2502.21291 • Published • 4

Tell me why: Visual foundation models as self-explainable classifiers
Paper • 2502.19577 • Published • 10