zerozeyi's Collections
VisionLM
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper
•
2402.04252
•
Published
•
28
Vision Superalignment: Weak-to-Strong Generalization for Vision
Foundation Models
Paper
•
2402.03749
•
Published
•
13
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper
•
2402.04615
•
Published
•
44
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance
Loss
Paper
•
2402.05008
•
Published
•
23
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue
Paper
•
2402.05930
•
Published
•
40
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large
Language Models
Paper
•
2402.05935
•
Published
•
17
ViGoR: Improving Visual Grounding of Large Vision Language Models with
Fine-Grained Reward Modeling
Paper
•
2402.06118
•
Published
•
15
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
Paper
•
2402.07456
•
Published
•
45
PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
Paper
•
2402.07872
•
Published
•
16
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned
Language Models
Paper
•
2402.07865
•
Published
•
15
World Model on Million-Length Video And Language With RingAttention
Paper
•
2402.08268
•
Published
•
39
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong
Vision-language Adapter
Paper
•
2402.10896
•
Published
•
16
FinTral: A Family of GPT-4 Level Multimodal Financial Large Language
Models
Paper
•
2402.10986
•
Published
•
80
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Paper
•
2402.12226
•
Published
•
44
CoLLaVO: Crayon Large Language and Vision mOdel
Paper
•
2402.11248
•
Published
•
23
Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning
Paper
•
2402.11690
•
Published
•
10
VideoPrism: A Foundational Visual Encoder for Video Understanding
Paper
•
2402.13217
•
Published
•
25
Video ReCap: Recursive Captioning of Hour-Long Videos
Paper
•
2402.13250
•
Published
•
27
A Touch, Vision, and Language Dataset for Multimodal Alignment
Paper
•
2402.13232
•
Published
•
15
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on
Deceptive Prompts
Paper
•
2402.13220
•
Published
•
15
BBA: Bi-Modal Behavioral Alignment for Reasoning with Large
Vision-Language Models
Paper
•
2402.13577
•
Published
•
10
PALO: A Polyglot Large Multimodal Model for 5B People
Paper
•
2402.14818
•
Published
•
25
TinyLLaVA: A Framework of Small-scale Large Multimodal Models
Paper
•
2402.14289
•
Published
•
20
Sora: A Review on Background, Technology, Limitations, and Opportunities
of Large Vision Models
Paper
•
2402.17177
•
Published
•
89
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Paper
•
2402.19479
•
Published
•
35
MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
Paper
•
2403.01422
•
Published
•
30
InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding
Paper
•
2403.01487
•
Published
•
16
Finetuned Multimodal Language Models Are High-Quality Image-Text Data
Filters
Paper
•
2403.02677
•
Published
•
18
Modeling Collaborator: Enabling Subjective Vision Classification With
Minimal Human Effort via LLM Tool-Use
Paper
•
2403.02626
•
Published
•
11
MAGID: An Automated Pipeline for Generating Synthetic Multi-modal
Datasets
Paper
•
2403.03194
•
Published
•
14
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large
Language Models
Paper
•
2403.03003
•
Published
•
11
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper
•
2403.09611
•
Published
•
128
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper
•
2403.07508
•
Published
•
77
Synth^2: Boosting Visual-Language Models with Synthetic Captions and
Image Embeddings
Paper
•
2403.07750
•
Published
•
24
DragAnything: Motion Control for Anything using Entity Representation
Paper
•
2403.07420
•
Published
•
15
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference
Acceleration for Large Vision-Language Models
Paper
•
2403.06764
•
Published
•
29
VideoMamba: State Space Model for Efficient Video Understanding
Paper
•
2403.06977
•
Published
•
31
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Paper
•
2403.05135
•
Published
•
46
Gemini 1.5: Unlocking multimodal understanding across millions of tokens
of context
Paper
•
2403.05530
•
Published
•
65
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Paper
•
2403.05525
•
Published
•
45
VideoElevator: Elevating Video Generation Quality with Versatile
Text-to-Image Diffusion Models
Paper
•
2403.05438
•
Published
•
22
Uni-SMART: Universal Science Multimodal Analysis and Research
Transformer
Paper
•
2403.10301
•
Published
•
54
VideoAgent: Long-form Video Understanding with Large Language Model as
Agent
Paper
•
2403.10517
•
Published
•
36
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Paper
•
2403.11703
•
Published
•
17
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
Paper
•
2403.11481
•
Published
•
13
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document
Understanding
Paper
•
2403.12895
•
Published
•
33
Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
Paper
•
2403.12596
•
Published
•
10
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual
Math Problems?
Paper
•
2403.14624
•
Published
•
53
Can large language models explore in-context?
Paper
•
2403.15371
•
Published
•
34
InternVideo2: Scaling Video Foundation Models for Multimodal Video
Understanding
Paper
•
2403.15377
•
Published
•
26
SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate
Time series
Paper
•
2403.15360
•
Published
•
13
VidLA: Video-Language Alignment at Scale
Paper
•
2403.14870
•
Published
•
14
ViTAR: Vision Transformer with Any Resolution
Paper
•
2403.18361
•
Published
•
56
Mini-Gemini: Mining the Potential of Multi-modality Vision Language
Models
Paper
•
2403.18814
•
Published
•
48
sDPO: Don't Use Your Data All at Once
Paper
•
2403.19270
•
Published
•
42
TextCraftor: Your Text Encoder Can be Image Quality Controller
Paper
•
2403.18978
•
Published
•
15
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision
Language Models
Paper
•
2403.20331
•
Published
•
16
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Paper
•
2404.01197
•
Published
•
32
Direct Preference Optimization of Video Large Multimodal Models from
Language Model Reward
Paper
•
2404.01258
•
Published
•
12
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with
Interleaved Visual-Textual Tokens
Paper
•
2404.03413
•
Published
•
29
LVLM-Intrepret: An Interpretability Tool for Large Vision-Language
Models
Paper
•
2404.03118
•
Published
•
27
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept
Matching
Paper
•
2404.03653
•
Published
•
37
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Paper
•
2404.05719
•
Published
•
83
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video
Understanding
Paper
•
2404.05726
•
Published
•
23
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation
Paper
•
2404.05674
•
Published
•
15
Koala: Key frame-conditioned long video-LLM
Paper
•
2404.04346
•
Published
•
7
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model
Handling Resolutions from 336 Pixels to 4K HD
Paper
•
2404.06512
•
Published
•
31
Adapting LLaMA Decoder to Vision Transformer
Paper
•
2404.06773
•
Published
•
18
BRAVE: Broadening the visual encoding of vision-language models
Paper
•
2404.07204
•
Published
•
19
Transferable and Principled Efficiency for Open-Vocabulary Segmentation
Paper
•
2404.07448
•
Published
•
12
Ferret-v2: An Improved Baseline for Referring and Grounding with Large
Language Models
Paper
•
2404.07973
•
Published
•
33
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing
Paper
•
2404.09990
•
Published
•
13
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal
Large Language Models
Paper
•
2404.09204
•
Published
•
11
On Speculative Decoding for Multimodal Large Language Models
Paper
•
2404.08856
•
Published
•
14
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language
Models
Paper
•
2404.12387
•
Published
•
40
BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper
•
2404.12390
•
Published
•
27
MultiBooth: Towards Generating All Your Concepts in an Image from Text
Paper
•
2404.14239
•
Published
•
9
A Multimodal Automated Interpretability Agent
Paper
•
2404.14394
•
Published
•
22
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Paper
•
2404.12803
•
Published
•
31
Groma: Localized Visual Tokenization for Grounding Multimodal Large
Language Models
Paper
•
2404.13013
•
Published
•
32
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster
Pre-training on Web-scale Image-Text Data
Paper
•
2404.15653
•
Published
•
29
Editable Image Elements for Controllable Synthesis
Paper
•
2404.16029
•
Published
•
11
MoDE: CLIP Data Experts via Clustering
Paper
•
2404.16030
•
Published
•
14
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with
Text-Rich Visual Comprehension
Paper
•
2404.16790
•
Published
•
9
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal
Models with Open-Source Suites
Paper
•
2404.16821
•
Published
•
60
List Items One by One: A New Data Source and Learning Paradigm for
Multimodal LLMs
Paper
•
2404.16375
•
Published
•
18
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video
Dense Captioning
Paper
•
2404.16994
•
Published
•
37
HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring
Unconstrained Photo Collections
Paper
•
2404.16845
•
Published
•
7
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models
Paper
•
2404.17672
•
Published
•
20
Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual
and Action Representations
Paper
•
2404.17521
•
Published
•
13
Automatic Creative Selection with Cross-Modal Matching
Paper
•
2405.00029
•
Published
•
9
What matters when building vision-language models?
Paper
•
2405.02246
•
Published
•
104
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large
Language Models in Code Generation from Scientific Plots
Paper
•
2405.07990
•
Published
•
21
No Time to Waste: Squeeze Time into Channel for Mobile Video
Understanding
Paper
•
2405.08344
•
Published
•
16
Understanding the performance gap between online and offline alignment
algorithms
Paper
•
2405.08448
•
Published
•
20
SpeechVerse: A Large-scale Generalizable Audio Language Model
Paper
•
2405.08295
•
Published
•
20
SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large
Language Models
Paper
•
2405.08317
•
Published
•
13
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
Paper
•
2405.09215
•
Published
•
23
LoRA Learns Less and Forgets Less
Paper
•
2405.09673
•
Published
•
89
Many-Shot In-Context Learning in Multimodal Foundation Models
Paper
•
2405.09798
•
Published
•
32
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Paper
•
2405.09818
•
Published
•
132
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Paper
•
2405.10300
•
Published
•
30
Toon3D: Seeing Cartoons from a New Perspective
Paper
•
2405.10320
•
Published
•
23
Octo: An Open-Source Generalist Robot Policy
Paper
•
2405.12213
•
Published
•
30
Imp: Highly Capable Large Multimodal Models for Mobile Devices
Paper
•
2405.12107
•
Published
•
30
Your Transformer is Secretly Linear
Paper
•
2405.12250
•
Published
•
159
Diffusion for World Modeling: Visual Details Matter in Atari
Paper
•
2405.12399
•
Published
•
31
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment
Capability
Paper
•
2405.14129
•
Published
•
14
CamViG: Camera Aware Image-to-Video Generation with Multimodal
Transformers
Paper
•
2405.13195
•
Published
•
12
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision
Models
Paper
•
2405.15574
•
Published
•
56
Denoising LM: Pushing the Limits of Error Correction Models for Speech
Recognition
Paper
•
2405.15216
•
Published
•
17
An Introduction to Vision-Language Modeling
Paper
•
2405.17247
•
Published
•
90
Matryoshka Multimodal Models
Paper
•
2405.17430
•
Published
•
34
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding
Models
Paper
•
2405.17428
•
Published
•
20
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal
Models
Paper
•
2405.15738
•
Published
•
47
Dense Connector for MLLMs
Paper
•
2405.13800
•
Published
•
25
Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation
Paper
•
2405.14598
•
Published
•
14
Jina CLIP: Your CLIP Model Is Also Your Text Retriever
Paper
•
2405.20204
•
Published
•
37
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
Paper
•
2405.18669
•
Published
•
12
MotionLLM: Understanding Human Behaviors from Human Motions and Videos
Paper
•
2405.20340
•
Published
•
21
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of
Multi-modal LLMs in Video Analysis
Paper
•
2405.21075
•
Published
•
24
Show, Don't Tell: Aligning Language Models with Demonstrated Feedback
Paper
•
2406.00888
•
Published
•
34
Parrot: Multilingual Visual Instruction Tuning
Paper
•
2406.02539
•
Published
•
39
PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with
LLM
Paper
•
2406.02884
•
Published
•
18
ShareGPT4Video: Improving Video Understanding and Generation with Better
Captions
Paper
•
2406.04325
•
Published
•
76
AgentGym: Evolving Large Language Model-based Agents across Diverse
Environments
Paper
•
2406.04151
•
Published
•
20
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective
Navigation via Multi-Agent Collaboration
Paper
•
2406.01014
•
Published
•
35
Vript: A Video Is Worth Thousands of Words
Paper
•
2406.06040
•
Published
•
30
An Image is Worth 32 Tokens for Reconstruction and Generation
Paper
•
2406.07550
•
Published
•
60
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
Paper
•
2406.06911
•
Published
•
12
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio
Understanding in Video-LLMs
Paper
•
2406.07476
•
Published
•
38
What If We Recaption Billions of Web Images with LLaMA-3?
Paper
•
2406.08478
•
Published
•
42
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation
in Videos
Paper
•
2406.08407
•
Published
•
29
Needle In A Multimodal Haystack
Paper
•
2406.07230
•
Published
•
55
mDPO: Conditional Preference Optimization for Multimodal Large Language
Models
Paper
•
2406.11839
•
Published
•
39
VideoLLM-online: Online Video Large Language Model for Streaming Video
Paper
•
2406.11816
•
Published
•
25
TroL: Traversal of Layers for Large Language and Vision Models
Paper
•
2406.12246
•
Published
•
36
VoCo-LLaMA: Towards Vision Compression with Large Language Models
Paper
•
2406.12275
•
Published
•
32
Benchmarking Multi-Image Understanding in Vision and Language Models:
Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
Paper
•
2406.12742
•
Published
•
15
Adversarial Attacks on Multimodal Agents
Paper
•
2406.12814
•
Published
•
4
Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of
Multimodal Large Language Models
Paper
•
2406.11230
•
Published
•
35
Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations
for Vision Foundation Models
Paper
•
2406.12649
•
Published
•
16
Understanding Hallucinations in Diffusion Models through Mode
Interpolation
Paper
•
2406.09358
•
Published
•
5
CMC-Bench: Towards a New Paradigm of Visual Signal Compression
Paper
•
2406.09356
•
Published
•
5
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
Paper
•
2406.09406
•
Published
•
15
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal
Language Models
Paper
•
2406.09403
•
Published
•
22
MuirBench: A Comprehensive Benchmark for Robust Multi-image
Understanding
Paper
•
2406.09411
•
Published
•
20
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Paper
•
2406.08707
•
Published
•
16
EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal
Prompts
Paper
•
2406.09162
•
Published
•
14
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images
Interleaved with Text
Paper
•
2406.08418
•
Published
•
30
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on
Mobile Devices
Paper
•
2406.08451
•
Published
•
26
Paper
•
2406.04127
•
Published
•
40
NaRCan: Natural Refined Canonical Image with Integration of Diffusion
Prior for Video Editing
Paper
•
2406.06523
•
Published
•
53
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
Paper
•
2406.08487
•
Published
•
14
VCR: Visual Caption Restoration
Paper
•
2406.06462
•
Published
•
13
An Image is Worth More Than 16x16 Patches: Exploring Transformers on
Individual Pixels
Paper
•
2406.09415
•
Published
•
52
OpenVLA: An Open-Source Vision-Language-Action Model
Paper
•
2406.09246
•
Published
•
39
DiTFastAttn: Attention Compression for Diffusion Transformer Models
Paper
•
2406.08552
•
Published
•
26
Physics3D: Learning Physical Properties of 3D Gaussians via Video
Diffusion
Paper
•
2406.04338
•
Published
•
40
Hibou: A Family of Foundational Vision Transformers for Pathology
Paper
•
2406.05074
•
Published
•
9
Make It Count: Text-to-Image Generation with an Accurate Number of
Objects
Paper
•
2406.10210
•
Published
•
78
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context
Reinforcement Learning
Paper
•
2406.08973
•
Published
•
90
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and
Instruction-Tuning Dataset for LVLMs
Paper
•
2406.11833
•
Published
•
64
Exploring the Role of Large Language Models in Prompt Encoding for
Diffusion Models
Paper
•
2406.11831
•
Published
•
22
From Pixels to Prose: A Large Dataset of Dense Image Captions
Paper
•
2406.10328
•
Published
•
18
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
Paper
•
2406.14544
•
Published
•
36
WildVision: Evaluating Vision-Language Models in the Wild with Human
Preferences
Paper
•
2406.11069
•
Published
•
14
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal
Dataset with One Trillion Tokens
Paper
•
2406.11271
•
Published
•
21
Paper
•
2406.11775
•
Published
•
8
Unifying Multimodal Retrieval via Document Screenshot Embedding
Paper
•
2406.11251
•
Published
•
10
The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN
Inversion and High Quality Image Editing
Paper
•
2406.10601
•
Published
•
70
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video
Understanding
Paper
•
2406.14515
•
Published
•
34
Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation
Modelling in Large Multimodal Models
Paper
•
2406.14035
•
Published
•
13
ICAL: Continual Learning of Multimodal Agents by Transforming
Trajectories into Actionable Insights
Paper
•
2406.14596
•
Published
•
5
Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical
Report
Paper
•
2406.11403
•
Published
•
4
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in
Large Video-Language Models
Paper
•
2406.16338
•
Published
•
27
Long Context Transfer from Language to Vision
Paper
•
2406.16852
•
Published
•
34
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Paper
•
2406.16860
•
Published
•
61
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
Paper
•
2406.17770
•
Published
•
19
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Paper
•
2406.15704
•
Published
•
5
Octo-planner: On-device Language Model for Planner-Action Agents
Paper
•
2406.18082
•
Published
•
49
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal
LLMs
Paper
•
2406.18521
•
Published
•
30
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
Paper
•
2406.15334
•
Published
•
9
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large
Language Models
Paper
•
2406.17294
•
Published
•
11
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and
Understanding
Paper
•
2406.19389
•
Published
•
55
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of
LLMs
Paper
•
2406.18629
•
Published
•
43
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
Paper
•
2406.18790
•
Published
•
35
Simulating Classroom Education with LLM-Empowered Agents
Paper
•
2406.19226
•
Published
•
32
AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for
Vision-Language Models
Paper
•
2406.10900
•
Published
•
11
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Paper
•
2406.20095
•
Published
•
18
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything
Model
Paper
•
2406.20076
•
Published
•
10
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity
Paper
•
2406.17720
•
Published
•
8
We-Math: Does Your Large Multimodal Model Achieve Human-like
Mathematical Reasoning?
Paper
•
2407.01284
•
Published
•
81
ROS-LLM: A ROS framework for embodied AI with task feedback and
structured reasoning
Paper
•
2406.19741
•
Published
•
63
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and
Efficient Evaluation
Paper
•
2407.00468
•
Published
•
37
ColPali: Efficient Document Retrieval with Vision Language Models
Paper
•
2407.01449
•
Published
•
47
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables
Open-World Instruction Following Agents
Paper
•
2407.00114
•
Published
•
13
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Paper
•
2407.02477
•
Published
•
23
InternLM-XComposer-2.5: A Versatile Large Vision Language Model
Supporting Long-Contextual Input and Output
Paper
•
2407.03320
•
Published
•
96
TokenPacker: Efficient Visual Projector for Multimodal LLM
Paper
•
2407.02392
•
Published
•
23
Unveiling Encoder-Free Vision-Language Models
Paper
•
2406.11832
•
Published
•
55
Flash-VStream: Memory-Based Real-Time Understanding for Long Video
Streams
Paper
•
2406.08085
•
Published
•
17
Granular Privacy Control for Geolocation with Vision Language Models
Paper
•
2407.04952
•
Published
•
7
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for
Interleaved Image-Text Generation
Paper
•
2407.06135
•
Published
•
23
Multi-Object Hallucination in Vision-Language Models
Paper
•
2407.06192
•
Published
•
12
Vision language models are blind
Paper
•
2407.06581
•
Published
•
83
VIMI: Grounding Video Generation through Multi-modal Instruction
Paper
•
2407.06304
•
Published
•
10
Video-to-Audio Generation with Hidden Alignment
Paper
•
2407.07464
•
Published
•
17
Stark: Social Long-Term Multi-Modal Conversation with Persona
Commonsense Knowledge
Paper
•
2407.03958
•
Published
•
22
Understanding Visual Feature Reliance through the Lens of Complexity
Paper
•
2407.06076
•
Published
•
7
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting
Region Captions
Paper
•
2407.06723
•
Published
•
11
PaliGemma: A versatile 3B VLM for transfer
Paper
•
2407.07726
•
Published
•
71
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large
Multimodal Models
Paper
•
2407.07895
•
Published
•
43
Do Vision and Language Models Share Concepts? A Vector Space Alignment
Study
Paper
•
2302.06555
•
Published
•
9
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal
Perception
Paper
•
2407.08303
•
Published
•
19
The Synergy between Data and Multi-Modal Large Language Models: A Survey
from Co-Development Perspective
Paper
•
2407.08583
•
Published
•
13
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning
Instruction Using Language Model
Paper
•
2407.07053
•
Published
•
47
E5-V: Universal Embeddings with Multimodal Large Language Models
Paper
•
2407.12580
•
Published
•
41
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
Paper
•
2407.12679
•
Published
•
8
AUITestAgent: Automatic Requirements Oriented GUI Function Testing
Paper
•
2407.09018
•
Published
•
5
ThinkGrasp: A Vision-Language System for Strategic Part Grasping in
Clutter
Paper
•
2407.11298
•
Published
•
5
NavGPT-2: Unleashing Navigational Reasoning Capability for Large
Vision-Language Models
Paper
•
2407.12366
•
Published
•
4
Benchmarking Trustworthiness of Multimodal Large Language Models: A
Comprehensive Study
Paper
•
2406.07057
•
Published
•
17
EVLM: An Efficient Vision-Language Model for Visual Understanding
Paper
•
2407.14177
•
Published
•
45
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document
Understanding
Paper
•
2407.12594
•
Published
•
19
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language
Models
Paper
•
2407.15841
•
Published
•
41
VideoGameBunny: Towards vision assistants for video games
Paper
•
2407.15295
•
Published
•
22
CGB-DM: Content and Graphic Balance Layout Generation with
Transformer-based Diffusion Model
Paper
•
2407.15233
•
Published
•
6
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any
Person
Paper
•
2407.16224
•
Published
•
29
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence
Paper
•
2407.16655
•
Published
•
31
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal
Large Language Model
Paper
•
2407.16198
•
Published
•
13
VILA^2: VILA Augmented VILA
Paper
•
2407.17453
•
Published
•
42
Learning to Manipulate Anywhere: A Visual Generalizable Framework For
Reinforcement Learning
Paper
•
2407.15815
•
Published
•
14
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
Paper
•
2407.17490
•
Published
•
32
Efficient Inference of Vision Instruction-Following Models with Elastic
Cache
Paper
•
2407.18121
•
Published
•
17
VSSD: Vision Mamba with Non-Casual State Space Duality
Paper
•
2407.18559
•
Published
•
19
Wolf: Captioning Everything with a World Summarization Framework
Paper
•
2407.18908
•
Published
•
33
Diffusion Feedback Helps CLIP See Better
Paper
•
2407.20171
•
Published
•
37
VolDoGer: LLM-assisted Datasets for Domain Generalization in
Vision-Language Tasks
Paper
•
2407.19795
•
Published
•
11
Mixture of Nested Experts: Adaptive Processing of Visual Tokens
Paper
•
2407.19985
•
Published
•
37
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware
Experts
Paper
•
2407.21770
•
Published
•
23
Towards Achieving Human Parity on End-to-end Simultaneous Speech
Translation via LLM Agent
Paper
•
2407.21646
•
Published
•
18
ShieldGemma: Generative AI Content Moderation Based on Gemma
Paper
•
2407.21772
•
Published
•
14
Open-Vocabulary Audio-Visual Semantic Segmentation
Paper
•
2407.21721
•
Published
•
9
SAM 2: Segment Anything in Images and Videos
Paper
•
2408.00714
•
Published
•
115
OmniParser for Pure Vision Based GUI Agent
Paper
•
2408.00203
•
Published
•
26
Generalized Out-of-Distribution Detection and Beyond in Vision Language
Model Era: A Survey
Paper
•
2407.21794
•
Published
•
6
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Paper
•
2408.01800
•
Published
•
83
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation
with Multimodal Generative Pretraining
Paper
•
2408.02657
•
Published
•
36
Language Model Can Listen While Speaking
Paper
•
2408.02622
•
Published
•
42
ExoViP: Step-by-step Verification and Exploration with Exoskeleton
Modules for Compositional Visual Reasoning
Paper
•
2408.02210
•
Published
•
9
Operationalizing Contextual Integrity in Privacy-Conscious Assistants
Paper
•
2408.02373
•
Published
•
5
LLaVA-OneVision: Easy Visual Task Transfer
Paper
•
2408.03326
•
Published
•
61
Diffusion Models as Data Mining Tools
Paper
•
2408.02752
•
Published
•
14
AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual
Segmentation
Paper
•
2408.01708
•
Published
•
4
Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in
Long-Horizon Tasks
Paper
•
2408.03615
•
Published
•
32
Openstory++: A Large-scale Dataset and Benchmark for Instance-aware
Open-domain Visual Storytelling
Paper
•
2408.03695
•
Published
•
13
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond
Paper
•
2408.03900
•
Published
•
10
Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from
User's Casual Sketches
Paper
•
2408.04567
•
Published
•
27
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language
Models
Paper
•
2408.04594
•
Published
•
15
Puppet-Master: Scaling Interactive Video Generation as a Motion Prior
for Part-Level Dynamics
Paper
•
2408.04631
•
Published
•
10
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Paper
•
2408.05211
•
Published
•
49
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal
Large Language Models
Paper
•
2408.04840
•
Published
•
35
UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond
Scaling
Paper
•
2408.04810
•
Published
•
25
ControlNeXt: Powerful and Efficient Control for Image and Video
Generation
Paper
•
2408.06070
•
Published
•
54
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation
Agents
Paper
•
2408.06327
•
Published
•
17
UniPortrait: A Unified Framework for Identity-Preserving Single- and
Multi-Human Image Personalization
Paper
•
2408.05939
•
Published
•
15
Paper
•
2408.07009
•
Published
•
62
Amuro & Char: Analyzing the Relationship between Pre-Training and
Fine-Tuning of Large Language Models
Paper
•
2408.06663
•
Published
•
16
Paper
•
2408.05366
•
Published
•
13
Towards flexible perception with visual memory
Paper
•
2408.08172
•
Published
•
24
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper
•
2408.08872
•
Published
•
101
JPEG-LM: LLMs as Image Generators with Canonical Codec Representations
Paper
•
2408.08459
•
Published
•
46
D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning
Paper
•
2408.08441
•
Published
•
8
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Paper
•
2408.10188
•
Published
•
53
MegaFusion: Extend Diffusion Models towards Higher-resolution Image
Generation without Further Tuning
Paper
•
2408.11001
•
Published
•
12
Factorized-Dreamer: Training A High-Quality Video Generator with Limited
and Low-Quality Data
Paper
•
2408.10119
•
Published
•
17
Transfusion: Predict the Next Token and Diffuse Images with One
Multi-Modal Model
Paper
•
2408.11039
•
Published
•
61
NeCo: Improving DINOv2's spatial representations in 19 GPU hours with
Patch Neighbor Consistency
Paper
•
2408.11054
•
Published
•
13
Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion
for Efficient Inference Intervention in Large Language Model
Paper
•
2408.10764
•
Published
•
9
Audio Match Cutting: Finding and Creating Matching Audio Transitions in
Movies and Videos
Paper
•
2408.10998
•
Published
•
9
MambaEVT: Event Stream based Visual Object Tracking using State Space
Model
Paper
•
2408.10487
•
Published
•
7
FocusLLM: Scaling LLM's Context by Parallel Decoding
Paper
•
2408.11745
•
Published
•
26
TWLV-I: Analysis and Insights from Holistic Evaluation on Video
Foundation Models
Paper
•
2408.11318
•
Published
•
57
GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
Paper
•
2408.11817
•
Published
•
9
FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive
Prompt Weighting
Paper
•
2408.11706
•
Published
•
7
TrackGo: A Flexible and Efficient Method for Controllable Video
Generation
Paper
•
2408.11475
•
Published
•
18
Out-of-Distribution Detection with Attention Head Masking for Multimodal
Document Classification
Paper
•
2408.11237
•
Published
•
6
Iterative Object Count Optimization for Text-to-image Diffusion Models
Paper
•
2408.11721
•
Published
•
6
Sapiens: Foundation for Human Vision Models
Paper
•
2408.12569
•
Published
•
92
Show-o: One Single Transformer to Unify Multimodal Understanding and
Generation
Paper
•
2408.12528
•
Published
•
52
Open-FinLLMs: Open Multimodal Large Language Models for Financial
Applications
Paper
•
2408.11878
•
Published
•
60
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed
Representations
Paper
•
2408.12590
•
Published
•
37
Scalable Autoregressive Image Generation with Mamba
Paper
•
2408.12245
•
Published
•
27
Real-Time Video Generation with Pyramid Attention Broadcast
Paper
•
2408.12588
•
Published
•
16
SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for
Large-scale Vision-Language Models
Paper
•
2408.12114
•
Published
•
14
Anim-Director: A Large Multimodal Model Powered Agent for Controllable
Animation Video Generation
Paper
•
2408.09787
•
Published
•
8
Building and better understanding vision-language models: insights and
future directions
Paper
•
2408.12637
•
Published
•
130
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution
Real-World Scenarios that are Difficult for Humans?
Paper
•
2408.13257
•
Published
•
27
CustomCrafter: Customized Video Generation with Preserving Motion and
Concept Composition Abilities
Paper
•
2408.13239
•
Published
•
12
Foundation Models for Music: A Survey
Paper
•
2408.14340
•
Published
•
45
LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!
Paper
•
2408.13402
•
Published
•
18
TVG: A Training-free Transition Video Generation Method with Diffusion
Models
Paper
•
2408.13413
•
Published
•
14
BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and
Deduplication by Introducing a Competitive Large Language Model Baseline
Paper
•
2408.15079
•
Published
•
55
Law of Vision Representation in MLLMs
Paper
•
2408.16357
•
Published
•
96
CogVLM2: Visual Language Models for Image and Video Understanding
Paper
•
2408.16500
•
Published
•
58
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio
Language Modeling
Paper
•
2408.16532
•
Published
•
51
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Paper
•
2408.16725
•
Published
•
54
VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time
Series Forecasters
Paper
•
2408.17253
•
Published
•
40
TableBench: A Comprehensive and Complex Benchmark for Table Question
Answering
Paper
•
2408.09174
•
Published
•
53
VideoLLaMB: Long-context Video Understanding with Recurrent Memory
Bridges
Paper
•
2409.01071
•
Published
•
28
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world
Videos
Paper
•
2409.02095
•
Published
•
37
LinFusion: 1 GPU, 1 Minute, 16K Image
Paper
•
2409.02097
•
Published
•
35
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via
Hybrid Architecture
Paper
•
2409.02889
•
Published
•
55
Attention Heads of Large Language Models: A Survey
Paper
•
2409.03752
•
Published
•
90
Open-MAGVIT2: An Open-Source Project Toward Democratizing
Auto-regressive Visual Generation
Paper
•
2409.04410
•
Published
•
26
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
Paper
•
2409.05840
•
Published
•
49
Towards a Unified View of Preference Learning for Large Language Models:
A Survey
Paper
•
2409.02795
•
Published
•
73
POINTS: Improving Your Vision-language Model with Affordable Strategies
Paper
•
2409.04828
•
Published
•
25
Benchmarking Chinese Knowledge Rectification in Large Language Models
Paper
•
2409.05806
•
Published
•
15
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Paper
•
2409.06666
•
Published
•
58
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis
Paper
•
2409.06135
•
Published
•
16
PingPong: A Benchmark for Role-Playing Language Models with User
Emulation and Multi-Model Evaluation
Paper
•
2409.06820
•
Published
•
69
MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View
Synthesis
Paper
•
2409.07129
•
Published
•
8
PiTe: Pixel-Temporal Alignment for Large Video-Language Model
Paper
•
2409.07239
•
Published
•
14
Ferret: Federated Full-Parameter Tuning at Scale for Large Language
Models
Paper
•
2409.06277
•
Published
•
16
Guiding Vision-Language Model Selection for Visual Question-Answering
Across Tasks, Domains, and Knowledge Types
Paper
•
2409.09269
•
Published
•
9
One missing piece in Vision and Language: A Survey on Comics
Understanding
Paper
•
2409.09502
•
Published
•
26
NVLM: Open Frontier-Class Multimodal LLMs
Paper
•
2409.11402
•
Published
•
75
OmniGen: Unified Image Generation
Paper
•
2409.11340
•
Published
•
115
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
Paper
•
2409.11355
•
Published
•
31
OSV: One Step is Enough for High-Quality Image to Video Generation
Paper
•
2409.11367
•
Published
•
14
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page
Document Understanding
Paper
•
2409.03420
•
Published
•
27
InstantDrag: Improving Interactivity in Drag-based Image Editing
Paper
•
2409.08857
•
Published
•
34
AudioBERT: Audio Knowledge Augmented Language Model
Paper
•
2409.08199
•
Published
•
5
LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study
Paper
•
2409.08554
•
Published
•
3
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at
Any Resolution
Paper
•
2409.12191
•
Published
•
78
Qwen2.5-Coder Technical Report
Paper
•
2409.12186
•
Published
•
148
Preference Tuning with Human Feedback on Language, Speech, and Vision
Tasks: A Survey
Paper
•
2409.11564
•
Published
•
21
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models
Paper
•
2409.12139
•
Published
•
12
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary
Resolution
Paper
•
2409.12961
•
Published
•
26
StoryMaker: Towards Holistic Consistent Characters in Text-to-image
Generation
Paper
•
2409.12576
•
Published
•
16
Imagine yourself: Tuning-Free Personalized Image Generation
Paper
•
2409.13346
•
Published
•
70
YesBut: A High-Quality Annotated Multimodal Dataset for evaluating
Satire Comprehension capability of Vision-Language Models
Paper
•
2409.13592
•
Published
•
52
Portrait Video Editing Empowered by Multimodal Generative Priors
Paper
•
2409.13591
•
Published
•
17
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language
Instructions
Paper
•
2409.15278
•
Published
•
26
Phantom of Latent for Large Language and Vision Models
Paper
•
2409.14713
•
Published
•
30
Reflecting Reality: Enabling Diffusion Models to Produce Faithful Mirror
Reflections
Paper
•
2409.14677
•
Published
•
16
MIMO: Controllable Character Video Synthesis with Spatial Decomposed
Modeling
Paper
•
2409.16160
•
Published
•
34
MonoFormer: One Transformer for Both Diffusion and Autoregression
Paper
•
2409.16280
•
Published
•
18
Seeing Faces in Things: A Model and Dataset for Pareidolia
Paper
•
2409.16143
•
Published
•
17
Attention Prompting on Image for Large Vision-Language Models
Paper
•
2409.17143
•
Published
•
7
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art
Multimodal Models
Paper
•
2409.17146
•
Published
•
114
MIO: A Foundation Model on Multimodal Tokens
Paper
•
2409.17692
•
Published
•
54
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Paper
•
2409.20566
•
Published
•
57
Visual Question Decomposition on Multimodal Large Language Models
Paper
•
2409.19339
•
Published
•
9
Loong: Generating Minute-level Long Videos with Autoregressive Language
Models
Paper
•
2410.02757
•
Published
•
37
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal
Foundation Models
Paper
•
2410.02740
•
Published
•
54
LLaVA-Critic: Learning to Evaluate Multimodal Models
Paper
•
2410.02712
•
Published
•
36
Interpreting and Editing Vision-Language Representations to Mitigate
Hallucinations
Paper
•
2410.02762
•
Published
•
9
Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short
Videos
Paper
•
2410.02763
•
Published
•
7
Addition is All You Need for Energy-efficient Language Models
Paper
•
2410.00907
•
Published
•
150
VideoGuide: Improving Video Diffusion Models without Training Through a
Teacher's Guide
Paper
•
2410.04364
•
Published
•
30
Navigating the Digital World as Humans Do: Universal Visual Grounding
for GUI Agents
Paper
•
2410.05243
•
Published
•
19
UniMuMo: Unified Text, Music and Motion Generation
Paper
•
2410.04534
•
Published
•
19
TLDR: Token-Level Detective Reward Model for Large Vision Language
Models
Paper
•
2410.04734
•
Published
•
17
OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal
Instruction
Paper
•
2410.04932
•
Published
•
9
A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive
Transformer for Efficient Finegrained Image Generation
Paper
•
2410.01912
•
Published
•
14
ControlAR: Controllable Image Generation with Autoregressive Models
Paper
•
2410.02705
•
Published
•
11
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video
Large Language Models
Paper
•
2410.03290
•
Published
•
7
Aria: An Open Multimodal Native Mixture-of-Experts Model
Paper
•
2410.05993
•
Published
•
112
Personalized Visual Instruction Tuning
Paper
•
2410.07113
•
Published
•
71
Paper
•
2410.07073
•
Published
•
66
IterComp: Iterative Composition-Aware Feedback Learning from Model
Gallery for Text-to-Image Generation
Paper
•
2410.07171
•
Published
•
44
Deciphering Cross-Modal Alignment in Large Vision-Language Models with
Modality Integration Rate
Paper
•
2410.07167
•
Published
•
40
Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation
Learning
Paper
•
2410.06373
•
Published
•
34
Pyramidal Flow Matching for Efficient Video Generative Modeling
Paper
•
2410.05954
•
Published
•
40
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark
for Video Generation
Paper
•
2410.05363
•
Published
•
46
Story-Adapter: A Training-free Iterative Framework for Long Story
Visualization
Paper
•
2410.06244
•
Published
•
19
MM-Ego: Towards Building Egocentric Multimodal LLMs
Paper
•
2410.07177
•
Published
•
22
TweedieMix: Improving Multi-Concept Fusion for Diffusion-based
Image/Video Generation
Paper
•
2410.05591
•
Published
•
13
Temporal Reasoning Transfer from Text to Video
Paper
•
2410.06166
•
Published
•
13
MLLM as Retriever: Interactively Learning Multimodal Retrieval for
Embodied Agents
Paper
•
2410.03450
•
Published
•
37
Intriguing Properties of Large Language and Vision Models
Paper
•
2410.04751
•
Published
•
16
Progressive Autoregressive Video Diffusion Models
Paper
•
2410.08151
•
Published
•
16
Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving
Vision-Linguistic Compositionality
Paper
•
2410.05210
•
Published
•
10
Self-Boosting Large Language Models with Synthetic Preference Data
Paper
•
2410.06961
•
Published
•
17
WALL-E: World Alignment by Rule Learning Improves World Model-based LLM
Agents
Paper
•
2410.07484
•
Published
•
50
Agent S: An Open Agentic Framework that Uses Computers Like a Human
Paper
•
2410.08164
•
Published
•
24
GLOV: Guided Large Language Models as Implicit Optimizers for Vision
Language Models
Paper
•
2410.06154
•
Published
•
16
Baichuan-Omni Technical Report
Paper
•
2410.08565
•
Published
•
89
From Generalist to Specialist: Adapting Vision Language Models via
Task-Specific Visual Instruction Tuning
Paper
•
2410.06456
•
Published
•
38
EvolveDirector: Approaching Advanced Text-to-Image Generation with Large
Vision-Language Models
Paper
•
2410.07133
•
Published
•
19
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large
Vision-Language Models
Paper
•
2410.10139
•
Published
•
53
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality
Documents
Paper
•
2410.10594
•
Published
•
27
MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation
Paper
•
2410.11779
•
Published
•
27
LVD-2M: A Long-take Video Dataset with Temporally Dense Captions
Paper
•
2410.10816
•
Published
•
21
Improving Long-Text Alignment for Text-to-Image Diffusion Models
Paper
•
2410.11817
•
Published
•
15
OMCAT: Omni Context Aware Transformer
Paper
•
2410.12109
•
Published
•
4
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for
Embodied AI
Paper
•
2410.11623
•
Published
•
49
HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex
Diagrams in Coding Tasks
Paper
•
2410.12381
•
Published
•
45
The Curse of Multi-Modalities: Evaluating Hallucinations of Large
Multimodal Models across Language, Visual, and Audio
Paper
•
2410.12787
•
Published
•
32
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding
and Generation
Paper
•
2410.13848
•
Published
•
35
Harnessing Webpage UIs for Text-Rich Visual Understanding
Paper
•
2410.13824
•
Published
•
32
WorldCuisines: A Massive-Scale Benchmark for Multilingual and
Multicultural Visual Question Answering on Global Cuisines
Paper
•
2410.12705
•
Published
•
33
Fluid: Scaling Autoregressive Text-to-image Generative Models with
Continuous Tokens
Paper
•
2410.13863
•
Published
•
38
MobA: A Two-Level Agent System for Efficient Mobile Task Automation
Paper
•
2410.13757
•
Published
•
33
Roadmap towards Superhuman Speech Understanding using Large Language
Models
Paper
•
2410.13268
•
Published
•
35
Movie Gen: A Cast of Media Foundation Models
Paper
•
2410.13720
•
Published
•
98
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise
Motion Control
Paper
•
2410.13830
•
Published
•
25
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language
Models
Paper
•
2410.13085
•
Published
•
22
A Comparative Study on Reasoning Patterns of OpenAI's o1 Model
Paper
•
2410.13639
•
Published
•
19
VidPanos: Generative Panoramic Videos from Casual Panning Videos
Paper
•
2410.13832
•
Published
•
13
Remember, Retrieve and Generate: Understanding Infinite Visual Concepts
as Your Personalized Assistant
Paper
•
2410.13360
•
Published
•
9
γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large
Language Models
Paper
•
2410.13859
•
Published
•
8
Can MLLMs Understand the Deep Implication Behind Chinese Images?
Paper
•
2410.13854
•
Published
•
11
FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion
Model
Paper
•
2410.13925
•
Published
•
24
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex
Capabilities
Paper
•
2410.11190
•
Published
•
22
SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation
Paper
•
2410.14745
•
Published
•
48
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a
Training-Free Memory Tree
Paper
•
2410.16268
•
Published
•
69
Baichuan Alignment Technical Report
Paper
•
2410.14940
•
Published
•
52
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Paper
•
2410.13861
•
Published
•
57
Toward Guidance-Free AR Visual Generation via Condition Contrastive
Alignment
Paper
•
2410.09347
•
Published
•
5
AutoTrain: No-code training for state-of-the-art models
Paper
•
2410.15735
•
Published
•
60
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety
and Style
Paper
•
2410.16184
•
Published
•
24
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Paper
•
2410.15316
•
Published
•
10
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid
Visual Redundancy Reduction
Paper
•
2410.17247
•
Published
•
48
Aligning Large Language Models via Self-Steering Optimization
Paper
•
2410.17131
•
Published
•
23
Improve Vision Language Model Chain-of-thought Reasoning
Paper
•
2410.16198
•
Published
•
27
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video
Even in VLMs
Paper
•
2410.16267
•
Published
•
18
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large
Vision-Language Models
Paper
•
2410.17637
•
Published
•
37
Can Knowledge Editing Really Correct Hallucinations?
Paper
•
2410.16251
•
Published
•
56
LOGO -- Long cOntext aliGnment via efficient preference Optimization
Paper
•
2410.18533
•
Published
•
44
Distill Visual Chart Reasoning Ability from LLMs to MLLMs
Paper
•
2410.18798
•
Published
•
21
Infinity-MM: Scaling Multimodal Performance with Large-Scale and
High-Quality Instruction Data
Paper
•
2410.18558
•
Published
•
20
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language
Tuning
Paper
•
2410.17779
•
Published
•
9
ROCKET-1: Master Open-World Interaction with Visual-Temporal Context
Prompting
Paper
•
2410.17856
•
Published
•
52
Continuous Speech Synthesis using per-token Latent Diffusion
Paper
•
2410.16048
•
Published
•
30
Paper
•
2410.21276
•
Published
•
86
Vision Search Assistant: Empower Vision-Language Models as Multimodal
Search Engines
Paper
•
2410.21220
•
Published
•
10
CLEAR: Character Unlearning in Textual and Visual Modalities
Paper
•
2410.18057
•
Published
•
210
Toxicity of the Commons: Curating Open-Source Pre-Training Data
Paper
•
2410.22587
•
Published
•
10
ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
Paper
•
2410.23287
•
Published
•
19
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Paper
•
2410.23218
•
Published
•
51
Personalization of Large Language Models: A Survey
Paper
•
2411.00027
•
Published
•
35
Randomized Autoregressive Visual Generation
Paper
•
2411.00776
•
Published
•
17
Face Anonymization Made Simple
Paper
•
2411.00762
•
Published
•
7
AndroidLab: Training and Systematic Benchmarking of Android Autonomous
Agents
Paper
•
2410.24024
•
Published
•
51
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum
Reinforcement Learning
Paper
•
2411.02337
•
Published
•
38
How Far is Video Generation from World Model: A Physical Law Perspective
Paper
•
2411.02385
•
Published
•
36
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated
Parameters by Tencent
Paper
•
2411.02265
•
Published
•
25
Adaptive Caching for Faster Video Generation with Diffusion Transformers
Paper
•
2411.02397
•
Published
•
24
AutoVFX: Physically Realistic Video Editing from Natural Language
Instructions
Paper
•
2411.02394
•
Published
•
17
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for
Efficient Robot Execution
Paper
•
2411.02359
•
Published
•
13
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM
Data Contamination
Paper
•
2411.03823
•
Published
•
49
Adaptive Length Image Tokenization via Recurrent Allocation
Paper
•
2411.02393
•
Published
•
13
ReCapture: Generative Video Camera Controls for User-Provided Videos
using Masked Video Fine-Tuning
Paper
•
2411.05003
•
Published
•
72
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for
Image-to-Video Generation
Paper
•
2411.04709
•
Published
•
27
M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page
Multi-document Understanding
Paper
•
2411.04952
•
Published
•
30
Needle Threading: Can LLMs Follow Threads through Near-Million-Scale
Haystacks?
Paper
•
2411.05000
•
Published
•
23
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in
Videos
Paper
•
2411.04923
•
Published
•
23
Analyzing The Language of Visual Tokens
Paper
•
2411.05001
•
Published
•
25
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
Paper
•
2411.04997
•
Published
•
40
RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned
Vision-Language Models
Paper
•
2411.04097
•
Published
•
5
OmniEdit: Building Image Editing Generalist Models Through Specialist
Supervision
Paper
•
2411.07199
•
Published
•
50
Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language
Models
Paper
•
2411.07140
•
Published
•
35
Edify Image: High-Quality Image Generation with Pixel Space Laplacian
Diffusion Models
Paper
•
2411.07126
•
Published
•
31
Add-it: Training-Free Object Insertion in Images With Pretrained
Diffusion Models
Paper
•
2411.07232
•
Published
•
67
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified
Multimodal Understanding and Generation
Paper
•
2411.07975
•
Published
•
31
Autoregressive Models in Vision: A Survey
Paper
•
2411.05902
•
Published
•
18
MagicQuill: An Intelligent Interactive Image Editing System
Paper
•
2411.09703
•
Published
•
76
Sharingan: Extract User Action Sequence from Desktop Recordings
Paper
•
2411.08768
•
Published
•
10
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper
•
2411.10440
•
Published
•
124
Region-Aware Text-to-Image Generation via Hard Binding and Soft
Refinement
Paper
•
2411.06558
•
Published
•
37
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer
Use
Paper
•
2411.10323
•
Published
•
35
Number it: Temporal Grounding Videos like Flipping Manga
Paper
•
2411.10332
•
Published
•
14
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large
Language Models on Mobile Devices
Paper
•
2411.10640
•
Published
•
47
Generative World Explorer
Paper
•
2411.11844
•
Published
•
78
AnimateAnything: Consistent and Controllable Animation for Video
Generation
Paper
•
2411.10836
•
Published
•
23
SlimLM: An Efficient Small Language Model for On-Device Document
Assistance
Paper
•
2411.09944
•
Published
•
12
Adaptive Decoding via Latent Preference Optimization
Paper
•
2411.09661
•
Published
•
10
StableV2V: Stablizing Shape Consistency in Video-to-Video Editing
Paper
•
2411.11045
•
Published
•
11
RedPajama: an Open Dataset for Training Large Language Models
Paper
•
2411.12372
•
Published
•
56
SymDPO: Boosting In-Context Learning of Large Multimodal Models with
Symbol Demonstration Direct Preference Optimization
Paper
•
2411.11909
•
Published
•
23
FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations
Paper
•
2411.10818
•
Published
•
27
ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text,
and Architectural Enhancements
Paper
•
2411.12044
•
Published
•
14
Continuous Speculative Decoding for Autoregressive Image Generation
Paper
•
2411.11925
•
Published
•
16
Enhancing the Reasoning Ability of Multimodal Large Language Models via
Mixed Preference Optimization
Paper
•
2411.10442
•
Published
•
81
Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper
•
2411.14402
•
Published
•
47
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large
Language Models
Paper
•
2411.14432
•
Published
•
26
Large Multi-modal Models Can Interpret Features in Large Multi-modal
Models
Paper
•
2411.14982
•
Published
•
16
O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple
Distillation, Big Progress or Bitter Lesson?
Paper
•
2411.16489
•
Published
•
49
One Diffusion to Generate Them All
Paper
•
2411.16318
•
Published
•
31
DreamRunner: Fine-Grained Storytelling Video Generation with
Retrieval-Augmented Motion Adaptation
Paper
•
2411.16657
•
Published
•
19
Factorized Visual Tokenization and Generation
Paper
•
2411.16681
•
Published
•
19
TEXGen: a Generative Diffusion Model for Mesh Textures
Paper
•
2411.14740
•
Published
•
17
ROICtrl: Boosting Instance Control for Visual Generation
Paper
•
2411.17949
•
Published
•
88
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper
•
2411.17465
•
Published
•
87
SketchAgent: Language-Driven Sequential Sketch Generation
Paper
•
2411.17673
•
Published
•
19
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for
Training-Free Acceleration
Paper
•
2411.17686
•
Published
•
21
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
Paper
•
2411.15296
•
Published
•
22
Large Language Model-Brained GUI Agents: A Survey
Paper
•
2411.18279
•
Published
•
32
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video
Comprehension with Video-Text Duet Interaction Format
Paper
•
2411.17991
•
Published
•
5
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Paper
•
2411.18203
•
Published
•
37
On Domain-Specific Post-Training for Multimodal Large Language Models
Paper
•
2411.19930
•
Published
•
29
Yi-Lightning Technical Report
Paper
•
2412.01253
•
Published
•
29
X-Prompt: Towards Universal In-Context Image Generation in
Auto-Regressive Vision Language Foundation Models
Paper
•
2412.01824
•
Published
•
66
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding
by Video Spatiotemporal Augmentation
Paper
•
2412.00927
•
Published
•
28
Open-Sora Plan: Open-Source Large Video Generation Model
Paper
•
2412.00131
•
Published
•
34
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction
with 3D Autonomous Characters
Paper
•
2412.00174
•
Published
•
23
VisOnlyQA: Large Vision Language Models Still Struggle with Visual
Perception of Geometric Information
Paper
•
2412.00947
•
Published
•
8
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand
Audio-Visual Information?
Paper
•
2412.02611
•
Published
•
24
PaliGemma 2: A Family of Versatile VLMs for Transfer
Paper
•
2412.03555
•
Published
•
135
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and
Generation
Paper
•
2412.03069
•
Published
•
35
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene
Understanding
Paper
•
2412.00493
•
Published
•
17
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual
Prompt Instruction Tuning
Paper
•
2412.03565
•
Published
•
11
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Paper
•
2412.04467
•
Published
•
111
Florence-VL: Enhancing Vision-Language Models with Generative Vision
Encoder and Depth-Breadth Fusion
Paper
•
2412.04424
•
Published
•
63
NVILA: Efficient Frontier Visual Language Models
Paper
•
2412.04468
•
Published
•
60
Negative Token Merging: Image-based Adversarial Feature Guidance
Paper
•
2412.01339
•
Published
•
23
Personalized Multimodal Large Language Models: A Survey
Paper
•
2412.02142
•
Published
•
14
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
Paper
•
2412.01169
•
Published
•
13
p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay
Paper
•
2412.04449
•
Published
•
7
Scaling Inference-Time Search with Vision Value Model for Improved
Visual Comprehension
Paper
•
2412.03704
•
Published
•
7
Expanding Performance Boundaries of Open-Source Multimodal Models with
Model, Data, and Test-Time Scaling
Paper
•
2412.05271
•
Published
•
155
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at
Scale
Paper
•
2412.05237
•
Published
•
48
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
Paper
•
2412.04814
•
Published
•
49
SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step
Diffusion
Paper
•
2412.04301
•
Published
•
39
CompCap: Improving Multimodal Large Language Models with Composite
Captions
Paper
•
2412.05243
•
Published
•
19
Mind the Time: Temporally-Controlled Multi-Event Video Generation
Paper
•
2412.05263
•
Published
•
11
BigDocs: An Open and Permissively-Licensed Dataset for Training
Multimodal Models on Document and Code Tasks
Paper
•
2412.04626
•
Published
•
14
Training Large Language Models to Reason in a Continuous Latent Space
Paper
•
2412.06769
•
Published
•
82
Around the World in 80 Timesteps: A Generative Approach to Global Visual
Geolocation
Paper
•
2412.06781
•
Published
•
21
Maya: An Instruction Finetuned Multilingual Multimodal Model
Paper
•
2412.07112
•
Published
•
29
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
Paper
•
2412.04432
•
Published
•
16
Exploring Multi-Grained Concept Annotations for Multimodal Large
Language Models
Paper
•
2412.05939
•
Published
•
16
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for
Customized Manga Generation
Paper
•
2412.07589
•
Published
•
49
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
Paper
•
2412.03548
•
Published
•
17
POINTS1.5: Building a Vision-Language Model towards Real World
Applications
Paper
•
2412.08443
•
Published
•
39
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex
Image-Text Models with Structural Annotations
Paper
•
2412.08580
•
Published
•
46
MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation
Paper
•
2412.07147
•
Published
•
5
StreamChat: Chatting with Streaming Video
Paper
•
2412.08646
•
Published
•
18
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for
Long-term Streaming Video and Audio Interactions
Paper
•
2412.09596
•
Published
•
99
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity
Visual Descriptions
Paper
•
2412.08737
•
Published
•
54
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
Paper
•
2412.09501
•
Published
•
49
Multimodal Latent Language Modeling with Next-Token Diffusion
Paper
•
2412.08635
•
Published
•
45
EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via
Multimodal LLM
Paper
•
2412.09618
•
Published
•
21
VisionArena: 230K Real World User-VLM Conversations with Preference
Labels
Paper
•
2412.08687
•
Published
•
13
Arbitrary-steps Image Super-resolution via Diffusion Inversion
Paper
•
2412.09013
•
Published
•
13
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper
•
2412.10360
•
Published
•
147
GenEx: Generating an Explorable World
Paper
•
2412.09624
•
Published
•
96
InstanceCap: Improving Text-to-Video Generation via Instance-aware
Structured Caption
Paper
•
2412.09283
•
Published
•
19
Multimodal Music Generation with Explicit Bridges and Retrieval
Augmentation
Paper
•
2412.09428
•
Published
•
7
SynerGen-VL: Towards Synergistic Image Understanding and Generation with
Vision Experts and Token Folding
Paper
•
2412.09604
•
Published
•
38
Byte Latent Transformer: Patches Scale Better Than Tokens
Paper
•
2412.09871
•
Published
•
97
BrushEdit: All-In-One Image Inpainting and Editing
Paper
•
2412.10316
•
Published
•
35
VidTok: A Versatile and Open-Source Video Tokenizer
Paper
•
2412.13061
•
Published
•
8
Paper
•
2412.13501
•
Published
•
29
Progressive Multimodal Reasoning via Active Retrieval
Paper
•
2412.14835
•
Published
•
74
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
Paper
•
2412.14475
•
Published
•
55
Descriptive Caption Enhancement with Visual Specialists for Multimodal
Perception
Paper
•
2412.14233
•
Published
•
6
Large Motion Video Autoencoding with Cross-modal Video VAE
Paper
•
2412.17805
•
Published
•
24
Friends-MMC: A Dataset for Multi-modal Multi-party Conversation
Understanding
Paper
•
2412.17295
•
Published
•
9
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
Paper
•
2412.15213
•
Published
•
29
Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion
Paper
•
2412.14462
•
Published
•
15
AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal
Audio-Video Generation
Paper
•
2412.15191
•
Published
•
5
Parallelized Autoregressive Visual Generation
Paper
•
2412.15119
•
Published
•
54
Taming Multimodal Joint Training for High-Quality Video-to-Audio
Synthesis
Paper
•
2412.15322
•
Published
•
18
Sequence Matters: Harnessing Video Models in 3D Super-Resolution
Paper
•
2412.11525
•
Published
•
11
Diving into Self-Evolving Training for Multimodal Reasoning
Paper
•
2412.17451
•
Published
•
44
Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models
with Flow Matching
Paper
•
2412.17153
•
Published
•
37
NILE: Internal Consistency Alignment in Large Language Models
Paper
•
2412.16686
•
Published
•
8
DepthLab: From Partial to Complete
Paper
•
2412.18153
•
Published
•
37
3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D
Scene Understanding
Paper
•
2412.18450
•
Published
•
37
Fourier Position Embedding: Enhancing Attention's Periodic Extension for
Length Generalization
Paper
•
2412.17739
•
Published
•
42
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion
Transformer for Tuning-Free Multi-Prompt Longer Video Generation
Paper
•
2412.18597
•
Published
•
19
How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation
System?
Paper
•
2412.18495
•
Published
•
9
Video-Panda: Parameter-efficient Alignment for Encoder-free
Video-Language Models
Paper
•
2412.18609
•
Published
•
18
Bridging the Data Provenance Gap Across Text, Speech and Video
Paper
•
2412.17847
•
Published
•
9
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via
Collective Monte Carlo Tree Search
Paper
•
2412.18319
•
Published
•
40
YuLan-Mini: An Open Data-efficient Language Model
Paper
•
2412.17743
•
Published
•
67
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
Paper
•
2412.18072
•
Published
•
19
Molar: Multimodal LLMs with Collaborative Filtering Alignment for
Enhanced Sequential Recommendation
Paper
•
2412.18176
•
Published
•
16
Paper
•
2412.18653
•
Published
•
84
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive
Survey
Paper
•
2412.18619
•
Published
•
58
Task Preference Optimization: Improving Multimodal Large Language Models
with Vision Task Alignment
Paper
•
2412.19326
•
Published
•
18
Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging
Paper
•
2412.19512
•
Published
•
8
Explanatory Instructions: Towards Unified Vision Tasks Understanding and
Zero-shot Generalization
Paper
•
2412.18525
•
Published
•
76
Edicho: Consistent Image Editing in the Wild
Paper
•
2412.21079
•
Published
•
23
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow
Matching and Clap-Ranked Preference Optimization
Paper
•
2412.21037
•
Published
•
24
Are Vision-Language Models Truly Understanding Multi-vision Sensor?
Paper
•
2412.20750
•
Published
•
20
2.5 Years in Class: A Multimodal Textbook for Vision-Language
Pretraining
Paper
•
2501.00958
•
Published
•
107
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion
Control
Paper
•
2501.01427
•
Published
•
55
LTX-Video: Realtime Video Latent Diffusion
Paper
•
2501.00103
•
Published
•
47
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with
Video LLM
Paper
•
2501.00599
•
Published
•
48
MLLM-as-a-Judge for Image Safety without Human Labeling
Paper
•
2501.00192
•
Published
•
30
A3: Android Agent Arena for Mobile GUI Agents
Paper
•
2501.01149
•
Published
•
22
Unifying Specialized Visual Encoders for Video Language Models
Paper
•
2501.01426
•
Published
•
21
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Paper
•
2501.01957
•
Published
•
46
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One
Vision Token
Paper
•
2501.03895
•
Published
•
53
MotionBench: Benchmarking and Improving Fine-grained Video Motion
Understanding for Vision Language Models
Paper
•
2501.02955
•
Published
•
45
Cosmos World Foundation Model Platform for Physical AI
Paper
•
2501.03575
•
Published
•
78
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language
Models
Paper
•
2501.03262
•
Published
•
99
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of
Images and Videos
Paper
•
2501.04001
•
Published
•
46
OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment
across Language with Real-time Self-Aware Emotional Speech Synthesis
Paper
•
2501.04561
•
Published
•
16
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning
and Reflection
Paper
•
2501.04575
•
Published
•
24
Search-o1: Agentic Search-Enhanced Large Reasoning Models
Paper
•
2501.05366
•
Published
•
102
DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich
Paradigm for Direct Preference Optimization
Paper
•
2501.03271
•
Published
•
11
The GAN is dead; long live the GAN! A Modern GAN Baseline
Paper
•
2501.05441
•
Published
•
92
Enhancing Human-Like Responses in Large Language Models
Paper
•
2501.05032
•
Published
•
55
An Empirical Study of Autoregressive Pre-training from Videos
Paper
•
2501.05453
•
Published
•
42
Centurio: On Drivers of Multilingual Ability of Large Vision-Language
Models
Paper
•
2501.05122
•
Published
•
20
On Computational Limits and Provably Efficient Criteria of Visual
Autoregressive Models: A Fine-Grained Complexity Analysis
Paper
•
2501.04377
•
Published
•
14
VideoRAG: Retrieval-Augmented Generation over Video Corpus
Paper
•
2501.05874
•
Published
•
72
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Paper
•
2501.06186
•
Published
•
66
Migician: Revealing the Magic of Free-Form Multi-Image Grounding in
Multimodal Large Language Models
Paper
•
2501.05767
•
Published
•
30
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video
Understanding?
Paper
•
2501.05510
•
Published
•
44
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
Paper
•
2501.06282
•
Published
•
51
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token
Marks
Paper
•
2501.08326
•
Published
•
35
MatchAnything: Universal Cross-Modality Image Matching with Large-Scale
Pre-Training
Paper
•
2501.07556
•
Published
•
5
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
Paper
•
2501.08828
•
Published
•
32
RepVideo: Rethinking Cross-Layer Representation for Video Generation
Paper
•
2501.08994
•
Published
•
15
ReFocus: Visual Editing as a Chain of Thought for Structured Image
Understanding
Paper
•
2501.05452
•
Published
•
15
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
Paper
•
2501.05707
•
Published
•
20
VideoAuteur: Towards Long Narrative Video Generation
Paper
•
2501.06173
•
Published
•
34
SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
Paper
•
2501.06842
•
Published
•
16
Evaluating Sample Utility for Data Selection by Mimicking Model Weights
Paper
•
2501.06708
•
Published
•
5
MiniMax-01: Scaling Foundation Models with Lightning Attention
Paper
•
2501.08313
•
Published
•
285
Democratizing Text-to-Image Masked Generative Models with Compact
Text-Aware One-Dimensional Tokens
Paper
•
2501.07730
•
Published
•
17
HALoGEN: Fantastic LLM Hallucinations and Where to Find Them
Paper
•
2501.08292
•
Published
•
17
Tarsier2: Advancing Large Vision-Language Models from Detailed Video
Description to Comprehensive Video Understanding
Paper
•
2501.07888
•
Published
•
15
OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for
LLM Training
Paper
•
2501.08197
•
Published
•
8
Parameter-Inverted Image Pyramid Networks for Visual Perception and
Multimodal Understanding
Paper
•
2501.07783
•
Published
•
7
MINIMA: Modality Invariant Image Matching
Paper
•
2412.19412
•
Published
•
4
OmniThink: Expanding Knowledge Boundaries in Machine Writing through
Thinking
Paper
•
2501.09751
•
Published
•
49
Learnings from Scaling Visual Tokenizers for Reconstruction and
Generation
Paper
•
2501.09755
•
Published
•
37
Do generative video models learn physical principles from watching
videos?
Paper
•
2501.09038
•
Published
•
34
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Paper
•
2501.09747
•
Published
•
23
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
Paper
•
2501.09781
•
Published
•
29
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Paper
•
2501.12380
•
Published
•
86
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
Paper
•
2501.11733
•
Published
•
29
Can We Generate Images with CoT? Let's Verify and Reinforce Image
Generation Step by Step
Paper
•
2501.13926
•
Published
•
42
Baichuan-Omni-1.5 Technical Report
Paper
•
2501.15368
•
Published
•
63
Qwen2.5-1M Technical Report
Paper
•
2501.15383
•
Published
•
70
Towards General-Purpose Model-Free Reinforcement Learning
Paper
•
2501.16142
•
Published
•
29
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for
Speech Generation
Paper
•
2501.15907
•
Published
•
16
Are Vision Language Models Texture or Shape Biased and Can We Steer
Them?
Paper
•
2403.09193
•
Published
•
9
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model
Post-training
Paper
•
2501.17161
•
Published
•
120
PixelWorld: Towards Perceiving Everything as Pixels
Paper
•
2501.19339
•
Published
•
17
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human
Animation Models
Paper
•
2502.01061
•
Published
•
212
Process Reinforcement through Implicit Rewards
Paper
•
2502.01456
•
Published
•
60
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal
Understanding
Paper
•
2502.01341
•
Published
•
39
Paper
•
2501.14249
•
Published
•
72
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video
Understanding
Paper
•
2501.13106
•
Published
•
91
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Paper
•
2501.12599
•
Published
•
113
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative
Textual Feedback
Paper
•
2501.12895
•
Published
•
60
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
Paper
•
2501.12948
•
Published
•
381
Token Assorted: Mixing Latent and Text Tokens for Improved Language
Model Reasoning
Paper
•
2502.03275
•
Published
•
17
Analyze Feature Flow to Enhance Interpretation and Steering in Language
Models
Paper
•
2502.03032
•
Published
•
60
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive
Modality Alignment
Paper
•
2502.04328
•
Published
•
30
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
Paper
•
2502.05173
•
Published
•
65
Fast Video Generation with Sliding Tile Attention
Paper
•
2502.04507
•
Published
•
51
Goku: Flow Based Video Generative Foundation Models
Paper
•
2502.04896
•
Published
•
104
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth
Approach
Paper
•
2502.05171
•
Published
•
137
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive
Multimodal Understanding and Generation
Paper
•
2502.05178
•
Published
•
10
On-device Sora: Enabling Diffusion-Based Text-to-Video Generation for
Mobile Devices
Paper
•
2502.04363
•
Published
•
12
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time
Scaling
Paper
•
2502.06703
•
Published
•
150
Scaling Pre-training to One Hundred Billion Data for Vision Language
Models
Paper
•
2502.07617
•
Published
•
29
Expect the Unexpected: FailSafe Long Context QA for Finance
Paper
•
2502.06329
•
Published
•
131
Magic 1-For-1: Generating One Minute Video Clips within One Minute
Paper
•
2502.07701
•
Published
•
36
Light-A-Video: Training-free Video Relighting via Progressive Light
Fusion
Paper
•
2502.08590
•
Published
•
44
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation
Paper
•
2502.07870
•
Published
•
44
WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation
Paper
•
2502.08047
•
Published
•
27
TransMLA: Multi-head Latent Attention Is All You Need
Paper
•
2502.07864
•
Published
•
49
mmE5: Improving Multimodal Multilingual Embeddings via High-quality
Synthetic Data
Paper
•
2502.08468
•
Published
•
13
The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of
Physical Concept Understanding
Paper
•
2502.08946
•
Published
•
193
Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient
Text-to-Image Generation
Paper
•
2502.08690
•
Published
•
43
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language
Models for Vision-Driven Embodied Agents
Paper
•
2502.09560
•
Published
•
36
ZeroBench: An Impossible Visual Benchmark for Contemporary Large
Multimodal Models
Paper
•
2502.09696
•
Published
•
43
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of
Video Foundation Model
Paper
•
2502.10248
•
Published
•
55
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
Paper
•
2502.10391
•
Published
•
34
Large Language Diffusion Models
Paper
•
2502.09992
•
Published
•
112
Learning Getting-Up Policies for Real-World Humanoid Robots
Paper
•
2502.12152
•
Published
•
41
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse
Attention
Paper
•
2502.11089
•
Published
•
153
How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on
Continual Pre-Training
Paper
•
2502.11196
•
Published
•
22
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning
in Diffusion Models
Paper
•
2502.10458
•
Published
•
35
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and
Generation
Paper
•
2502.12148
•
Published
•
16
Intuitive physics understanding emerges from self-supervised pretraining
on natural videos
Paper
•
2502.11831
•
Published
•
18
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
Paper
•
2502.11775
•
Published
•
8
Ask in Any Modality: A Comprehensive Survey on Multimodal
Retrieval-Augmented Generation
Paper
•
2502.08826
•
Published
•
17
ILIAS: Instance-Level Image retrieval At Scale
Paper
•
2502.11748
•
Published
•
4
Soundwave: Less is More for Speech-Text Alignment in LLMs
Paper
•
2502.12900
•
Published
•
84
Continuous Diffusion Model for Language Modeling
Paper
•
2502.11564
•
Published
•
53
Phantom: Subject-consistent video generation via cross-modal alignment
Paper
•
2502.11079
•
Published
•
58
Magma: A Foundation Model for Multimodal AI Agents
Paper
•
2502.13130
•
Published
•
58
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance
Software Engineering?
Paper
•
2502.12115
•
Published
•
45
Multimodal Mamba: Decoder-only Multimodal State Space Model via
Quadratic to Linear Distillation
Paper
•
2502.13145
•
Published
•
38
RealSyn: An Effective and Scalable Multimodal Interleaved Document
Transformation Paradigm
Paper
•
2502.12513
•
Published
•
15
Harnessing Vision Models for Time Series Analysis: A Survey
Paper
•
2502.08869
•
Published
•
2
Qwen2.5-VL Technical Report
Paper
•
2502.13923
•
Published
•
181
On the Trustworthiness of Generative Foundation Models: Guideline,
Assessment, and Perspective
Paper
•
2502.14296
•
Published
•
46
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic
Understanding, Localization, and Dense Features
Paper
•
2502.14786
•
Published
•
142
How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?
Paper
•
2502.14502
•
Published
•
90
LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in
Vision-Language Models
Paper
•
2502.14834
•
Published
•
24
Does Time Have Its Place? Temporal Heads: Where Language Models Recall
Time-specific Information
Paper
•
2502.14258
•
Published
•
26
PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex
Task Automation on PC
Paper
•
2502.14282
•
Published
•
20
How to Get Your LLM to Generate Challenging Problems for Evaluation
Paper
•
2502.14678
•
Published
•
17
Dynamic Concepts Personalization from Single Videos
Paper
•
2502.14844
•
Published
•
16
Scaling Text-Rich Image Understanding via Code-Guided Synthetic
Multimodal Data Generation
Paper
•
2502.14846
•
Published
•
13
NAVIG: Natural Language-guided Analysis with Vision Language Models for
Image Geo-localization
Paper
•
2502.14638
•
Published
•
11
From RAG to Memory: Non-Parametric Continual Learning for Large Language
Models
Paper
•
2502.14802
•
Published
•
13
Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the
Limits of Embedding Space Capacity
Paper
•
2502.13063
•
Published
•
69
VLM^2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit
Matching Visual Cues
Paper
•
2502.12084
•
Published
•
29
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context
Memory of Transformers
Paper
•
2502.15007
•
Published
•
172
SurveyX: Academic Survey Automation via Large Language Models
Paper
•
2502.14776
•
Published
•
97
PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data
Paper
•
2502.14397
•
Published
•
41
DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
Paper
•
2502.17157
•
Published
•
53
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for
Multimodal Reasoning Models
Paper
•
2502.16033
•
Published
•
17
Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon
Robotic Manipulation
Paper
•
2502.16707
•
Published
•
13
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
Paper
•
2502.18411
•
Published
•
73
SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference
Paper
•
2502.18137
•
Published
•
55
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open
Software Evolution
Paper
•
2502.18449
•
Published
•
73
KV-Edit: Training-Free Image Editing for Precise Background Preservation
Paper
•
2502.17363
•
Published
•
36
ART: Anonymous Region Transformer for Variable Multi-Layer Transparent
Image Generation
Paper
•
2502.18364
•
Published
•
36
K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs
Paper
•
2502.18461
•
Published
•
15
Introducing Visual Perception Token into Multimodal Large Language Model
Paper
•
2502.17425
•
Published
•
15
MLLMs Know Where to Look: Training-free Perception of Small Visual
Details with Multimodal LLMs
Paper
•
2502.17422
•
Published
•
7
LDGen: Enhancing Text-to-Image Synthesis via Large Language Model-Driven
Language Representation
Paper
•
2502.18302
•
Published
•
5
Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI
Paper
•
2502.17092
•
Published
•
3
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem
Understanding
Paper
•
2502.19400
•
Published
•
48
Towards an AI co-scientist
Paper
•
2502.18864
•
Published
•
48
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language
Models (VLMs) via Reinforcement Learning
Paper
•
2502.19634
•
Published
•
63
UniTok: A Unified Tokenizer for Visual Generation and Understanding
Paper
•
2502.20321
•
Published
•
30
Multimodal Representation Alignment for Image Generation: Text-Image
Interleaved Control Is Easier Than You Think
Paper
•
2502.20172
•
Published
•
28
HAIC: Improving Human Action Understanding and Generation with Better
Captions for Multi-modal Large Language Models
Paper
•
2502.20811
•
Published
•
2
Chain of Draft: Thinking Faster by Writing Less
Paper
•
2502.18600
•
Published
•
47
Tell me why: Visual foundation models as self-explainable classifiers
Paper
•
2502.19577
•
Published
•
10
SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers
Paper
•
2502.20545
•
Published
•
20
MIGE: A Unified Framework for Multimodal Instruction-Based Image
Generation and Editing
Paper
•
2502.21291
•
Published
•
5
Predictive Data Selection: The Data That Predicts Is the Data That
Teaches
Paper
•
2503.00808
•
Published
•
57
Visual-RFT: Visual Reinforcement Fine-Tuning
Paper
•
2503.01785
•
Published
•
76
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language
Models via Mixture-of-LoRAs
Paper
•
2503.01743
•
Published
•
83
Qilin: A Multimodal Information Retrieval Dataset with APP-level User
Sessions
Paper
•
2503.00501
•
Published
•
11
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open
Language Models
Paper
•
2402.03300
•
Published
•
115
DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in
Multimodal Cycles
Paper
•
2503.03651
•
Published
•
16
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended
Language Interface
Paper
•
2503.01342
•
Published
•
8
From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence
Generation up to 100K Tokens
Paper
•
2502.18890
•
Published
•
28
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from
Inputs
Paper
•
2503.02003
•
Published
•
46
Process-based Self-Rewarding Language Models
Paper
•
2503.03746
•
Published
•
39
CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time
Cognitive Task Solving and Reasoning in UAVs
Paper
•
2503.01378
•
Published
•
3
Token-Efficient Long Video Understanding for Multimodal LLMs
Paper
•
2503.04130
•
Published
•
93
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
Paper
•
2503.04724
•
Published
•
69
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding
and Expert Reasoning Abilities
Paper
•
2503.03983
•
Published
•
22
How to Steer LLM Latents for Hallucination Detection?
Paper
•
2503.01917
•
Published
•
11
The Best of Both Worlds: Integrating Language Models and Diffusion
Models for Video Generation
Paper
•
2503.04606
•
Published
•
9
Unified Reward Model for Multimodal Understanding and Generation
Paper
•
2503.05236
•
Published
•
117
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
Paper
•
2503.05132
•
Published
•
55
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive
Cognitive-Inspired Sketching
Paper
•
2503.05179
•
Published
•
44
S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following
with Paralinguistic Information
Paper
•
2503.05085
•
Published
•
47
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with
Reinforcing Learning
Paper
•
2503.05379
•
Published
•
34
VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play
Context Control
Paper
•
2503.05639
•
Published
•
22
TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos
via Diffusion Models
Paper
•
2503.05638
•
Published
•
18
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale
Reinforcement Learning
Paper
•
2503.07365
•
Published
•
57
Automated Movie Generation via Multi-Agent CoT Planning
Paper
•
2503.07314
•
Published
•
43
Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue
Learning
Paper
•
2503.07002
•
Published
•
39
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large
Language Models
Paper
•
2503.06749
•
Published
•
27
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural
Vision-Language Dataset for Southeast Asia
Paper
•
2503.07920
•
Published
•
97
MagicInfinite: Generating Infinite Talking Videos with Your Words and
Voice
Paper
•
2503.05978
•
Published
•
35
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through
Two-Stage Rule-Based RL
Paper
•
2503.07536
•
Published
•
84
Video Action Differencing
Paper
•
2503.07860
•
Published
•
32
UniF^2ace: Fine-grained Face Understanding and Generation
with Unified Multimodal Models
Paper
•
2503.08120
•
Published
•
31
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by
Imitating Human Annotator Trajectories
Paper
•
2503.08625
•
Published
•
26
Implicit Reasoning in Transformers is Reasoning through Shortcuts
Paper
•
2503.07604
•
Published
•
21
LightGen: Efficient Image Generation through Knowledge Distillation and
Direct Preference Optimization
Paper
•
2503.08619
•
Published
•
20
EasyControl: Adding Efficient and Flexible Control for Diffusion
Transformer
Paper
•
2503.07027
•
Published
•
28
LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted
Contrastive Learning
Paper
•
2503.04812
•
Published
•
14
Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
Paper
•
2503.02199
•
Published
•
8
Seedream 2.0: A Native Chinese-English Bilingual Image Generation
Foundation Model
Paper
•
2503.07703
•
Published
•
35
Gemini Embedding: Generalizable Embeddings from Gemini
Paper
•
2503.07891
•
Published
•
36
OmniMamba: Efficient and Unified Multimodal Understanding and Generation
via State Space Models
Paper
•
2503.08686
•
Published
•
18
CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic
Audiovisual Narrative Processing
Paper
•
2503.06940
•
Published
•
11
Transformers without Normalization
Paper
•
2503.10622
•
Published
•
155
Charting and Navigating Hugging Face's Model Atlas
Paper
•
2503.10633
•
Published
•
76
World Modeling Makes a Better Planner: Dual Preference Optimization for
Embodied Task Planning
Paper
•
2503.10480
•
Published
•
49
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model
for Visual Generation and Editing
Paper
•
2503.10639
•
Published
•
48
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Paper
•
2503.10291
•
Published
•
34
4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large
Language Models
Paper
•
2503.10437
•
Published
•
31
CoRe^2: Collect, Reflect and Refine to Generate Better and Faster
Paper
•
2503.09662
•
Published
•
33
OmniPaint: Mastering Object-Oriented Editing via Disentangled
Insertion-Removal Inpainting
Paper
•
2503.08677
•
Published
•
28
Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and
Beyond
Paper
•
2503.10460
•
Published
•
27
GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
Paper
•
2503.10596
•
Published
•
18
R1-Onevision: Advancing Generalized Multimodal Reasoning through
Cross-Modal Formalization
Paper
•
2503.10615
•
Published
•
16
Open-Sora 2.0: Training a Commercial-Level Video Generation Model in
$200k
Paper
•
2503.09642
•
Published
•
17
PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference
Time by Leveraging Sparsity
Paper
•
2503.07677
•
Published
•
82
ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
Paper
•
2503.11647
•
Published
•
132
GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories
Generation in End-to-End Autonomous Driving
Paper
•
2503.05689
•
Published
•
3
SmolDocling: An ultra-compact vision-language model for end-to-end
multi-modal document conversion
Paper
•
2503.11576
•
Published
•
94
Large-scale Pre-training for Grounded Video Caption Generation
Paper
•
2503.10781
•
Published
•
16
ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model
with Interleaved Multimodal Generation via Asymmetric Synergy
Paper
•
2503.06542
•
Published
•
8
DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal
Consistent Video Generation
Paper
•
2503.06053
•
Published
•
136
Being-0: A Humanoid Robotic Agent with Vision-Language Models and
Modular Skills
Paper
•
2503.12533
•
Published
•
63
DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale
Text-to-Image Models
Paper
•
2503.12885
•
Published
•
43
Edit Transfer: Learning Image Editing via Vision In-Context Relations
Paper
•
2503.13327
•
Published
•
28
BlobCtrl: A Unified and Flexible Framework for Element-level Image
Generation and Editing
Paper
•
2503.13434
•
Published
•
25
R1-VL: Learning to Reason with Multimodal Large Language Models via
Step-wise Group Relative Policy Optimization
Paper
•
2503.12937
•
Published
•
27
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Paper
•
2503.12605
•
Published
•
33
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs
for Knowledge-Intensive Visual Grounding
Paper
•
2503.12797
•
Published
•
29
Aligning Multimodal LLM with Human Preference: A Survey
Paper
•
2503.14504
•
Published
•
22
Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal
Control
Paper
•
2503.14492
•
Published
•
17
TULIP: Towards Unified Language-Image Pretraining
Paper
•
2503.15485
•
Published
•
44
φ-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time
Exploration and Exploitation
Paper
•
2503.13288
•
Published
•
49
Temporal Regularization Makes Your Video Generator Stronger
Paper
•
2503.15417
•
Published
•
21
VERIFY: A Benchmark of Visual Explanation and Reasoning for
Investigating Multimodal Reasoning Fidelity
Paper
•
2503.11557
•
Published
•
20
Stop Overthinking: A Survey on Efficient Reasoning for Large Language
Models
Paper
•
2503.16419
•
Published
•
68
Unleashing Vecset Diffusion Model for Fast Shape Generation
Paper
•
2503.16302
•
Published
•
43
DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers
Paper
•
2503.14487
•
Published
•
27
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play
Visual Games with Keyboards and Mouse
Paper
•
2503.16365
•
Published
•
38
InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
Paper
•
2503.16418
•
Published
•
34
Ultra-Resolution Adaptation with Ease
Paper
•
2503.16322
•
Published
•
13
M3: 3D-Spatial MultiModal Memory
Paper
•
2503.16413
•
Published
•
15
See-Saw Modality Balance: See Gradient, and Sew Impaired Vision-Language
Balance to Mitigate Dominant Modality Bias
Paper
•
2503.13834
•
Published
•
5
Expert Race: A Flexible Routing Strategy for Scaling Diffusion
Transformer with Mixture of Experts
Paper
•
2503.16057
•
Published
•
14
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Paper
•
2503.14476
•
Published
•
117
RWKV-7 "Goose" with Expressive Dynamic State Evolution
Paper
•
2503.14456
•
Published
•
137
Paper
•
2503.14378
•
Published
•
59
Reinforcement Learning for Reasoning in Small LLMs: What Works and What
Doesn't
Paper
•
2503.16219
•
Published
•
46
Inside-Out: Hidden Factual Knowledge in LLMs
Paper
•
2503.15299
•
Published
•
53
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Paper
•
2503.15558
•
Published
•
45
Where do Large Vision-Language Models Look at when Answering Questions?
Paper
•
2503.13891
•
Published
•
8
MAPS: A Multi-Agent Framework Based on Big Seven Personality and
Socratic Guidance for Multimodal Scientific Problem Solving
Paper
•
2503.16905
•
Published
•
53
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning
via Iterative Self-Improvement
Paper
•
2503.17352
•
Published
•
22
Bridging Continuous and Discrete Tokens for Autoregressive Visual
Generation
Paper
•
2503.16430
•
Published
•
34
When Preferences Diverge: Aligning Diffusion Models with Minority-Aware
Adaptive DPO
Paper
•
2503.16921
•
Published
•
6
From Head to Tail: Towards Balanced Representation in Large
Vision-Language Models through Adaptive Data Calibration
Paper
•
2503.12821
•
Published
•
9
MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical
Problems
Paper
•
2503.16549
•
Published
•
13
Why Do Multi-Agent LLM Systems Fail?
Paper
•
2503.13657
•
Published
•
42
When Less is Enough: Adaptive Token Reduction for Efficient Image
Representation
Paper
•
2503.16660
•
Published
•
71
Can Large Vision Language Models Read Maps Like a Human?
Paper
•
2503.14607
•
Published
•
9
GAEA: A Geolocation Aware Conversational Model
Paper
•
2503.16423
•
Published
•
6
I Have Covered All the Bases Here: Interpreting Reasoning Features in
Large Language Models via Sparse Autoencoders
Paper
•
2503.18878
•
Published
•
114
Video-T1: Test-Time Scaling for Video Generation
Paper
•
2503.18942
•
Published
•
86
SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for
Open Base Models in the Wild
Paper
•
2503.18892
•
Published
•
29
Aether: Geometric-Aware Unified World Modeling
Paper
•
2503.18945
•
Published
•
27
Judge Anything: MLLM as a Judge Across Any Modality
Paper
•
2503.17489
•
Published
•
19
Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models
via Vision-Guided Reinforcement Learning
Paper
•
2503.18013
•
Published
•
18
Mind with Eyes: from Language Reasoning to Multimodal Reasoning
Paper
•
2503.18071
•
Published
•
3
Exploring Hallucination of Large Multimodal Models in Video
Understanding: Benchmark, Analysis and Mitigation
Paper
•
2503.19622
•
Published
•
29
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
Paper
•
2503.18931
•
Published
•
29
Long-Context Autoregressive Video Modeling with Next-Frame Prediction
Paper
•
2503.19325
•
Published
•
71
Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection
with Artifact Explanation
Paper
•
2503.14905
•
Published
•
19
When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only
Training For Human-Centered Decision Making
Paper
•
2503.16965
•
Published
•
4
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
Paper
•
2503.19990
•
Published
•
33
Dita: Scaling Diffusion Transformer for Generalist
Vision-Language-Action Policy
Paper
•
2503.19757
•
Published
•
50
GenHancer: Imperfect Generative Models are Secretly Strong
Vision-Centric Enhancers
Paper
•
2503.19480
•
Published
•
15
Qwen2.5-Omni Technical Report
Paper
•
2503.20215
•
Published
•
134
Wan: Open and Advanced Large-Scale Video Generative Models
Paper
•
2503.20314
•
Published
•
48
Open Deep Search: Democratizing Search with Open-source Reasoning Agents
Paper
•
2503.20201
•
Published
•
44
Beyond Words: Advancing Long-Text Image Generation via Multimodal
Autoregressive Models
Paper
•
2503.20198
•
Published
•
4
Video-R1: Reinforcing Video Reasoning in MLLMs
Paper
•
2503.21776
•
Published
•
76
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement
Learning
Paper
•
2503.21620
•
Published
•
58
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for
Embodied Interactive Tasks
Paper
•
2503.21696
•
Published
•
21
A Survey of Efficient Reasoning for Large Reasoning Models: Language,
Multimodality, and Beyond
Paper
•
2503.21614
•
Published
•
39
OThink-MR1: Stimulating multimodal generalized reasoning capabilities
via dynamic reinforcement learning
Paper
•
2503.16081
•
Published
•
26
Your ViT is Secretly an Image Segmentation Model
Paper
•
2503.19108
•
Published
•
20
On Large Multimodal Models as Open-World Image Classifiers
Paper
•
2503.21851
•
Published
•
5
TextCrafter: Accurately Rendering Multiple Texts in Complex Visual
Scenes
Paper
•
2503.23461
•
Published
•
93
Any2Caption: Interpreting Any Condition to Caption for Controllable Video
Generation
Paper
•
2503.24379
•
Published
•
74
Exploring the Effect of Reinforcement Learning on Video Understanding:
Insights from SEED-Bench-R1
Paper
•
2503.24376
•
Published
•
37
Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal
LLMs on Academic Resources
Paper
•
2504.00595
•
Published
•
34
Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on
Elementary School-Level Reasoning Problems?
Paper
•
2504.00509
•
Published
•
21
MoCha: Towards Movie-Grade Talking Character Synthesis
Paper
•
2503.23307
•
Published
•
121
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement
Learning on the Base Model
Paper
•
2503.24290
•
Published
•
61
Unicorn: Text-Only Data Synthesis for Vision Language Model Training
Paper
•
2503.22655
•
Published
•
37
TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through
Task Tokenization
Paper
•
2503.19901
•
Published
•
35
Expanding RL with Verifiable Rewards Across Diverse Domains
Paper
•
2503.23829
•
Published
•
18
Paper
•
2504.00927
•
Published
•
43
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming
Video Contexts
Paper
•
2503.22952
•
Published
•
18
Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
Paper
•
2504.00557
•
Published
•
15
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
Paper
•
2504.00072
•
Published
•
7
MergeVQ: A Unified Framework for Visual Generation and Representation
with Disentangled Token Merging and Quantization
Paper
•
2504.00999
•
Published
•
78
Improved Visual-Spatial Reasoning via R1-Zero-Like Training
Paper
•
2504.00883
•
Published
•
60
DreamActor-M1: Holistic, Expressive and Robust Human Image Animation
with Hybrid Guidance
Paper
•
2504.01724
•
Published
•
61
AnimeGamer: Infinite Anime Life Simulation with Next Game State
Prediction
Paper
•
2504.01014
•
Published
•
59
Towards Physically Plausible Video Generation via VLM Planning
Paper
•
2503.23368
•
Published
•
38
Understanding R1-Zero-Like Training: A Critical Perspective
Paper
•
2503.20783
•
Published
•
40
ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and
Diffusion Refinement
Paper
•
2504.01934
•
Published
•
22
Articulated Kinematics Distillation from Video Diffusion Models
Paper
•
2504.01204
•
Published
•
23
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to
Gaussian Noise in Perturbation-based Attacks
Paper
•
2504.01308
•
Published
•
13
DASH: Detection and Assessment of Systematic Hallucinations of VLMs
Paper
•
2503.23573
•
Published
•
12
Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal
Representations
Paper
•
2503.18817
•
Published
•
3
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual
Editing
Paper
•
2504.02826
•
Published
•
67
WikiVideo: Article Generation from Multiple Videos
Paper
•
2504.00939
•
Published
•
36
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image
Generation
Paper
•
2504.02782
•
Published
•
54
Inference-Time Scaling for Generalist Reward Modeling
Paper
•
2504.02495
•
Published
•
52
Rethinking RL Scaling for Vision Language Models: A Transparent,
From-Scratch Framework and Comprehensive Evaluation Scheme
Paper
•
2504.02587
•
Published
•
30
ShortV: Efficient Multimodal Large Language Models by Freezing Visual
Tokens in Ineffective Layers
Paper
•
2504.00502
•
Published
•
21
VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via
Iterative Instruction Tuning and Reinforcement Learning
Paper
•
2504.02949
•
Published
•
19
MME-Unify: A Comprehensive Benchmark for Unified Multimodal
Understanding and Generation Models
Paper
•
2504.03641
•
Published
•
13
Slow-Fast Architecture for Video Multi-Modal Large Language Models
Paper
•
2504.01328
•
Published
•
8
URECA: Unique Region Caption Anything
Paper
•
2504.05305
•
Published
•
33
Concept Lancet: Image Editing with Compositional Representation
Transplant
Paper
•
2504.02828
•
Published
•
16
LiveVQA: Live Visual Knowledge Seeking
Paper
•
2504.05288
•
Published
•
13
SmolVLM: Redefining small and efficient multimodal models
Paper
•
2504.05299
•
Published
•
160
Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning
(v1)
Paper
•
2504.03151
•
Published
•
12
Tuning-Free Image Editing with Fidelity and Editability via Unified
Latent Diffusion Model
Paper
•
2504.05594
•
Published
•
11
Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
Paper
•
2504.05599
•
Published
•
77
Rethinking Reflection in Pre-Training
Paper
•
2504.04022
•
Published
•
74
Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language
Models for Domain-Generalized Semantic Segmentation
Paper
•
2504.03193
•
Published
•
5
OmniSVG: A Unified Scalable Vector Graphics Generation Model
Paper
•
2504.06263
•
Published
•
141
V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric
Capabilities in Multimodal Large Language Models
Paper
•
2504.06148
•
Published
•
12
OmniCaptioner: One Captioner to Rule Them All
Paper
•
2504.07089
•
Published
•
17
Caption Anything in Video: Fine-grained Object-centric Captioning via
Spatiotemporal Multimodal Prompting
Paper
•
2504.05541
•
Published
•
14
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement
Fine-Tuning
Paper
•
2504.06958
•
Published
•
9
Paper
•
2504.07491
•
Published
•
110
DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning
Paper
•
2504.07128
•
Published
•
73
VCR-Bench: A Comprehensive Evaluation Framework for Video
Chain-of-Thought Reasoning
Paper
•
2504.07956
•
Published
•
43
VisualCloze: A Universal Image Generation Framework via Visual
In-Context Learning
Paper
•
2504.07960
•
Published
•
39
MM-IFEngine: Towards Multimodal Instruction Following
Paper
•
2504.07957
•
Published
•
30
Scaling Laws for Native Multimodal Models
Paper
•
2504.07951
•
Published
•
21
Towards Visual Text Grounding of Multimodal Large Language Model
Paper
•
2504.04974
•
Published
•
13
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
Paper
•
2504.08685
•
Published
•
108
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for
Autoregressive Image Generation
Paper
•
2504.08736
•
Published
•
40
MineWorld: a Real-Time and Open-Source Interactive World Model on
Minecraft
Paper
•
2504.08388
•
Published
•
37
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Paper
•
2504.07615
•
Published
•
20
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models
with Reinforcement Learning
Paper
•
2504.08837
•
Published
•
38
FUSION: Fully Integration of Vision-Language Representations for Deep
Cross-Modal Understanding
Paper
•
2504.09925
•
Published
•
36
Have we unified image generation and understanding yet? An empirical
study of GPT-4o's image generation ability
Paper
•
2504.08003
•
Published
•
42
InternVL3: Exploring Advanced Training and Test-Time Recipes for
Open-Source Multimodal Models
Paper
•
2504.10479
•
Published
•
205
Mavors: Multi-granularity Video Representation for Multimodal Large
Language Model
Paper
•
2504.10068
•
Published
•
28
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
Paper
•
2504.09641
•
Published
•
13
VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
Paper
•
2504.09130
•
Published
•
9
Reasoning Models Can Be Effective Without Thinking
Paper
•
2504.09858
•
Published
•
6
The Scalability of Simplicity: Empirical Analysis of Vision-Language
Learning with a Single Transformer
Paper
•
2504.10462
•
Published
•
12
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
Paper
•
2504.10465
•
Published
•
24