Collections
Discover the best community collections!
Collections including paper arxiv:2406.08407
-
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Paper • 2405.15223 • Published • 12 -
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 53 -
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 85 -
Matryoshka Multimodal Models
Paper • 2405.17430 • Published • 30
-
BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper • 2404.12390 • Published • 24 -
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
Paper • 2404.16790 • Published • 7 -
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
Paper • 2405.07990 • Published • 16 -
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
Paper • 2406.09411 • Published • 18
-
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
Paper • 2404.16790 • Published • 7 -
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Paper • 2406.08407 • Published • 24 -
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI
Paper • 2408.03361 • Published • 85
-
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Paper • 2404.07973 • Published • 30 -
RegionGPT: Towards Region Understanding Vision Language Model
Paper • 2403.02330 • Published • 2 -
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Paper • 2404.12803 • Published • 29 -
Pegasus-v1 Technical Report
Paper • 2404.14687 • Published • 30
-
Can large language models explore in-context?
Paper • 2403.15371 • Published • 32 -
GaussianCube: Structuring Gaussian Splatting using Optimal Transport for 3D Generative Modeling
Paper • 2403.19655 • Published • 18 -
WavLLM: Towards Robust and Adaptive Speech Large Language Model
Paper • 2404.00656 • Published • 10 -
Enabling Memory Safety of C Programs using LLMs
Paper • 2404.01096 • Published • 1
-
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding
Paper • 2403.09626 • Published • 13 -
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Paper • 2403.10517 • Published • 31 -
VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis
Paper • 2403.13501 • Published • 9 -
LITA: Language Instructed Temporal-Localization Assistant
Paper • 2403.19046 • Published • 18