-
VoCo-LLaMA: Towards Vision Compression with Large Language Models
Paper • 2406.12275 • Published • 29 -
TroL: Traversal of Layers for Large Language and Vision Models
Paper • 2406.12246 • Published • 34 -
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
Paper • 2406.15334 • Published • 8 -
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
Paper • 2406.12742 • Published • 14
Collections
Discover the best community collections!
Collections including paper arxiv:2408.08872
-
VoCo-LLaMA: Towards Vision Compression with Large Language Models
Paper • 2406.12275 • Published • 29 -
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
Paper • 2405.15738 • Published • 43 -
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper • 2408.08872 • Published • 97
-
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Paper • 2405.15223 • Published • 12 -
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 53 -
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 85 -
Matryoshka Multimodal Models
Paper • 2405.17430 • Published • 31
-
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Paper • 2311.17049 • Published -
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Paper • 2405.04434 • Published • 13 -
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
Paper • 2303.17376 • Published -
Sigmoid Loss for Language Image Pre-Training
Paper • 2303.15343 • Published • 4
-
Non-Autoregressive Neural Machine Translation
Paper • 1711.02281 • Published • 1 -
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Paper • 2107.07651 • Published • 1 -
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Paper • 2404.07972 • Published • 46 -
AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning
Paper • 2402.15506 • Published • 13
-
The Curious Case of Neural Text Degeneration
Paper • 1904.09751 • Published • 3 -
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Paper • 2404.01197 • Published • 30 -
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Paper • 1905.10044 • Published • 1 -
PIQA: Reasoning about Physical Commonsense in Natural Language
Paper • 1911.11641 • Published • 2
-
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Paper • 2305.18290 • Published • 48 -
ICDPO: Effectively Borrowing Alignment Capability of Others via In-context Direct Preference Optimization
Paper • 2402.09320 • Published • 6 -
sDPO: Don't Use Your Data All at Once
Paper • 2403.19270 • Published • 39 -
Dueling RL: Reinforcement Learning with Trajectory Preferences
Paper • 2111.04850 • Published • 2
-
FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation
Paper • 2403.06775 • Published • 3 -
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Paper • 2010.11929 • Published • 6 -
Data Incubation -- Synthesizing Missing Data for Handwriting Recognition
Paper • 2110.07040 • Published • 2 -
A Mixture of Expert Approach for Low-Cost Customization of Deep Neural Networks
Paper • 1811.00056 • Published • 2
-
TinyLLaVA: A Framework of Small-scale Large Multimodal Models
Paper • 2402.14289 • Published • 19 -
ImageBind: One Embedding Space To Bind Them All
Paper • 2305.05665 • Published • 3 -
DocLLM: A layout-aware generative language model for multimodal document understanding
Paper • 2401.00908 • Published • 181 -
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Paper • 2206.02770 • Published • 3