@ahmed-masry on Hugging Face: "Happy to announce AlignVLM 📏 – a novel approach to bridging vision and…"

Happy to announce AlignVLM 📏 – a novel approach to bridging vision and language latent spaces for multimodal understanding in Vision-Language Models (VLMs) 🌍📄🖼

🔗 Read the paper: AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding (2502.01341)

🧐 What’s the challenge?
Aligning visual features with language embeddings remains a major bottleneck in VLMs. Existing connectors such as multi-layer perceptrons (MLPs) often introduce noise that degrades performance. ❌

🎯 Our Solution: ALIGN Connector
We propose AlignVLM, a method that maps vision features into a weighted average of LLM text embeddings, ensuring they remain in a space that the LLM can effectively interpret. ✅
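
As a rough illustration of that idea, here is a minimal sketch in plain Python. It assumes the connector scores a vision feature against every vocabulary entry with a linear layer, applies a softmax, and uses the resulting probabilities to average the LLM's text embeddings. The names `align_connector` and `vocab_projection`, and the toy dimensions, are illustrative assumptions, not the released implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def align_connector(vision_feature, vocab_projection, text_embeddings):
    """Map one vision feature to a weighted average of text embeddings.

    vocab_projection: one weight row per vocabulary entry (assumed linear layer).
    text_embeddings:  one embedding vector per vocabulary entry.
    """
    # One logit per vocabulary entry.
    logits = [
        sum(w * x for w, x in zip(row, vision_feature))
        for row in vocab_projection
    ]
    # Non-negative weights that sum to 1.
    weights = softmax(logits)
    # Weighted average of the text embeddings, one coordinate at a time.
    dim = len(text_embeddings[0])
    output = [
        sum(p * emb[d] for p, emb in zip(weights, text_embeddings))
        for d in range(dim)
    ]
    return output, weights


# Toy setup (hypothetical values): 3 "tokens" with 2-d embeddings, 4-d vision feature.
text_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
vocab_projection = [
    [0.5, -0.2, 0.1, 0.0],
    [-0.3, 0.8, 0.0, 0.4],
    [0.1, 0.1, -0.6, 0.2],
]
vision_feature = [2.0, -1.0, 0.5, 3.0]

aligned, weights = align_connector(vision_feature, vocab_projection, text_embeddings)
```

Because the weights are non-negative and sum to 1, the output is a convex combination of the text embeddings, so it cannot leave the region those embeddings span, whatever the vision feature looks like.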

🔬 How does it perform?
We compared ALIGN against common connectors like MLPs, Perceiver Resampler, and Ovis trained under similar configurations. The results? ALIGN outperforms them all 🏆 on diverse document understanding tasks 📄.

📊 Meet the AlignVLM Model Family!
We trained Llama 3.2 (1B, 3B) and Llama 3.1 (8B) models using our connector and benchmarked them against various models. The results:
✅ AlignVLM surpasses all Base VLMs trained under similar configurations.
✅ Our models also perform competitively against Instruct VLMs such as Qwen2-VL and InternVL-2.5 🚀.

🤔 What about robustness to noise?
We injected Gaussian noise (μ=0, σ=3) into the vision encoder’s outputs before feeding them to the connector:
✅ ALIGN Connector: Minimal drop (↓1.67%) – proving its high robustness!
❌ MLP Connector: Severe degradation (↓25.54%) – struggling with noisy inputs.
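
For readers who want to try a similar perturbation, here is a minimal sketch of the noise-injection step in plain Python. It assumes the vision encoder output is a list of per-patch feature vectors. The function name `add_gaussian_noise` and the toy feature values are hypothetical, and this is not the evaluation code behind the numbers above.

```python
import random


def add_gaussian_noise(features, mean=0.0, std=3.0, seed=None):
    """Return a copy of `features` with i.i.d. Gaussian noise added to every value."""
    rng = random.Random(seed)  # local generator so results are reproducible
    return [[x + rng.gauss(mean, std) for x in row] for row in features]


# Hypothetical stand-in for vision encoder outputs: 4 patches x 3 dimensions.
clean = [[0.1 * (i + j) for j in range(3)] for i in range(4)]

# Same setting as in the post: mean 0, standard deviation 3.
noisy = add_gaussian_noise(clean, mean=0.0, std=3.0, seed=0)
```

The noisy features would then be passed to the connector in place of the clean ones, and task accuracy compared between the two runs.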

Code & model weights coming soon! Stay tuned! 🔥
