---
license: apache-2.0
tags:
- merge
base_model:
- CohereForAI/aya-23-8B
- google/siglip-base-patch16-256-multilingual
datasets:
- maya-multimodal/pretrain
- MBZUAI/palo_multilingual_dataset
language:
- en
- hi
- fr
- ru
- zh
- ar
- ja
- es
pipeline_tag: image-text-to-text
library_name: transformers
---
# Maya: A Multilingual Vision Language Model
Maya is an instruction-finetuned multilingual multimodal model that expands multimodal capabilities to eight languages with an emphasis on data quality and cultural sensitivity. Built on the LLaVA framework, Maya includes a newly created pre-training dataset designed to support multilingual and culturally aware VLM development.
## Model Description
- Developed by: Cohere For AI Community
- Model type: Multimodal Vision-Language Model
- Language(s): English, Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi
- License: Apache 2.0
- Related Paper: [Maya: An Instruction Finetuned Multilingual Multimodal Model](https://arxiv.org/abs/2412.07112)
## Model Details
Maya uses a lightweight architecture to provide a compact yet capable multimodal model, with several key features:
- Built on the LLaVA framework using the Aya-23 8B model
- Uses SigLIP for vision encoding with multilingual adaptability
- Supports 8 languages with strong cultural understanding
- Trained on a toxicity-filtered dataset for safer deployment
## Model Architecture
- Base Model: Aya-23 8B
- Vision Encoder: SigLIP (multilingual)
- Training Data: 558,000 images with multilingual annotations
- Context Length: 8K tokens
- Parameters: 8 billion (the composition is sketched below)
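The components above compose in the standard LLaVA pattern: image patches are encoded by SigLIP, projected into the language model's embedding space, and consumed by Aya-23 alongside the text tokens. The snippet below is a conceptual sketch of that wiring using plain `transformers` components, not the repository's actual implementation; in particular, the single-linear projector and the exact fusion step are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer, SiglipVisionModel

# Conceptual sketch only: SigLIP vision encoder -> projector -> Aya-23 decoder.
vision = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-256-multilingual")
lm = AutoModelForCausalLM.from_pretrained("CohereForAI/aya-23-8B", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/aya-23-8B")

# Projects vision features into the language model's embedding space
# (a single linear layer here is an assumption, not the repo's design).
projector = nn.Linear(vision.config.hidden_size, lm.config.hidden_size)

def forward(pixel_values, input_ids):
    # One feature vector per image patch
    image_feats = vision(pixel_values=pixel_values).last_hidden_state
    image_embeds = projector(image_feats).to(lm.dtype)
    text_embeds = lm.get_input_embeddings()(input_ids)
    # Prepend image embeddings to the text sequence and decode as usual
    inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
    return lm(inputs_embeds=inputs_embeds)
```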
## Intended Uses
Maya is designed for:
- Multilingual visual question answering
- Cross-cultural image understanding
- Image captioning in multiple languages
- Visual reasoning tasks
- Document understanding
## Usage
```bash
# Clone the GitHub repository
git clone https://github.com/nahidalam/maya

# Change the working directory
cd maya
```

Then, from the repository root, run:

```python
from llava.eval.talk2maya import run_vqa_model

# Define inputs
question = "Try to identify what plane this is, based on the design."
image_path = "./llava/eval/claude_plane_test_2.jpeg"

# Run the model
answer = run_vqa_model(
    question=question,
    image_file=image_path,
)
print(answer)
```
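Because Maya is instruction-tuned across eight languages, the same entry point can be queried in any supported language. The example below is illustrative only: the Hindi question is a placeholder of our own, while `run_vqa_model` and the image path come from the repository example above.

```python
from llava.eval.talk2maya import run_vqa_model

# Hypothetical multilingual query: ask about the same image in Hindi.
# "Which aircraft is shown in this picture?"
question_hi = "इस तस्वीर में कौन सा विमान दिखाया गया है?"
answer_hi = run_vqa_model(
    question=question_hi,
    image_file="./llava/eval/claude_plane_test_2.jpeg",
)
print(answer_hi)
```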
## Limitations
- Limited to 8 languages currently
- Requires high-quality images for optimal performance
- May not capture nuanced cultural contexts in all cases
- Performance varies across languages and tasks
## Bias, Risks, and Limitations
Maya has been developed with attention to bias mitigation and safety:
- Dataset filtered for toxic content
- Cultural sensitivity evaluations performed
- Regular bias assessments conducted
- Limited to high-quality, vetted training data
However, users should be aware that:
- The model may still exhibit biases present in its training data
- Performance may vary across different cultural contexts
- The model is not suitable for critical decision-making applications
## Training Details
Maya was trained using:
- 558,000 curated images
- Multilingual annotations in 8 languages
- Toxicity-filtered dataset
- 8× H100 GPUs with 80 GB of memory each
- Batch size of 32 (per device)
- Learning rate of 1e-3 with a cosine scheduler (see the configuration sketch below)
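For reference, the reported hyperparameters map naturally onto Hugging Face `TrainingArguments` fields. The sketch below is a hypothetical configuration, not the project's actual launch script; the output directory, epoch count, and bf16 setting are placeholders or assumptions.

```python
from transformers import TrainingArguments

# Hypothetical mapping of the reported hyperparameters onto TrainingArguments.
training_args = TrainingArguments(
    output_dir="./checkpoints/maya-pretrain",  # placeholder path
    per_device_train_batch_size=32,            # batch size of 32 per device
    learning_rate=1e-3,                        # reported learning rate
    lr_scheduler_type="cosine",                # cosine scheduler
    num_train_epochs=1,                        # placeholder, not a reported value
    bf16=True,                                 # typical on H100s; an assumption
)
```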
## Citation
```bibtex
@misc{alam2024mayainstructionfinetunedmultilingual,
      title={Maya: An Instruction Finetuned Multilingual Multimodal Model},
      author={Nahid Alam and Karthik Reddy Kanjula and Surya Guthikonda and Timothy Chung and Bala Krishna S Vegesna and Abhipsha Das and Anthony Susevski and Ryan Sze-Yin Chan and S M Iftekhar Uddin and Shayekh Bin Islam and Roshan Santhosh and Snegha A and Drishti Sharma and Chen Liu and Isha Chaturvedi and Genta Indra Winata and Ashvanth. S and Snehanshu Mukherjee and Alham Fikri Aji},
      year={2024},
      eprint={2412.07112},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.07112},
}
```
## Contact
For questions or feedback about Maya, please:
- Open an issue on the [GitHub repository](https://github.com/nahidalam/maya)
- Contact the maintainers at: [email protected], [email protected]