---
license: apache-2.0
tags:
- merge
base_model:
- CohereForAI/aya-23-8B
- google/siglip-base-patch16-256-multilingual
datasets:
- maya-multimodal/pretrain
- MBZUAI/palo_multilingual_dataset
language:
- en
- hi
- fr
- ru
- zh
- ar
- ja
- es
pipeline_tag: image-text-to-text
library_name: transformers
---

# Maya: A Multilingual Vision Language Model

Maya is an instruction-finetuned multilingual multimodal model that expands multimodal capabilities to eight languages with an emphasis on data quality and cultural sensitivity. Built on the LLaVA framework, Maya includes a newly created pre-training dataset designed to support multilingual and culturally aware VLM development.

## Model Description

- **Developed by:** Cohere For AI Community
- **Model type:** Multimodal Vision-Language Model
- **Language(s):** English, Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi
- **License:** Apache 2.0
- **Related Paper:** [Maya: An Instruction Finetuned Multilingual Multimodal Model](https://arxiv.org/abs/2412.07112)

## Model Details

Maya uses a lightweight architecture to provide a compact yet powerful multimodal experience, with several key features:

- Built on the LLaVA framework using the Aya-23 8B language model
- Uses SigLIP for vision encoding with multilingual adaptability
- Supports 8 languages with strong cultural understanding
- Trained on a toxicity-filtered dataset for safer deployment

### Model Architecture

- **Base Model:** Aya-23 8B
- **Vision Encoder:** SigLIP (multilingual)
- **Training Data:** 558,000 images with multilingual annotations
- **Context Length:** 8K tokens
- **Parameters:** 8 billion

## Intended Uses

Maya is designed for:

- Multilingual visual question answering
- Cross-cultural image understanding
- Image captioning in multiple languages
- Visual reasoning tasks
- Document understanding

## Usage

```bash
# Clone the GitHub repository
git clone https://github.com/nahidalam/maya

# Change the working directory
cd maya
```

```python
# Run the following code
from llava.eval.talk2maya import run_vqa_model

# Define inputs
question = "Try to identify what plane this is, based on the design."
image_path = "./llava/eval/claude_plane_test_2.jpeg"

# Run model
answer = run_vqa_model(
    question=question,
    image_file=image_path
)
```
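Because Maya accepts prompts in any of its eight supported languages, the same entry point can be reused for multilingual questions. The snippet below is a minimal sketch that only assumes `run_vqa_model` behaves as in the example above; the Hindi question string is an illustration, not taken from the repository.

```python
# Minimal sketch: same documented entry point, with a Hindi prompt.
# Assumes run_vqa_model accepts free-form text in any supported language.
from llava.eval.talk2maya import run_vqa_model

# "What is shown in this picture? Describe it in one sentence." (Hindi)
question_hi = "इस तस्वीर में क्या दिखाया गया है? एक वाक्य में वर्णन करें।"
image_path = "./llava/eval/claude_plane_test_2.jpeg"

answer_hi = run_vqa_model(
    question=question_hi,
    image_file=image_path
)
print(answer_hi)
```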
## Limitations

- Limited to 8 languages currently
- Requires high-quality images for optimal performance
- May not capture nuanced cultural contexts in all cases
- Performance varies across languages and tasks

## Bias, Risks, and Limitations

Maya has been developed with attention to bias mitigation and safety:

- Dataset filtered for toxic content
- Cultural sensitivity evaluations performed
- Regular bias assessments conducted
- Limited to high-quality, vetted training data

However, users should be aware that:

- The model may still exhibit biases present in the training data
- Performance may vary across different cultural contexts
- Not suitable for critical decision-making applications

## Training Details

Maya was trained using:

- 558,000 curated images
- Multilingual annotations in 8 languages
- A toxicity-filtered dataset
- 8x H100 GPUs with 80 GB of memory each
- Batch size of 32 (per device)
- Learning rate of 1e-3 with a cosine scheduler

## Citation

```bibtex
@misc{alam2024mayainstructionfinetunedmultilingual,
      title={Maya: An Instruction Finetuned Multilingual Multimodal Model},
      author={Nahid Alam and Karthik Reddy Kanjula and Surya Guthikonda and Timothy Chung and Bala Krishna S Vegesna and Abhipsha Das and Anthony Susevski and Ryan Sze-Yin Chan and S M Iftekhar Uddin and Shayekh Bin Islam and Roshan Santhosh and Snegha A and Drishti Sharma and Chen Liu and Isha Chaturvedi and Genta Indra Winata and Ashvanth. S and Snehanshu Mukherjee and Alham Fikri Aji},
      year={2024},
      eprint={2412.07112},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.07112},
}
```

## Contact

For questions or feedback about Maya, please:

- Open an issue on our [GitHub repository](https://github.com/nahidalam/maya)
- Contact the maintainers at: nahid.m.alam@gmail.com, maya.c4ai@gmail.com