---
license: apache-2.0
tags:
- merge
base_model:
- CohereForAI/aya-23-8B
- google/siglip-base-patch16-256-multilingual
datasets:
- maya-multimodal/pretrain
- MBZUAI/palo_multilingual_dataset
language:
- en
- hi
- fr
- ru
- zh
- ar
- ja
- es
pipeline_tag: image-text-to-text
library_name: transformers
---
# Maya: A Multilingual Vision Language Model
Maya is an instruction-finetuned multilingual multimodal model that expands multimodal capabilities to eight languages with an emphasis on data quality and cultural sensitivity. Built on the LLaVA framework, Maya includes a newly created pre-training dataset designed to support multilingual and culturally aware VLM development.
## Model Description
- Developed by: Cohere For AI Community
- Model type: Multimodal Vision-Language Model
- Language(s): English, Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi
- License: Apache 2.0
- Related Paper: [Maya: An Instruction Finetuned Multilingual Multimodal Model](https://arxiv.org/abs/2412.07112)
## Model Details
Maya uses a lightweight architecture to provide a compact yet capable multimodal model, with several key features:
- Built on the LLaVA framework using the Aya-23 8B model
- Uses SigLIP for vision encoding with multilingual adaptability
- Supports 8 languages with strong cultural understanding
- Trained on a toxicity-filtered dataset for safer deployment
## Model Architecture
- Base Model: Aya-23 8B
- Vision Encoder: SigLIP (multilingual)
- Training Data: 558,000 images with multilingual annotations
- Context Length: 8K tokens
- Parameters: 8 billion (the composition is sketched below)
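The components above compose in the standard LLaVA pattern: image patches are encoded by SigLIP, projected into the language model's embedding space, and consumed by Aya-23 alongside the text tokens. The snippet below is a conceptual sketch of that wiring using plain `transformers` components, not the repository's actual implementation; in particular, the single-linear projector and the exact fusion step are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer, SiglipVisionModel

# Conceptual sketch only: SigLIP vision encoder -> projector -> Aya-23 decoder.
vision = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-256-multilingual")
lm = AutoModelForCausalLM.from_pretrained("CohereForAI/aya-23-8B", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/aya-23-8B")

# Projects vision features into the language model's embedding space
# (a single linear layer here is an assumption, not the repo's design).
projector = nn.Linear(vision.config.hidden_size, lm.config.hidden_size)

def forward(pixel_values, input_ids):
    # One feature vector per image patch
    image_feats = vision(pixel_values=pixel_values).last_hidden_state
    image_embeds = projector(image_feats).to(lm.dtype)
    text_embeds = lm.get_input_embeddings()(input_ids)
    # Prepend image embeddings to the text sequence and decode as usual
    inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
    return lm(inputs_embeds=inputs_embeds)
```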
## Intended Uses
Maya is designed for:
- Multilingual visual question answering
- Cross-cultural image understanding
- Image captioning in multiple languages
- Visual reasoning tasks
- Document understanding
## Usage
```bash
# Clone the GitHub repository
git clone https://github.com/nahidalam/maya

# Change the working directory
cd maya
```

Then, from the repository root, run:

```python
from llava.eval.talk2maya import run_vqa_model

# Define inputs
question = "Try to identify what plane this is, based on the design."
image_path = "./llava/eval/claude_plane_test_2.jpeg"

# Run the model
answer = run_vqa_model(
    question=question,
    image_file=image_path,
)
print(answer)
```
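Because Maya is instruction-tuned across eight languages, the same entry point can be queried in any supported language. The example below is illustrative only: the Hindi question is a placeholder of our own, while `run_vqa_model` and the image path come from the repository example above.

```python
from llava.eval.talk2maya import run_vqa_model

# Hypothetical multilingual query: ask about the same image in Hindi.
# "Which aircraft is shown in this picture?"
question_hi = "इस तस्वीर में कौन सा विमान दिखाया गया है?"
answer_hi = run_vqa_model(
    question=question_hi,
    image_file="./llava/eval/claude_plane_test_2.jpeg",
)
print(answer_hi)
```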
## Limitations
- Limited to 8 languages currently
- Requires high-quality images for optimal performance
- May not capture nuanced cultural contexts in all cases
- Performance varies across languages and tasks
## Bias, Risks, and Limitations
Maya has been developed with attention to bias mitigation and safety:
- Dataset filtered for toxic content
- Cultural sensitivity evaluations performed
- Regular bias assessments conducted
- Limited to high-quality, vetted training data
However, users should be aware that:
- The model may still exhibit biases present in its training data
- Performance may vary across different cultural contexts
- The model is not suitable for critical decision-making applications
## Training Details
Maya was trained using:
- 558,000 curated images
- Multilingual annotations in 8 languages
- Toxicity-filtered dataset
- 8× H100 GPUs with 80 GB of memory each
- Batch size of 32 (per device)
- Learning rate of 1e-3 with a cosine scheduler (see the configuration sketch below)
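For reference, the reported hyperparameters map naturally onto Hugging Face `TrainingArguments` fields. The sketch below is a hypothetical configuration, not the project's actual launch script; the output directory, epoch count, and bf16 setting are placeholders or assumptions.

```python
from transformers import TrainingArguments

# Hypothetical mapping of the reported hyperparameters onto TrainingArguments.
training_args = TrainingArguments(
    output_dir="./checkpoints/maya-pretrain",  # placeholder path
    per_device_train_batch_size=32,            # batch size of 32 per device
    learning_rate=1e-3,                        # reported learning rate
    lr_scheduler_type="cosine",                # cosine scheduler
    num_train_epochs=1,                        # placeholder, not a reported value
    bf16=True,                                 # typical on H100s; an assumption
)
```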
## Citation
```bibtex
@misc{alam2024mayainstructionfinetunedmultilingual,
      title={Maya: An Instruction Finetuned Multilingual Multimodal Model},
      author={Nahid Alam and Karthik Reddy Kanjula and Surya Guthikonda and Timothy Chung and Bala Krishna S Vegesna and Abhipsha Das and Anthony Susevski and Ryan Sze-Yin Chan and S M Iftekhar Uddin and Shayekh Bin Islam and Roshan Santhosh and Snegha A and Drishti Sharma and Chen Liu and Isha Chaturvedi and Genta Indra Winata and Ashvanth. S and Snehanshu Mukherjee and Alham Fikri Aji},
      year={2024},
      eprint={2412.07112},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.07112},
}
```
## Contact
For questions or feedback about Maya, please:
- Open an issue on the [GitHub repository](https://github.com/nahidalam/maya)
- Contact the maintainers at: [email protected], [email protected]