metadata
license: apache-2.0
tags:
  - merge
base_model:
  - CohereForAI/aya-23-8B
  - google/siglip-base-patch16-256-multilingual
datasets:
  - maya-multimodal/pretrain
  - MBZUAI/palo_multilingual_dataset
language:
  - en
  - hi
  - fr
  - ru
  - zh
  - ar
  - ja
  - es
pipeline_tag: image-text-to-text
library_name: transformers

Maya: A Multilingual Vision Language Model

Maya is an instruction-finetuned multilingual multimodal model that expands multimodal capabilities to eight languages with an emphasis on data quality and cultural sensitivity. Built on the LLaVA framework, Maya includes a newly created pre-training dataset designed to support multilingual and culturally aware VLM development.

Model Description

Model Details

Maya uses a lightweight architecture to provide a compact yet powerful multimodal model, with several key features:

  • Built on the LLaVA framework with Aya-23 8B as the language model
  • Uses SigLIP for vision encoding with multilingual adaptability
  • Supports 8 languages with strong cultural understanding
  • Trained on a toxicity-filtered dataset for safer deployment

Model Architecture

  • Base Model: Aya-23 8B
  • Vision Encoder: SigLIP (multilingual)
  • Training Data: 558,000 images with multilingual annotations
  • Context Length: 8K tokens
  • Parameters: 8 billion
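
The sketch below illustrates, in plain PyTorch, how a LLaVA-style model of this kind wires a vision encoder into a decoder-only language model: SigLIP patch features are mapped by a small projector into the text embedding space and prepended to the text tokens before decoding. This is a conceptual sketch only, not Maya's actual code; all module names and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class LlavaStyleSketch(nn.Module):
    def __init__(self, vision_dim=768, text_dim=4096, vocab_size=256000):
        super().__init__()
        # Two-layer MLP projector mapping vision features into the LLM embedding space
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )
        # Stand-ins for the Aya-23 decoder: token embeddings plus a language-model head
        self.text_embed = nn.Embedding(vocab_size, text_dim)
        self.lm_head = nn.Linear(text_dim, vocab_size)

    def forward(self, image_features, input_ids):
        # image_features: (batch, num_patches, vision_dim) from the vision encoder
        # input_ids:      (batch, num_text_tokens) tokenized prompt
        image_tokens = self.projector(image_features)
        text_tokens = self.text_embed(input_ids)
        # Visual tokens are prepended to the text sequence; the decoder then
        # predicts next tokens over the combined sequence
        hidden = torch.cat([image_tokens, text_tokens], dim=1)
        return self.lm_head(hidden)

# Example with dummy tensors
model = LlavaStyleSketch()
img_feats = torch.randn(1, 256, 768)        # e.g. SigLIP patch features
ids = torch.randint(0, 256000, (1, 16))     # dummy prompt token ids
logits = model(img_feats, ids)              # (1, 256 + 16, vocab_size)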

Intended Uses

Maya is designed for:

  • Multilingual visual question answering
  • Cross-cultural image understanding
  • Image captioning in multiple languages
  • Visual reasoning tasks
  • Document understanding

Usage

# Clone the GitHub repository
git clone https://github.com/nahidalam/maya

# Change the working directory
cd maya

# Then run the following Python code from inside the repository
from llava.eval.talk2maya import run_vqa_model

# Define inputs
question = "Try to identify what plane this is, based on the design."
image_path = "./llava/eval/claude_plane_test_2.jpeg"

# Run the model
answer = run_vqa_model(
    question=question,
    image_file=image_path
)
print(answer)
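
Because Maya is multilingual, the same entry point can be queried in any of the eight supported languages. The example below is hypothetical and assumes run_vqa_model accepts arbitrary prompt text, as in the call above; it reuses image_path from the previous snippet.

# Hypothetical multilingual query (Hindi)
question_hi = "यह किस प्रकार का विमान है?"  # "What kind of plane is this?"
answer_hi = run_vqa_model(
    question=question_hi,
    image_file=image_path
)
print(answer_hi)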

Limitations

  • Limited to 8 languages currently
  • Requires high-quality images for optimal performance
  • May not capture nuanced cultural contexts in all cases
  • Performance varies across languages and tasks

Bias, Risks, and Limitations

Maya has been developed with attention to bias mitigation and safety:

  • Dataset filtered for toxic content
  • Cultural sensitivity evaluations performed
  • Regular bias assessments conducted
  • Limited to high-quality, vetted training data

However, users should be aware that:

  • Model may still exhibit biases present in training data
  • Performance may vary across different cultural contexts
  • Not suitable for critical decision-making applications

Training Details

Maya was trained using:

  • 558,000 curated images
  • Multilingual annotations in 8 languages
  • Toxicity-filtered dataset
  • 8×H100 GPUs with 80 GB of memory each
  • Batch size of 32 (per device)
  • Learning rate of 1e-3 with a cosine scheduler
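
For illustration, the stated hyperparameters could be expressed with transformers.TrainingArguments as below. This is a sketch, not Maya's actual training configuration (the training follows LLaVA-style scripts); paths and any values marked as assumptions are not stated in this card.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints/maya-pretrain",  # hypothetical output path
    per_device_train_batch_size=32,            # batch size of 32 per device
    learning_rate=1e-3,                        # learning rate from this card
    lr_scheduler_type="cosine",                # cosine scheduler
    bf16=True,                                 # assumption: bf16 mixed precision on H100s
    num_train_epochs=1,                        # assumption: not stated above
)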

Citation

@misc{alam2024mayainstructionfinetunedmultilingual,
      title={Maya: An Instruction Finetuned Multilingual Multimodal Model}, 
      author={Nahid Alam and Karthik Reddy Kanjula and Surya Guthikonda and Timothy Chung and Bala Krishna S Vegesna and Abhipsha Das and Anthony Susevski and Ryan Sze-Yin Chan and S M Iftekhar Uddin and Shayekh Bin Islam and Roshan Santhosh and Snegha A and Drishti Sharma and Chen Liu and Isha Chaturvedi and Genta Indra Winata and Ashvanth. S and Snehanshu Mukherjee and Alham Fikri Aji},
      year={2024},
      eprint={2412.07112},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.07112}, 
}

Contact

For questions or feedback about Maya, please: