ILMAAM: Index for Language Models For Arabic Assessment on Multitasks

Community Article Published October 6, 2024

The field of Natural Language Processing (NLP) has made huge strides in recent years, but despite these advancements, the Arabic language has often been left behind. That’s where ILMAAM steps in.

ILMAAM (Index for Language Models for Arabic Assessment on Multitasks) is a comprehensive leaderboard that evaluates Arabic language models across a variety of subjects, offering insights into how well these models handle multitask learning.

Why ILMAAM?

Arabic is one of the most widely spoken languages globally, yet there’s a significant gap in how well large language models are evaluated for Arabic-specific tasks. Most NLP models are developed and fine-tuned on English datasets, which limits their utility for Arabic speakers and researchers.

ILMAAM aims to close that gap by providing a standardized benchmark for assessing how Arabic large language models, as well as models that have seen at least some Arabic tokens during training, perform on a variety of tasks, including abstract algebra, clinical knowledge, high school-level subjects, and more. It is designed to offer a thorough evaluation process for pretrained and instruction-tuned models, with a specific focus on the unique linguistic challenges of Arabic.
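For readers who want to poke at the underlying data, here is a minimal sketch of loading an MMLU-style Arabic benchmark and listing its subject coverage with the Hugging Face `datasets` library. The dataset ID `openai/MMMLU`, its `AR_XY` (Arabic) config, and the `Subject` column are assumptions for illustration, not necessarily the exact source ILMAAM consumes.

```python
# Hedged sketch: inspect an MMLU-style Arabic benchmark with `datasets`.
# The dataset ID, config name, and column name below are assumptions.
from collections import Counter

from datasets import load_dataset

# Arabic subset of a multilingual MMLU translation (assumed ID and config).
arabic_mmlu = load_dataset("openai/MMMLU", "AR_XY", split="test")

# Count how many questions each subject contributes.
subject_counts = Counter(arabic_mmlu["Subject"])
for subject, count in sorted(subject_counts.items()):
    print(f"{subject}: {count} questions")
```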



The Structure of ILMAAM

The leaderboard currently showcases 29 high-performing models, including the recently added Llama, Jais, Cohere, and Qwen models.

These models span two key categories:

  • Pretrained Models: Trained on vast amounts of text data without specific instructions.
  • Instruction-tuned Models: Fine-tuned for specific tasks, helping them perform better in multitask scenarios.

The models on ILMAAM are evaluated on their performance across different subjects, each representing a critical area of study, from elementary mathematics to international law. This ensures that the models are versatile and can adapt to a range of real-world use cases.
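In practice, the main difference between the two categories at evaluation time is how a question is presented: a base model receives the raw multiple-choice prompt, while an instruction-tuned model usually gets the same question wrapped in its chat template. The sketch below illustrates this with `transformers`; the Qwen model ID is just an example from the leaderboard, not a prescribed choice, and this is not ILMAAM's exact harness.

```python
# Sketch of how prompting typically differs for base vs. instruction-tuned
# models. The model ID is an example; this is not ILMAAM's exact harness.
from transformers import AutoTokenizer

question = (
    "ما هو ناتج 2 + 2؟\n"
    "A) 3\nB) 4\nC) 5\nD) 6\n"
    "الإجابة:"
)

# Pretrained (base) model: the raw prompt is fed to the model as-is.
base_prompt = question

# Instruction-tuned model: the same question goes through the chat template.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
chat_prompt = tok.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)
print(chat_prompt)
```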


Cultural Alignment and Subject Selection

In ILMAAM, we are committed to ensuring that all evaluated models are aligned with cultural and ethical considerations, especially in the context of Arabic language and cultural norms. For this reason, while we evaluate models on 100 questions per subject across a wide range of topics, certain subjects have been excluded to ensure alignment with local cultural sensitivities. Subjects such as moral disputes, jurisprudence, and others have been left out of the evaluation to avoid cultural misalignment.
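To make that setup concrete, the sketch below shows one way to drop the excluded subjects and cap each remaining subject at 100 questions. It reuses the assumed `openai/MMMLU` schema from the earlier sketch, and the snake_case spellings of the excluded subject names are likewise assumptions.

```python
# Hedged sketch: filter out excluded subjects and sample 100 questions per
# remaining subject. Dataset ID, column names, and subject spellings assumed.
import random

from datasets import load_dataset

EXCLUDED_SUBJECTS = {"moral_disputes", "jurisprudence"}  # per the article
QUESTIONS_PER_SUBJECT = 100

data = load_dataset("openai/MMMLU", "AR_XY", split="test")
kept = data.filter(lambda row: row["Subject"] not in EXCLUDED_SUBJECTS)

# Group rows by subject, then take a fixed random sample from each group.
rng = random.Random(42)
by_subject = {}
for row in kept:
    by_subject.setdefault(row["Subject"], []).append(row)

eval_set = []
for subject, rows in by_subject.items():
    rng.shuffle(rows)
    eval_set.extend(rows[:QUESTIONS_PER_SUBJECT])

print(f"{len(by_subject)} subjects, {len(eval_set)} questions total")
```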

This decision stems from the need to develop benchmarks that not only test model performance but also respect the nuances of Arabic-speaking societies.

As ArabicNLP continues to grow, ensuring cultural relevance remains a core aspect of our evaluation process. ILMAAM sets a precedent in balancing technical assessment with culturally informed judgments, helping both the research community and organizations make informed decisions.


Performance Evaluation Highlights 📊


After conducting evaluations on the most popular Arabic LLMs, the results are in—and they reveal a highly competitive landscape!

📊 Performance Highlights:

🔹Top-performing models include Qwen 2.5-32B-Instruct, CohereForAI c4ai-command, and Google's Gemma series—all showing impressive results across a wide array of subjects.

🔹Qwen 2.5-32B-Instruct led the leaderboard with an average accuracy of 60.27%, excelling in high_school_government_and_politics with a stellar 77% accuracy and performing notably well in high_school_statistics at 70%.

🔹CohereForAI c4ai-command followed closely with 59.85%, standing out with an 86% accuracy in high_school_us_history.

🔹Google's Gemma 2-9b claimed third place, scoring 57.73%, with an outstanding 79% accuracy in high_school_statistics.

🔹Qwen 2.5-7B-Instruct put up a strong fight, achieving 55.57%, and performing well in high_school_government_and_politics (71%) and high_school_statistics (66%).

Key Insights:

🔸Qwen series models consistently performed well, especially in social sciences like government and politics.

🔸CohereForAI models demonstrated strong knowledge in history and social sciences, with impressive results in both US and European history.

🔸Google's Gemma models showed steady performance across all subjects, demonstrating a well-rounded knowledge base.

🔸Interestingly, the best-performing model on the Open-Arabic-LLM-Leaderboard—silma-ai/SILMA-9B-Instruct-v1.0—ranked fifth in this MMMLU evaluation, with an overall accuracy of 53.33%.

We've now expanded ILMAAM with 29 top-performing models, including the latest models from Meta's Llama series and Inception's Jais models. However, Qwen LLM from Alibaba Cloud continues to sit at the top of the leaderboard, showcasing its strong performance in the evolving world of Arabic NLP.


The Meaning Behind ILMAAM 🔍

The name ILMAAM holds deep significance in both its acronym and its Arabic roots. ILMAAM stands for the Index for Language Models for Arabic Assessment on Multitasks, but the word itself has a rich meaning in Arabic.

In Arabic, "إلمام" (ILMAAM) translates to "comprehensive knowledge" or "awareness." This reflects the mission of the leaderboard perfectly—it’s about gaining a complete understanding of how various language models perform across a wide spectrum of tasks, particularly within the Arabic context. The name symbolizes both the breadth and depth of evaluation, as ILMAAM aims to offer a holistic view of Arabic NLP model performance.

The goal of ILMAAM is to go beyond simple benchmarks and provide a comprehensive index that researchers, developers, and organizations can rely on to make informed choices about which models are best suited for their needs in multitask assessments. By carefully curating and evaluating models, ILMAAM stays true to its name, offering a platform where knowledge is comprehensive, well-rounded, and thorough—just like the meaning of ILMAAM itself.


How to Submit Your Model 🚀

Interested in seeing how your model performs on ILMAAM? Follow these simple steps to submit your model:

  1. Prepare Your Model: Ensure that your model is available on Hugging Face and that it supports Arabic language tasks.

  2. Submit via the leaderboard submission section:

    • Navigate to the submission page on ILMAAM’s Hugging Face repository.
    • Provide your model's name, precision, weight type, and other relevant metadata.
  3. Evaluation: Once submitted, your model will be automatically evaluated across 50 subjects, with 100 questions per subject. The results, including the model's overall accuracy, will be displayed on the ILMAAM leaderboard. (A rough way to sanity-check a model locally is sketched below, after the leaderboard link.)

  4. Review the Results: After evaluation, your model's results will appear on the leaderboard once I manually restart the Space, which I do roughly every 24 hours; open a pull request if you would like to see the results sooner.

👉 Leaderboard Link: https://huggingface.co./spaces/Omartificial-Intelligence-Space/Arabic-MMMLU-Leaderborad
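If you would like a rough idea of how your model might fare before submitting, the sketch below runs a simple zero-shot multiple-choice check by comparing the model's next-token logits for the letters A-D. It is only an approximation, not the leaderboard's evaluation harness, and it reuses the assumed dataset schema and example model ID from the earlier sketches.

```python
# Rough local sanity check before submitting (not the official harness).
# Dataset ID, column names, and the example model ID are assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # replace with your own model
device = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").to(device)
model.eval()

# Small slice for a quick smoke test; the full run covers far more questions.
data = load_dataset("openai/MMMLU", "AR_XY", split="test").select(range(100))
letter_ids = [tok.encode(letter, add_special_tokens=False)[0] for letter in "ABCD"]

correct = 0
for row in data:
    prompt = (
        f"{row['Question']}\n"
        f"A) {row['A']}\nB) {row['B']}\nC) {row['C']}\nD) {row['D']}\n"
        "الإجابة:"
    )
    inputs = tok(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token distribution
    pred = "ABCD"[logits[letter_ids].argmax().item()]
    if pred == row["Answer"]:  # "Answer" assumed to hold the gold letter
        correct += 1

print(f"accuracy on {len(data)} questions: {correct / len(data):.2%}")
```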


Citation

If you use this leaderboard or the MMMLU dataset in your research, please cite:

@misc{ILMAAM,
  author = {Nacar, Omer},
  title = {ILMAAM: Index for Language Models For Arabic Assessment on Multitasks},
  year = {2024},
  publisher = {Robotics and Internet-of-Things Lab, Prince Sultan University, Riyadh}
}