SILMA RAGQA V1.0: A Comprehensive Benchmark for Evaluating LLMs on RAG QA Use-Cases

Published December 18, 2024

SILMA RAGQA is a benchmark curated by silma.ai to assess the effectiveness of Arabic and English language models on extractive question answering tasks, with a specific emphasis on RAG applications.

The benchmark includes 17 bilingual datasets in Arabic and English, spanning various domains.


What capabilities does the benchmark test?

  • General Arabic and English QA capabilities
  • Ability to handle short and long contexts
  • Ability to provide short and long answers effectively
  • Ability to answer complex numerical questions
  • Ability to answer questions based on tabular data
  • Multi-hop question answering: ability to answer one question using pieces of data from multiple paragraphs
  • Negative Rejection: ability to recognize that the answer is not present in the provided context and to say so explicitly, e.g. "the answer can't be found in the provided context", rather than fabricating a response (see the sketch after this list)
  • Multi-domain: ability to answer questions based on texts from different domains, such as finance and medicine
  • Noise Robustness: ability to handle noisy and ambiguous contexts
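
To make the task format concrete, here is a hypothetical sketch of an answerable item versus a negative-rejection item. The field names are illustrative only, not the benchmark's actual schema:

```python
# Hypothetical extractive QA items (field names are illustrative,
# not the benchmark's actual schema).
answerable_item = {
    "context": "SILMA RAGQA spans 17 bilingual datasets in Arabic and English.",
    "question": "How many datasets does SILMA RAGQA include?",
    "answer": "17 bilingual datasets",
}

# Negative rejection: the answer is absent from the context, so the
# expected behavior is an explicit refusal rather than a guess.
rejection_item = {
    "context": "SILMA RAGQA spans 17 bilingual datasets in Arabic and English.",
    "question": "Who founded silma.ai?",
    "answer": "The answer can't be found in the provided context.",
}
```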

Data Sources

| Name | Lang | Size (Sampled) | Link | Paper |
|------|------|----------------|------|-------|
| xquad_r | en | 100 | https://huggingface.co./datasets/google-research-datasets/xquad_r/viewer/en | https://arxiv.org/pdf/2004.05484 |
| xquad_r | ar | 100 | https://huggingface.co./datasets/google-research-datasets/xquad_r/viewer/ar | https://arxiv.org/pdf/2004.05484 |
| rag_instruct_benchmark_tester | en | 100 | https://huggingface.co./datasets/llmware/rag_instruct_benchmark_tester | https://medium.com/@darrenoberst/how-accurate-is-rag-8f0706281fd9 |
| covidqa | en | 50 | https://huggingface.co./datasets/rungalileo/ragbench/viewer/covidqa/test | https://arxiv.org/abs/2407.11005 |
| covidqa | ar | 50 | translated from covidqa_en using Google Translate | https://arxiv.org/abs/2407.11005 |
| emanual | en | 50 | https://huggingface.co./datasets/rungalileo/ragbench/viewer/emanual/test | https://arxiv.org/abs/2407.11005 |
| emanual | ar | 50 | translated from emanual_en using Google Translate | https://arxiv.org/abs/2407.11005 |
| msmarco | en | 50 | https://huggingface.co./datasets/rungalileo/ragbench/viewer/msmarco/test | https://arxiv.org/abs/2407.11005 |
| msmarco | ar | 50 | translated from msmarco_en using Google Translate | https://arxiv.org/abs/2407.11005 |
| hotpotqa | en | 50 | https://huggingface.co./datasets/rungalileo/ragbench/viewer/hotpotqa/test | https://arxiv.org/abs/2407.11005 |
| expertqa | en | 50 | https://huggingface.co./datasets/rungalileo/ragbench/viewer/expertqa/test | https://arxiv.org/abs/2407.11005 |
| finqa | en | 50 | https://huggingface.co./datasets/rungalileo/ragbench/viewer/finqa/test | https://arxiv.org/abs/2407.11005 |
| finqa | ar | 50 | translated from finqa_en using Google Translate | https://arxiv.org/abs/2407.11005 |
| tatqa | en | 50 | https://huggingface.co./datasets/rungalileo/ragbench/viewer/tatqa/test | https://arxiv.org/abs/2407.11005 |
| tatqa | ar | 50 | translated from tatqa_en using Google Translate | https://arxiv.org/abs/2407.11005 |
| boolq | ar | 100 | https://huggingface.co./datasets/Hennara/boolq_ar | https://arxiv.org/pdf/1905.10044 |
| sciq | ar | 100 | https://huggingface.co./datasets/Hennara/sciq_ar | https://arxiv.org/pdf/1707.06209 |
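
To inspect the source data yourself, the public datasets above can be pulled from the Hugging Face Hub with the `datasets` library. A minimal sketch follows; the dataset ids are copied from the table, but the config and split names are assumptions inferred from the viewer URLs, so verify them against each dataset card:

```python
from datasets import load_dataset

# Two of the source datasets from the table above. Config and split
# names are assumptions based on the viewer URLs; verify on each card.
xquad_r_en = load_dataset("google-research-datasets/xquad_r", "en", split="validation")
ragbench_covidqa = load_dataset("rungalileo/ragbench", "covidqa", split="test")

print(xquad_r_en[0])        # one English XQuAD-R record
print(ragbench_covidqa[0])  # one RAGBench covidqa record
```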

SLM Evaluations


SILMA Kashif is a new model that will be released in early January 2025.

| Model Name | Benchmark Score |
|------------|-----------------|
| SILMA-9B-Instruct-v1.0 | 0.268 |
| Gemma-2-2b-it | 0.281 |
| Qwen2.5-3B-Instruct | 0.300 |
| Phi-3.5-mini-instruct | 0.301 |
| Gemma-2-9b-it | 0.304 |
| Phi-3-mini-128k-instruct | 0.306 |
| Llama-3.2-3B-Instruct | 0.318 |
| Qwen2.5-7B-Instruct | 0.321 |
| Llama-3.1-8B-Instruct | 0.328 |
| c4ai-command-r7b-12-2024 | 0.330 |
| SILMA-Kashif-2B-v0.1 | 0.357 |

How to evaluate your model?

Follow the steps on the benchmark page.
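
The official metric and runner are defined there. As a rough illustration of the generate-then-score loop, the sketch below prompts a model over the benchmark items and scores the outputs with ROUGE-L. The dataset id, field names, and the ROUGE-only scoring are assumptions for illustration, not the official protocol:

```python
from datasets import load_dataset
from transformers import pipeline
import evaluate

# Dataset id and field names are assumptions; use the values given on
# the benchmark page.
ds = load_dataset("silma-ai/silma-rag-qa-benchmark-v1.0", split="test")

generator = pipeline("text-generation", model="google/gemma-2-2b-it",
                     max_new_tokens=128)
rouge = evaluate.load("rouge")

predictions, references = [], []
for example in ds:
    prompt = (f"Context: {example['context']}\n"
              f"Question: {example['question']}\nAnswer:")
    full = generator(prompt)[0]["generated_text"]
    predictions.append(full[len(prompt):].strip())  # drop the echoed prompt
    references.append(example["answer"])

# ROUGE-L alone is illustrative; the official score may combine metrics.
print(rouge.compute(predictions=predictions, references=references)["rougeL"])
```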