Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation (Paper, arXiv:2412.03304)
Russian speakers working on prompt translation as part of the Data is Better Together initiative, which builds impactful community datasets.
from datasets import load_dataset

# Load the training split of the preference dataset
ds = load_dataset("HuggingFaceH4/OpenHermesPreferences", split="train")
# Get the categories of the source dataset
sources = ds.unique("source")
# ['airoboros2.2', 'CamelAI', 'caseus_custom', ...]
# Filter for a subset
ds_filtered = ds.filter(lambda x: x["source"] in ["metamath", "EvolInstruct_70k"], num_proc=6)