Fineweb-Edu-Ar: Machine-translated Corpus to Support Arabic Small Language Models
Abstract
As large language models (LLMs) grow and develop, so do their data demands. This is especially true for multilingual LLMs, where the scarcity of high-quality, readily available online data has motivated a multitude of synthetic dataset generation approaches. A key technique in this space is machine translation (MT), where high-quality English text is adapted to a target, comparatively low-resource language. This report introduces FineWeb-Edu-Ar, a machine-translated version of the widely used (deduplicated) FineWeb-Edu dataset from HuggingFace. To the best of our knowledge, FineWeb-Edu-Ar is the largest publicly available machine-translated Arabic dataset, at 202B tokens as measured by an Arabic-trained tokenizer.
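For reference, a minimal sketch of loading the released dataset with the HuggingFace `datasets` library; the repository id and field names below are placeholders, so check the dataset page on the Hub for the actual path and schema:

```python
from datasets import load_dataset

# Placeholder repository id -- substitute the actual Hub path for FineWeb-Edu-Ar.
ds = load_dataset("org/FineWeb-Edu-Ar", split="train", streaming=True)

# Streaming avoids downloading the full 202B-token corpus up front;
# each record carries the translated text (field names are assumptions).
for example in ds.take(1):
    print(example)
```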
Community
Hey everyone! I'm excited to share our work on FineWeb-Edu-Ar, a large-scale Arabic dataset we created by machine-translating the FineWeb-Edu corpus using NLLB-200. With 202 billion tokens of paired Arabic-English text, we hope this dataset will help advance research in Arabic small language models, particularly for researchers working with limited computational resources. We've included both the original English and the translated Arabic versions to facilitate future research in cross-lingual learning. I'm particularly interested in hearing the community's thoughts on using machine translation to create training data, and on potential applications of paired multilingual datasets in smaller models, especially for Arabic. Do you have any suggestions for what could have been improved? Looking forward to discussing this with you all :)
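For context, here is a minimal sketch of the kind of per-document NLLB-200 translation call this pipeline implies, using the HuggingFace `transformers` interface. The 600M distilled checkpoint, input text, and decoding settings are assumptions; the report's actual checkpoint and batching setup may differ.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed checkpoint: the distilled 600M NLLB-200 model; the paper's exact
# variant and generation settings may differ.
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "Photosynthesis converts sunlight, water, and carbon dioxide into glucose."
inputs = tokenizer(text, return_tensors="pt")

# Force Modern Standard Arabic (arb_Arab) as the target language by setting
# its language code as the forced beginning-of-sequence token.
output_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("arb_Arab"),
    max_length=512,
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```

In practice, a corpus-scale run would batch documents and chunk long inputs to fit the model's context window, but the core translation call looks like the above.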