Article Releasing the largest multilingual open pretraining dataset By Pclanglais • Nov 13 • 98
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training Paper • 2409.04599 • Published Sep 6 • 1
Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models Paper • 2311.09194 • Published Nov 15, 2023
Toxicity of the Commons: Curating Open-Source Pre-Training Data Paper • 2410.22587 • Published Oct 29 • 10
Toxic Commons Collection • Tools for de-toxifying public domain data, especially multilingual and historical text data and data with OCR errors. • 3 items • Updated Oct 31 • 5