a new experimental model that unlocks stronger reasoning capabilities and shows its thoughts. The model plans (with thoughts visible), can solve complex problems with Flash speeds, and more.
We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.
🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.
The dataset is released under the permissive ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.
We will very soon announce a big community project, and are working on a blog post walking you through the entire dataset creation process. Stay tuned!