fastText classifier used for filtering in DataComp-LM to produce DCLM-Baseline.

The model distinguishes between two labels, `__label__hq` and `__label__cc`, which correspond to "high-quality" (i.e., OH2.5 and Reddit ELI5 data) and "low-quality" (i.e., web-crawled data from Common Crawl) respectively. We use the score assigned to `__label__hq` to filter our documents via a percentile-based threshold: documents are ranked by that score and only those above the chosen percentile are kept.
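
To make the percentile-based thresholding concrete, here is a minimal sketch in plain Python. It assumes the `__label__hq` scores have already been computed for each document. The function names and the `keep_fraction` parameter are hypothetical, chosen for illustration; this is not the code from the dclm repo, and the fraction used in the example is a placeholder rather than the value used for DCLM-Baseline.

```python
import math


def hq_threshold(scores, keep_fraction):
    """Return the score cutoff that keeps roughly the top `keep_fraction` of scores.

    `keep_fraction` is an illustrative parameter (0 < keep_fraction <= 1).
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not scores:
        raise ValueError("scores must be non-empty")
    ranked = sorted(scores, reverse=True)
    # Number of documents to keep, always at least one.
    n_keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[n_keep - 1]


def filter_by_hq_score(docs, scores, keep_fraction):
    """Keep the documents whose score is at or above the percentile cutoff."""
    cutoff = hq_threshold(scores, keep_fraction)
    return [doc for doc, score in zip(docs, scores) if score >= cutoff]


if __name__ == "__main__":
    docs = ["a", "b", "c", "d"]
    scores = [0.9, 0.1, 0.5, 0.7]  # made-up __label__hq scores
    print(filter_by_hq_score(docs, scores, keep_fraction=0.5))
```

Because the cutoff is derived from the score distribution itself, the same keep fraction retains a comparable share of documents regardless of how the raw scores are calibrated. Ties at the cutoff are all kept in this sketch, so slightly more than the requested fraction may survive.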

See our dclm repo for documentation on how we applied it to filter data in our experiments.

See the fastText documentation for general information on fastText classifiers and how to use them from Python.
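
As a rough illustration of scoring a document from Python, the sketch below uses the `fasttext` package's `load_model` and `predict` calls. The helper names (`hq_score`, `score_document`, `load_classifier`) are hypothetical, and the model path is left to the caller because the file name is not stated here. It is a sketch under those assumptions, not the pipeline used in our experiments.

```python
def hq_score(labels, probs, target="__label__hq"):
    """Pick the probability of `target` out of a fastText (labels, probs) prediction."""
    for label, prob in zip(labels, probs):
        if label == target:
            return float(prob)
    # The target label was not among the returned labels.
    return 0.0


def score_document(model, text):
    """Return the __label__hq probability for one document."""
    # fastText's predict() rejects input containing newlines,
    # so collapse all whitespace into single spaces first.
    flat = " ".join(text.split())
    # k=2 asks for both labels so __label__hq is present regardless of rank.
    labels, probs = model.predict(flat, k=2)
    return hq_score(labels, probs)


def load_classifier(model_path):
    """Load the classifier; requires the third-party `fasttext` package."""
    import fasttext  # imported lazily so the helpers above work without it

    return fasttext.load_model(model_path)
```

With the package installed and the model file downloaded, `score_document(load_classifier(path), text)` would give the score for a single document, where `path` is wherever the downloaded model file was saved.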
