File names and splits

updated Nov 22, 2023

These 8 datasets showcase the variety of split configurations supported on Hugging Face. See the docs: https://huggingface.co/docs/hub/datasets-file-names-and-splits.

datasets-examples/doc-splits-1

Viewer • Updated Nov 22, 2023 • 4 • 47

Note Basic use-case. If your dataset isn’t split into train/validation/test splits, the simplest dataset structure is to have one file: data.csv
datasets-examples/doc-splits-2

Viewer • Updated Nov 22, 2023 • 11 • 35

Note You can name your data files after the train, test, and validation splits.
datasets-examples/doc-splits-3

Viewer • Updated Nov 22, 2023 • 11 • 41

Note If you don’t have any non-traditional splits, then you can place the split name anywhere in the data file name. The only rule is that the split name must be delimited by non-word characters: test-file.csv works, for example, but testfile.csv does not. Supported delimiters include underscores, dashes, spaces, dots, and numbers.
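The delimiter rule above can be approximated with a small regular expression. This is only a sketch: the helper name has_delimited_split is hypothetical, and it assumes "delimited" means the keyword is bounded on each side by the start or end of the file name or by one of the listed delimiters. It is not the Hub's actual matching code.

```python
import re

# Assumption: the delimiters are exactly those listed in the note
# (underscore, dash, whitespace, dot, digit).
DELIMS = r"[_\-\s.0-9]"


def has_delimited_split(filename: str, split: str) -> bool:
    """Return True if `split` appears in the file name as a delimited keyword."""
    stem = filename.rsplit(".", 1)[0]  # drop the extension
    pattern = rf"(?:^|{DELIMS}){re.escape(split)}(?:$|{DELIMS})"
    return re.search(pattern, stem) is not None


print(has_delimited_split("test-file.csv", "test"))  # True
print(has_delimited_split("testfile.csv", "test"))   # False
```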
datasets-examples/doc-splits-4

Viewer • Updated Nov 22, 2023 • 11 • 35

Note You can place your data files into different directories named train, test, and validation where each directory contains the data files for that split.
datasets-examples/doc-splits-5

Viewer • Updated Nov 22, 2023 • 11 • 34

Note There are several ways to refer to train/validation/test splits. Validation splits are sometimes called “dev”, and test splits may be referred to as “eval”. These other split names are also supported, and the following keywords are equivalent:
- train, training
- validation, valid, val, dev
- test, testing, eval, evaluation
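The keyword equivalence can be written down as a lookup table. The sketch below is illustrative only: the names SPLIT_ALIASES and canonical_split are hypothetical, and the table contains just the keywords listed in the note.

```python
from typing import Optional

# Keywords copied from the note; each group maps to one canonical split.
SPLIT_ALIASES = {
    "train": ["train", "training"],
    "validation": ["validation", "valid", "val", "dev"],
    "test": ["test", "testing", "eval", "evaluation"],
}

# Invert the table so that any keyword can be looked up directly.
KEYWORD_TO_SPLIT = {
    keyword: split
    for split, keywords in SPLIT_ALIASES.items()
    for keyword in keywords
}


def canonical_split(keyword: str) -> Optional[str]:
    """Return the canonical split for a keyword, or None if it is not listed."""
    return KEYWORD_TO_SPLIT.get(keyword)


print(canonical_split("dev"))   # validation
print(canonical_split("eval"))  # test
```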
datasets-examples/doc-splits-6

Viewer • Updated Nov 22, 2023 • 8 • 37

Note Splits can span several files. Make sure all the files of your train set have train in their names (same for test and validation). You can even add a prefix or suffix to train in the file name (like my_train_file_00001.csv for example).
datasets-examples/doc-splits-7

Viewer • Updated Nov 22, 2023 • 8 • 32

Note For convenience, you can also place your data files into different directories. In this case, the split name is inferred from the directory name.
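As a rough illustration of directory-based inference, the sketch below picks the split from the first parent directory named train, test, or validation. The function name and the example paths are hypothetical, and the real resolution logic may differ (for instance in how it treats nested directories or alias keywords).

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumption: only these three directory names are recognised here.
SPLIT_DIRS = {"train", "test", "validation"}


def split_from_directory(path: str) -> Optional[str]:
    """Return the split named by one of the file's parent directories, if any."""
    # parts[:-1] skips the file name itself and keeps only the directories.
    for part in PurePosixPath(path).parts[:-1]:
        if part in SPLIT_DIRS:
            return part
    return None


print(split_from_directory("train/part_a.csv"))       # train
print(split_from_directory("data/test/part_b.csv"))   # test
print(split_from_directory("misc/part_c.csv"))        # None
```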
datasets-examples/doc-splits-8

Viewer • Updated Nov 22, 2023 • 11 • 34

Note If your dataset splits have custom names that aren’t train, test, or validation, then you can name your data files like data/<split_name>-xxxxx-of-xxxxx.csv.
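A sharded file name of this kind can be parsed with a regular expression. The sketch below rests on several assumptions: the custom split name sits between data/ and the first shard number, the split name consists of word characters, and each x stands for one digit (five per number). The split name holdout and the helper parse_shard are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed layout: data/<split_name>-<5 digits>-of-<5 digits>.csv
SHARD_RE = re.compile(
    r"^data/(?P<split>\w+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.csv$"
)


def parse_shard(path: str) -> Optional[Tuple[str, int, int]]:
    """Return (split name, shard index, shard count), or None if no match."""
    match = SHARD_RE.match(path)
    if match is None:
        return None
    return match["split"], int(match["index"]), int(match["total"])


print(parse_shard("data/holdout-00000-of-00002.csv"))  # ('holdout', 0, 2)
print(parse_shard("data/holdout.csv"))                 # None
```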
