Downloading datasets
Integrated libraries
If a dataset on the Hub is tied to a supported library, you can load it in just a few lines. To see how, click the “Use this dataset” button on the dataset page. For example, the samsum dataset page shows how to load it with 🤗 Datasets.
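As a rough illustration of what such a snippet tends to look like, here is a minimal sketch using the `load_dataset` function from 🤗 Datasets. The wrapper name `load_hub_dataset` is hypothetical, and the exact call for a given dataset (extra configuration names, additional dependencies) is whatever its “Use this dataset” button shows, so treat this as an assumption rather than the canonical snippet.

```python
from typing import Optional


def load_hub_dataset(dataset_id: str, split: Optional[str] = None):
    """Load a dataset from the Hub by its ID (hypothetical convenience wrapper)."""
    # Imported lazily; requires `pip install datasets`.
    from datasets import load_dataset

    # With split=None, load_dataset returns every available split.
    return load_dataset(dataset_id, split=split)


# Example call (needs network access), using the dataset named in the text:
# dataset = load_hub_dataset("samsum")
```

Some datasets need extra packages or a configuration name, in which case the call on the dataset page will differ from this sketch.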
Using the Hugging Face Client Library
You can use the huggingface_hub library to create, delete, update, and retrieve information from repos. You can also download files from repos or integrate them into your library. For example, you can load a CSV dataset in a few lines using Pandas:
from huggingface_hub import hf_hub_download
import pandas as pd

REPO_ID = "YOUR_REPO_ID"
FILENAME = "data.csv"

dataset = pd.read_csv(
    hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
)
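When a dataset repo holds more than one CSV file, the same idea can be extended. The sketch below assumes `list_repo_files` from huggingface_hub to enumerate the repo contents, then downloads each CSV with `hf_hub_download` and concatenates the results. The helper names `select_csv_files` and `load_all_csvs` are hypothetical, and the sketch assumes all CSV files share the same columns.

```python
from typing import List


def select_csv_files(filenames: List[str]) -> List[str]:
    """Keep only paths ending in .csv (case-insensitive), sorted for a stable order."""
    return sorted(name for name in filenames if name.lower().endswith(".csv"))


def load_all_csvs(repo_id: str):
    """Download every CSV file in a dataset repo and return one DataFrame."""
    # Imported lazily so select_csv_files works without these packages installed.
    import pandas as pd
    from huggingface_hub import hf_hub_download, list_repo_files

    # List every file path in the dataset repo.
    files = list_repo_files(repo_id, repo_type="dataset")
    csv_files = select_csv_files(files)
    if not csv_files:
        raise ValueError(f"No CSV files found in dataset repo {repo_id!r}")

    # hf_hub_download returns the local cache path of each downloaded file.
    frames = [
        pd.read_csv(hf_hub_download(repo_id=repo_id, filename=name, repo_type="dataset"))
        for name in csv_files
    ]
    return pd.concat(frames, ignore_index=True)
```

Downloaded files are cached locally by huggingface_hub, so calling the helper again should not fetch unchanged files a second time.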
Using Git
Since all datasets on the Hub are Git repositories, you can clone the datasets locally by running:
git lfs install
git clone git@hf.co:datasets/<dataset ID> # example: git clone git@hf.co:datasets/allenai/c4
If you have write access to the dataset repo, you can also commit and push revisions to the dataset.
Add your SSH public key to your user settings to push changes and/or access private repos.
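The clone step can also be scripted. The following is a minimal sketch, assuming `git` and `git-lfs` are installed and on the PATH. The helper names `clone_url` and `clone_dataset` are hypothetical. The SSH form matches the command above and needs a registered SSH key, while the HTTPS form is an alternative that works for public repos without one.

```python
import subprocess


def clone_url(dataset_id: str, use_ssh: bool = True) -> str:
    """Build the Git remote URL for a dataset repo on the Hub."""
    if use_ssh:
        # SSH remote: requires an SSH public key in your user settings.
        return f"git@hf.co:datasets/{dataset_id}"
    # HTTPS remote: sufficient for cloning public repos.
    return f"https://huggingface.co/datasets/{dataset_id}"


def clone_dataset(dataset_id: str, target_dir: str, use_ssh: bool = True) -> None:
    """Run `git lfs install`, then clone the dataset repo into target_dir."""
    subprocess.run(["git", "lfs", "install"], check=True)
    subprocess.run(
        ["git", "clone", clone_url(dataset_id, use_ssh), target_dir],
        check=True,
    )


# Example call (needs network access and git-lfs):
# clone_dataset("allenai/c4", "c4", use_ssh=False)
```

Passing `check=True` makes a failed Git command raise an exception instead of being silently ignored.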