Dataset viewer documentation

Overview

The dataset viewer automatically converts public datasets smaller than 5GB on the Hub to Parquet and publishes the resulting files. If a dataset is already in Parquet format, it is published as is. Parquet is a columnar format, so a query that needs only a few columns can skip the rest of the data, which makes it well suited to large datasets.

For private datasets, this feature is available if the repository is owned by a PRO user or an Enterprise Hub organization.

You can work with the published Parquet files using several libraries and tools:

  • ClickHouse, a column-oriented database management system for online analytical processing
  • cuDF, a Python GPU DataFrame library
  • DuckDB, a high-performance SQL database for analytical queries
  • Pandas, a data analysis tool for working with data structures
  • Polars, a Rust-based DataFrame library
  • PostgreSQL via pgai, a powerful, open-source object-relational database system
  • mlcroissant, a library for loading datasets from Croissant metadata
  • pyspark, the Python API for Apache Spark
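
Before loading the files with any of the tools above, you need their URLs. The following is a minimal sketch using only the Python standard library. It rests on assumptions not stated on this page: that the dataset viewer API is served at `https://datasets-server.huggingface.co`, that it exposes a `/parquet` endpoint taking a `dataset` query parameter, and that the JSON response contains a `parquet_files` list whose entries each have a `url` key. The function names are illustrative, not part of any library.

```python
import json
from typing import List, Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: public base URL of the dataset viewer API (not stated on this page).
API_BASE = "https://datasets-server.huggingface.co"


def parquet_api_url(dataset: str) -> str:
    """Build the URL that lists the Parquet files published for a dataset."""
    # urlencode percent-encodes the "/" in IDs such as "user/dataset".
    return f"{API_BASE}/parquet?{urlencode({'dataset': dataset})}"


def extract_urls(payload: dict) -> List[str]:
    """Collect file URLs from a parsed API response.

    Assumption: the response holds a "parquet_files" list whose entries
    each carry a "url" key. A missing list yields an empty result.
    """
    return [entry["url"] for entry in payload.get("parquet_files", [])]


def list_parquet_urls(dataset: str, token: Optional[str] = None) -> List[str]:
    """Fetch the Parquet file URLs for a dataset (performs a network request).

    A token is only needed for private datasets; it is sent as a Bearer header.
    """
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    request = Request(parquet_api_url(dataset), headers=headers)
    with urlopen(request, timeout=30) as response:
        return extract_urls(json.load(response))
```

Calling `list_parquet_urls("user/dataset")` (a placeholder dataset ID) should return one URL per published Parquet file, and each URL can then be handed to a reader such as `pandas.read_parquet`. Keeping the URL builder and the response parser separate from the network call makes both easy to check offline.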