Overview

The dataset viewer automatically converts public datasets smaller than 5GB on the Hub to Parquet and publishes the resulting files. If a dataset is already in Parquet format, it is published as is. Parquet is a columnar format, so a query that needs only a few columns can skip the rest of the file, which makes it well suited to large datasets.

For private datasets, the feature is provided if the repository is owned by a PRO user or an Enterprise Hub organization.

You can work with the published Parquet files using several libraries and tools:

ClickHouse, a column-oriented database management system for online analytical processing
cuDF, a Python GPU DataFrame library
DuckDB, a high-performance SQL database for analytical queries
Pandas, a data analysis tool for working with data structures
Polars, a Rust-based DataFrame library
PostgreSQL via pgai, a powerful, open-source object-relational database system
mlcroissant, a library for loading datasets from Croissant metadata
pyspark, the Python API for Apache Spark
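
Whichever tool from the list above you choose, the first step is usually to find the URLs of the published Parquet files. The following is a minimal sketch using only the Python standard library. It assumes that the dataset viewer API is served at https://datasets-server.huggingface.co, that it exposes a /parquet endpoint taking a dataset query parameter, and that the JSON response contains a parquet_files list whose entries have a url field. None of these details are stated on this page, so check them against the API reference before relying on them. The dataset ID username/my_dataset and both function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base URL of the dataset viewer API (not given in the text above).
API_BASE = "https://datasets-server.huggingface.co"


def parquet_listing_url(dataset: str) -> str:
    """Build the URL that should list the Parquet files of one dataset."""
    # urlencode percent-encodes the "/" in "owner/name" dataset IDs.
    return f"{API_BASE}/parquet?{urlencode({'dataset': dataset})}"


def list_parquet_urls(dataset: str, timeout: float = 30.0) -> list[str]:
    """Fetch the listing and return the download URL of each Parquet file."""
    with urlopen(parquet_listing_url(dataset), timeout=timeout) as response:
        payload = json.load(response)
    # Assumption: the response holds a "parquet_files" list of objects
    # that each carry a "url" field.
    return [entry["url"] for entry in payload.get("parquet_files", [])]


# Example call (requires network access; the dataset ID is hypothetical):
# urls = list_parquet_urls("username/my_dataset")
```

The returned URLs can then be passed to a Parquet reader in the library of your choice. For a private dataset the request would also need an authentication header, which this sketch leaves out.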
