mlcroissant

mlcroissant is a library to load datasets from Croissant metadata.

💡 Learn more about how to get the metadata from the dataset viewer API in the Get Croissant metadata guide.

Let’s start by parsing the Croissant metadata for the tasksource/blog_authorship_corpus dataset. Be sure to first install mlcroissant[parquet] and GitPython to be able to load Parquet files over the git+https protocol.

from mlcroissant import Dataset
ds = Dataset(jsonld="https://huggingface.co/api/datasets/tasksource/blog_authorship_corpus/croissant")

To read from the first subset (called a RecordSet in Croissant’s vocabulary), use the records method, which returns an iterator of dicts.

records = ds.records("default")
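Before running a query, it can help to look at a couple of records to see which keys and value types come back. The following is a minimal sketch that runs offline: `fake_records` is a hypothetical stand-in for `ds.records("default")`, and its values are invented. Only the field names (`default/sign`, `default/text`) and the bytes values mirror what this guide shows.

```python
import itertools


def fake_records():
    """Hypothetical stand-in for ds.records("default"): yields one dict per row."""
    samples = [
        (b"Leo", b"First post"),
        (b"Aries", b"Second post, a bit longer"),
        (b"Leo", b"Third"),
    ]
    for sign, text in samples:
        yield {"default/sign": sign, "default/text": text}


records = fake_records()

# Take only the first two records instead of materializing everything.
first_two = list(itertools.islice(records, 2))

for record in first_two:
    # Show each field name together with the Python type of its value.
    print({key: type(value).__name__ for key, value in record.items()})
```

With the real dataset, the same `itertools.islice` call should limit how many rows are read, which is what the Pandas query in this guide relies on.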

Finally use Pandas to compute your query on the first 1,000 rows:

import itertools

import pandas as pd

df = (
    pd.DataFrame(list(itertools.islice(records, 1000)))
    .groupby("default/sign")["default/text"]
    .apply(lambda x: x.str.len().mean())
    .sort_values(ascending=False)
    .head(5)
)
print(df)
default/sign
b'Leo'          6463.500000
b'Capricorn'    2374.500000
b'Aquarius'     2303.757143
b'Gemini'       1420.333333
b'Aries'         918.666667
Name: default/text, dtype: float64
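
The sign labels in the output above are printed as bytes (for example b'Leo'). If plain strings are preferred, the columns can be decoded before grouping. This is a minimal sketch on invented rows, assuming the values are UTF-8 encoded bytes; it is not part of the mlcroissant API.

```python
import pandas as pd

# Invented rows that mimic the shape of the records used in this guide.
rows = [
    {"default/sign": b"Leo", "default/text": b"hello world"},
    {"default/sign": b"Leo", "default/text": b"hi"},
    {"default/sign": b"Aries", "default/text": b"croissant"},
]
df = pd.DataFrame(rows)

# Decode bytes to str (assumption: UTF-8); leave other values untouched.
for column in ["default/sign", "default/text"]:
    df[column] = df[column].map(
        lambda value: value.decode("utf-8") if isinstance(value, bytes) else value
    )

# Same query as in the guide: mean text length per sign, longest first.
mean_length = (
    df.groupby("default/sign")["default/text"]
    .apply(lambda x: x.str.len().mean())
    .sort_values(ascending=False)
)
print(mean_length)
```

After decoding, the index holds regular strings such as Leo rather than b'Leo', and the text length is counted in characters rather than bytes.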
