Pablo Montalvo PRO

Molbap

Recent Activity

updated a model 15 days ago
Molbap/molmo-hf-7B-D
updated a model 26 days ago
Molbap/molmo-hf-72B
liked a model 27 days ago
yonigozlan/GOT-OCR-2.0-hf

Organizations

Hugging Face, Hugging Face Internal Testing Organization, Hugging Face Smol Cluster, adept-hf-collab, kotol, Pixel Parsing, Social Post Explorers, ibm-ai-platform, Tinkering, Dev Mode Explorers, Paris AI Running Club, yorg, gg-tt

Posts 1

🚀🚀 Exciting times for the document AI community!

We're thrilled to announce the release of some of the largest OCR datasets available to the public.
🔥 With over 26 million pages, 18 billion text tokens, and 6TB of data, these resources are a significant leap forward for document AI research.

Here's how to access these datasets quickly:

from datasets import load_dataset

pdfa_dataset = load_dataset('pixparse/pdfa-eng-wds', streaming=True)
IDL_dataset = load_dataset('pixparse/idl-wds', streaming=True)

This enables you to stream them directly, integrating seamlessly with your projects using the Hugging Face datasets library. On the hub, you can find them here:

pixparse/pdfa-eng-wds
pixparse/idl-wds
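Because a streamed dataset is an iterable and not an in-memory table, it helps to look at a few samples before wiring it into a pipeline. The snippet below is a minimal sketch of that, not an official recipe: `peek`, `summarize`, and `preview_hub_dataset` are hypothetical helper names introduced here for illustration, and the sketch assumes the `train` split (the same one the chug configuration in this post uses) and that each streamed sample is a dictionary.

```python
from itertools import islice


def peek(stream, n=3):
    """Return the first n items of any iterable without reading the rest."""
    return list(islice(stream, n))


def summarize(sample):
    """Map each field name in a sample dict to the type name of its value."""
    return {key: type(value).__name__ for key, value in sample.items()}


def preview_hub_dataset(repo_id, n=2):
    """Hypothetical helper: print the field layout of the first n streamed samples.

    Needs `pip install datasets` and network access, so the import is kept
    local and the function is not called at import time.
    """
    from datasets import load_dataset

    # Assumption: the repository exposes a 'train' split.
    stream = load_dataset(repo_id, split="train", streaming=True)
    for sample in peek(stream, n):
        print(summarize(sample))
```

Calling `preview_hub_dataset('pixparse/pdfa-eng-wds')` should fetch only the first shard's worth of data needed for two samples, which is a quick way to check field names before committing to a full training loop.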

For lean data loading, the new [chug](https://github.com/huggingface/chug) library offers a solution with PDF decoding:


import chug

task_cfg = chug.DataTaskDocReadCfg(
    page_sampling='all',
)
data_cfg = chug.DataCfg(
    source='pixparse/pdfa-eng-wds',
    split='train',
    batch_size=None,
    format='hfids',
    num_workers=0,
)
data_loader = chug.create_loader(
    data_cfg,
    task_cfg,
)
sample = next(iter(data_loader))



We owe a huge thank you to Peter Wyatt, Kate Tasker, Rachel Taketa, Ali Furkan Biten, Ruben Tito, and their colleagues for their contributions. Their work putting these datasets together has been invaluable. 🤗

Looking Ahead:

We're on a mission to enhance document AI capabilities, and these datasets are just the beginning. With your engagement and innovation, we're confident in the community's ability to develop robust OCR solutions. We encourage you to explore these datasets, experiment with the code, and contribute to the collective progress in document AI.

For detailed information on usage and licensing, please refer to the dataset cards on the Hugging Face hub.
