Jared Sulzdorf PRO
AI & ML interests
Recent Activity
Articles
Organizations
jsulz's activity
TL;DR:
- public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)
docs: https://huggingface.co./docs/hub/storage-limits
We optimize our infrastructure continuously to scale our storage for the coming years of growth in Machine learning, to the benefit of the community π₯
cc: @reach-vb @pierric @victor and the HF team
Global-MMLU is the result of months of work with the goal of advancing Multilingual LLM evaluation. It's been an amazing open science effort with collaborators from Cohere For AI, Mila - Quebec Artificial Intelligence Institute, EPFL, Massachusetts Institute of Technology, AI Singapore, National University of Singapore, KAIST, Instituto Superior TΓ©cnico, Carnegie Mellon University, CONICET, and University of Buenos Aires.
π·οΈ +200 contributors used Argilla MMLU questions where regional, dialect, or cultural knowledge was required to answer correctly. 85% of the questions required Western-centric knowledge!
Thanks to this annotation process, the open dataset contains two subsets:
1. π½ Culturally Agnostic: no specific regional, cultural knowledge is required.
2. βοΈ Culturally Sensitive: requires dialect, cultural knowledge or geographic knowledge to answer correctly.
Moreover, we provide high quality translations of 25 out of 42 languages, thanks again to the community and professional annotators leveraging Argilla on the Hub.
I hope this will ensure a better understanding of the limitations and challenges for making open AI useful for many languages.
Dataset: CohereForAI/Global-MMLU
Instructive to see which models serve as the "nodes" of the Hub!
Check it out: huggingface/open-source-ai-year-in-review-2024
I thought big and complex repos would be fun to visualize and they can be! This image is from blanchon/RESISC45, a repo with 31,000 images from Google Earth, each bucketed into one of 45 taxonomies with 700 images per taxonomy:
But more fun is when you find a repository that is structured (naming conventions and directories) in a way that lets you see the inequity in the bytes.
This is most apparent in NLP datasets that are multilingual, similar to the wikimedia/wikipedia dataset. If you zoom in on any of these (or run them yourself in the Space) you'll see a directory or file naming convention using the language abbreviation. Sections that near yellow for directories or files == more bytes devoted to that language.
Here's facebook/multilingual_librispeech:
and mozilla-foundation/common_voice_17_0:
and google/xtreme:
and unolp/CulturaX:
Each dataset shows some imbalance in the languages represented, and this pattern holds true for other types of datasets as well. However, such discrepancies can be harder to spot when folder or file naming conventions prioritize machine over human readability.
Another fun example is the nguha/legalbench dataset, designed to evaluate legal reasoning in LLMs. It provides a clear view of the types of reasoning being tested:
Although you might have to squint to see the labels. This is one where it might be best to head over to the Space https://huggingface.co./spaces/jsulz/repo-info and see it for yourself ;)
Datasets are among my favorite to visualize because of their mixture of files and folder structures. Here's the huggingface/documentation-images where alongside documentation images we store images for the Hugging Face blog:
I also enjoy the wikimedia/wikipedia dataset. It's fascinating to see the distribution of bytes across languages.
Some datasets are actually quite difficult to visualize because the number of points in the Plotly graph cause the browser to crash on render. It's quite possible you'll run into this if you use the Space. A simple check for file count could help, but for now I find myself running it a few times just to see if I can grab the image. allenai is home to many such datasets, but I eventually found allenai/paloma a eval dataset, that I could visualize
For some of these larger datasets, I might run things locally and write the image out to see if there are any interesting findings.
To help, I built a Space jsulz/repo-info that lets you search for any repo and get back:
- Treemap of the repository, color coded by file/directory size
- Repo branches and their size
- Cumulative size of different file types (e.g., the total size of all the safetensors in the repo)
And because I'm interested in how this will fit in our work to leverage content-defined chunking for versioning repos on the Hub
- https://huggingface.co./blog/from-files-to-chunks - everything has the number of chunks (1 chunk = 64KB) as well as the total size in bytes.
Some of the treemaps are pretty cool. Attached are black-forest-labs/FLUX.1-dev and for fun laion/laion-audio-preview (which has nearly 10k .tar files π€―)
They Said It Couldnβt Be Done
[bot] Conversion to Parquet
- There will be the first major public protest related to AI
- A big company will see its market cap divided by two or more because of AI
- At least 100,000 personal AI robots will be pre-ordered
- China will start to lead the AI race (as a consequence of leading the open-source AI race).
- There will be big breakthroughs in AI for biology and chemistry.
- We will begin to see the economic and employment growth potential of AI, with 15M AI builders on Hugging Face.
How my predictions for 2024 turned out:
- A hyped AI company will go bankrupt or get acquired for a ridiculously low price
β (Inflexion, AdeptAI,...)
- Open-source LLMs will reach the level of the best closed-source LLMs
β with QwQ and dozens of others
- Big breakthroughs in AI for video, time-series, biology and chemistry
β for video π΄for time-series, biology and chemistry
- We will talk much more about the cost (monetary and environmental) of AI
β Monetary π΄Environmental (π’)
- A popular media will be mostly AI-generated
β with NotebookLM by Google
- 10 millions AI builders on Hugging Face leading to no increase of unemployment
πcurrently 7M of AI builders on Hugging Face
The amazing, new Qwen/Qwen2.5-Coder-32B-Instruct model can now write SQL for any Hugging Face dataset β¨
It's 2025, you shouldn't be hand writing SQL! This is a big step in making it where anyone can do in depth analysis on a dataset. Let us know what you think π€