
Jared Sulzdorf PRO

jsulz

AI & ML interests

NLP + (Law|Medicine) & Ethics

Organizations

Hugging Face · Spaces Examples · Blog-explorers · Journalists on Hugging Face · Hugging Face Discord Community · Xet Team · open/acc

jsulz's activity

reacted to julien-c's post with πŸ€—β€οΈπŸ”₯ 15 days ago
After some heated discussion πŸ”₯, we clarify our intent re. storage limits on the Hub

TL;DR:
- public storage is free and (barring blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)

docs: https://huggingface.co./docs/hub/storage-limits

We optimize our infrastructure continuously to scale our storage for the coming years of growth in machine learning, to the benefit of the community πŸ”₯

cc: @reach-vb @pierric @victor and the HF team
reacted to dvilasuero's post with πŸ”₯❀️ 19 days ago
🌐 Announcing Global-MMLU: an improved MMLU Open dataset with evaluation coverage across 42 languages, built with Argilla and the Hugging Face community.

Global-MMLU is the result of months of work with the goal of advancing Multilingual LLM evaluation. It's been an amazing open science effort with collaborators from Cohere For AI, Mila - Quebec Artificial Intelligence Institute, EPFL, Massachusetts Institute of Technology, AI Singapore, National University of Singapore, KAIST, Instituto Superior TΓ©cnico, Carnegie Mellon University, CONICET, and University of Buenos Aires.

🏷️ More than 200 contributors used Argilla to label MMLU questions where regional, dialect, or cultural knowledge was required to answer correctly; 85% of the questions required Western-centric knowledge!

Thanks to this annotation process, the open dataset contains two subsets:

1. πŸ—½ Culturally Agnostic: no specific regional or cultural knowledge is required.
2. βš–οΈ Culturally Sensitive: requires dialect, cultural, or geographic knowledge to answer correctly.

Moreover, we provide high-quality translations for 25 out of 42 languages, thanks again to the community and professional annotators leveraging Argilla on the Hub.

I hope this will ensure a better understanding of the limitations and challenges for making open AI useful for many languages.

Dataset: CohereForAI/Global-MMLU
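As a minimal sketch of working with the two subsets described above (the field name `cultural_sensitivity_label` and the `"CA"`/`"CS"` values are assumptions for illustration; check the dataset card for the real schema):

```python
# Sketch: partition Global-MMLU-style rows into the two subsets described
# in the post. The column name "cultural_sensitivity_label" is assumed,
# not taken from the actual dataset schema.

def split_subsets(rows):
    """Return (culturally_agnostic, culturally_sensitive) row lists."""
    agnostic = [r for r in rows if r["cultural_sensitivity_label"] == "CA"]
    sensitive = [r for r in rows if r["cultural_sensitivity_label"] == "CS"]
    return agnostic, sensitive

# Made-up example rows shaped like MMLU questions
rows = [
    {"question": "What is 2 + 2?", "cultural_sensitivity_label": "CA"},
    {"question": "Which dish is traditional at Thanksgiving?",
     "cultural_sensitivity_label": "CS"},
]
agnostic, sensitive = split_subsets(rows)
```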
reacted to fdaudens's post with 🧠 19 days ago
The viz of the day for the Year in Review: a network graph showing likes similarity between models.

Instructive to see which models serve as the "nodes" of the Hub!

Check it out: huggingface/open-source-ai-year-in-review-2024
replied to their post 19 days ago

I thought big and complex repos would be fun to visualize and they can be! This image is from blanchon/RESISC45, a repo with 31,000 images from Google Earth, each bucketed into one of 45 taxonomies with 700 images per taxonomy:

[Treemap: blanchon/RESISC45]

But more fun is when you find a repository that is structured (naming conventions and directories) in a way that lets you see the inequity in the bytes.

This is most apparent in NLP datasets that are multilingual, similar to the wikimedia/wikipedia dataset. If you zoom in on any of these (or run them yourself in the Space), you'll see a directory or file naming convention using the language abbreviation. Sections nearer yellow, whether directories or files, mean more bytes devoted to that language.

Here's facebook/multilingual_librispeech:

[Treemap: facebook/multilingual_librispeech]

and mozilla-foundation/common_voice_17_0:

[Treemap: mozilla-foundation/common_voice_17_0]

and google/xtreme:

[Treemap: google/xtreme]

and uonlp/CulturaX:

[Treemap: uonlp/CulturaX]

Each dataset shows some imbalance in the languages represented, and this pattern holds true for other types of datasets as well. However, such discrepancies can be harder to spot when folder or file naming conventions prioritize machine over human readability.
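The pattern above, summing bytes per language from path naming conventions, reduces to grouping file sizes by their top-level directory. A quick sketch in plain Python (the `(path, size)` pairs are made-up examples, not real repo listings):

```python
from collections import defaultdict

def bytes_per_language(files):
    """Aggregate file sizes by top-level directory, e.g. a language code."""
    totals = defaultdict(int)
    for path, size in files:
        lang = path.split("/")[0]
        totals[lang] += size
    return dict(totals)

# Hypothetical listing shaped like a multilingual dataset repo
files = [
    ("en/train-00000.parquet", 5_000_000_000),
    ("en/train-00001.parquet", 4_800_000_000),
    ("sw/train-00000.parquet", 120_000_000),
    ("yo/train-00000.parquet", 45_000_000),
]
totals = bytes_per_language(files)
```

Feeding totals like these into a treemap is what makes the imbalance jump out visually.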

Another fun example is the nguha/legalbench dataset, designed to evaluate legal reasoning in LLMs. It provides a clear view of the types of reasoning being tested:

[Treemap: nguha/legalbench]

Although you might have to squint to see the labels, this is one where it might be best to head over to the Space https://huggingface.co./spaces/jsulz/repo-info and see it for yourself ;)

replied to their post 20 days ago

Datasets are among my favorite to visualize because of their mixture of files and folder structures. Here's huggingface/documentation-images, where alongside documentation images we store images for the Hugging Face blog.

I also enjoy the wikimedia/wikipedia dataset. It's fascinating to see the distribution of bytes across languages.

Some datasets are actually quite difficult to visualize because the number of points in the Plotly graph causes the browser to crash on render. It's quite possible you'll run into this if you use the Space. A simple check for file count could help, but for now I find myself running it a few times just to see if I can grab the image. allenai is home to many such datasets, but I eventually found allenai/paloma, an eval dataset that I could visualize.

For some of these larger datasets, I might run things locally and write the image out to see if there are any interesting findings.

posted an update 20 days ago
Doing a lot of benchmarking and visualization work, which means I'm always searching for interesting repos in terms of file types, size, branches, and overall structure.

To help, I built a Space, jsulz/repo-info, that lets you search for any repo and get back:

- Treemap of the repository, color coded by file/directory size
- Repo branches and their size
- Cumulative size of different file types (e.g., the total size of all the safetensors in the repo)

And because I'm interested in how this will fit into our work to leverage content-defined chunking for versioning repos on the Hub (https://huggingface.co./blog/from-files-to-chunks), everything has the number of chunks (1 chunk = 64KB) as well as the total size in bytes.

Some of the treemaps are pretty cool. Attached are black-forest-labs/FLUX.1-dev and for fun laion/laion-audio-preview (which has nearly 10k .tar files 🀯)
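The chunk counts the Space reports (1 chunk = 64KB) come down to a ceiling division over each file's byte size. A minimal sketch; note that content-defined chunking itself produces variable-length chunks around a target size, so this fixed-size math is just the approximation described in the post:

```python
CHUNK_SIZE = 64 * 1024  # 64KB, as in the post

def chunk_count(size_bytes):
    """Number of 64KB chunks needed to cover a file of this size."""
    return -(-size_bytes // CHUNK_SIZE)  # ceiling division

def repo_chunks(file_sizes):
    """Total chunk count and total bytes for a list of file sizes."""
    return sum(chunk_count(s) for s in file_sizes), sum(file_sizes)
```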

  • 2 replies
Β·
upvoted an article 21 days ago
New activity in jsulz/jsulz 22 days ago
reacted to clem's post with πŸ”₯πŸš€ 22 days ago
Six predictions for AI in 2025 (and a review of how my 2024 predictions turned out):

- There will be the first major public protest related to AI
- A big company will see its market cap divided by two or more because of AI
- At least 100,000 personal AI robots will be pre-ordered
- China will start to lead the AI race (as a consequence of leading the open-source AI race).
- There will be big breakthroughs in AI for biology and chemistry.
- We will begin to see the economic and employment growth potential of AI, with 15M AI builders on Hugging Face.

How my predictions for 2024 turned out:

- A hyped AI company will go bankrupt or get acquired for a ridiculously low price
βœ… (Inflection, Adept AI, ...)

- Open-source LLMs will reach the level of the best closed-source LLMs
βœ… with QwQ and dozens of others

- Big breakthroughs in AI for video, time-series, biology and chemistry
βœ… for video πŸ”΄ for time-series, biology and chemistry

- We will talk much more about the cost (monetary and environmental) of AI
βœ… Monetary πŸ”΄ Environmental (😒)

- A popular piece of media will be mostly AI-generated
βœ… with NotebookLM by Google

- 10 million AI builders on Hugging Face, leading to no increase in unemployment
πŸ”œ currently 7M AI builders on Hugging Face
reacted to cfahlgren1's post with πŸ‘πŸ”₯πŸš€ 23 days ago
We just dropped an LLM inside the SQL Console 🀯

The amazing, new Qwen/Qwen2.5-Coder-32B-Instruct model can now write SQL for any Hugging Face dataset ✨

It's 2025; you shouldn't be hand-writing SQL! This is a big step toward letting anyone do in-depth analysis on a dataset. Let us know what you think πŸ€—