Banerjee (port8080)
12 followers · 2 following
AI & ML interests
datasets
Recent Activity

reacted to jsulz's post with 👍 · 19 days ago:

Doing a lot of benchmarking and visualization work, which means I'm always searching for interesting repos in terms of file types, size, branches, and overall structure. To help, I built a Space (https://huggingface.co./spaces/jsulz/repo-info) that lets you search for any repo and get back:
- A treemap of the repository, color-coded by file/directory size
- Repo branches and their sizes
- The cumulative size of each file type (e.g., the total size of all the safetensors in the repo)

And because I'm interested in how this fits into our work to leverage content-defined chunking for versioning repos on the Hub (https://huggingface.co./blog/from-files-to-chunks), everything also shows the number of chunks (1 chunk = 64 KB) alongside the total size in bytes.

Some of the treemaps are pretty cool. Attached are https://huggingface.co./black-forest-labs/FLUX.1-dev and, for fun, https://huggingface.co./datasets/laion/laion-audio-preview (which has nearly 10k .tar files 🤯).
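The chunk counts the Space reports follow directly from the post's fixed 64 KB chunk size. A minimal back-of-the-envelope sketch (the `chunk_count` helper below is purely illustrative, not part of the Space):

```python
import math

CHUNK_SIZE = 64 * 1024  # the post's 64 KB chunk size

def chunk_count(size_bytes: int) -> int:
    """Number of 64 KB chunks needed to cover a file of the given size."""
    return math.ceil(size_bytes / CHUNK_SIZE)

# e.g. a 1 GiB safetensors shard:
print(chunk_count(1 * 1024**3))  # → 16384
```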
reacted to jsulz's post with 🔥 · 19 days ago (same post as above)
reacted to jsulz's post with 🔥 · about 1 month ago:

When the XetHub crew joined Hugging Face this fall, @erinys and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time, and that's where our chunk-based approach comes in. Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means:
⏩ Only upload the chunks that changed.
🚀 Download just the updates, not the whole file.
🧠 We store your files as deduplicated chunks.

In our benchmarks, we found that using CDC to store iterative model and dataset versions led to transfer speedups of ~2x, but this isn't just a performance boost. It's a rethinking of how we manage models and datasets on the Hub. We're planning to bring our new storage backend to the Hub in early 2025. Check out our blog to dive deeper, and let us know: how could this improve your workflows? https://huggingface.co./blog/from-files-to-chunks
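The dedup behavior the post describes can be illustrated with a toy content-defined chunker. This is a minimal sketch assuming a Gear-style rolling hash, with chunk sizes shrunk to roughly 8 KB so the demo works on small inputs; it is not the Hub's actual chunker or its real parameters:

```python
import random

# Illustrative content-defined chunking (CDC) sketch. A Gear-style rolling
# hash picks chunk boundaries from the bytes themselves, so an edit only
# perturbs boundaries near the change instead of shifting every chunk.
random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]
MASK = (1 << 13) - 1  # low 13 bits zero => boundary, ~8 KB average chunk

def cdc_chunks(data: bytes, min_size: int = 2048, max_size: int = 16384) -> list[bytes]:
    """Split data into variable-sized chunks at content-defined boundaries."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & MASK) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Inserting a few bytes near the front leaves most downstream boundaries
# intact, so only the changed chunks would need to be re-uploaded.
original = random.randbytes(200_000)
edited = original[:100] + b"patch" + original[100:]
shared = set(cdc_chunks(original)) & set(cdc_chunks(edited))
print(f"{len(shared)} chunks unchanged after the edit")
```

With fixed-size chunking, the 5-byte insertion would shift every later chunk boundary and nothing after the edit would deduplicate; content-defined boundaries resynchronize shortly after the edited region.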
Articles

Rearchitecting Hugging Face Uploads and Downloads · 30 days ago
Organizations

models: None public yet
datasets: None public yet