🌌 Analysis of Spaces in Hugging Face

Community Article Published February 23, 2024

image/png

Dataset: Weyaxi/huggingface-spaces-codes

πŸ€— Huggingface Spaces is a very good, easy-to-use platform to publish various types of applications in seconds and host them freely. With so many positive aspects of Huggingface Spaces, there are now more than 180k spaces in the hub.

❓ How was this made?

Three months ago, I decided to clone all spaces and create a dataset out of them while showcasing meaningful statistics.

After a while, when I tried to scrape approximately 160k spaces, I understood that this was nearly impossible and not necessary to that extent. So, I sorted the spaces based on their likes and tried to scrape 20k of them instead of ~160k.

After encountering this issue, I faced another problem: large files hosted on spaces. So, I focused only on code files within the spaces. Thus, I proceeded to only scrape code files.

Interestingly, there were also spaces with 0 likes at the end of the lists! So, if your space had 0 likes and was lucky enough to get into my list, I scraped them too :)

πŸš€ Resulting dataset

After a long period of time, dealing with rate limiting issues, etc., the final result was approximately 130k files and 2 GB of data. Then another problem arose:

The disk usage wasn't much, but there were a significant number of files.

  • I tried traditional techniques, but they failed.
  • I also attempted to upload it using the multi-commit feature in the Huggingface Hub library, but that also failed because it was uploading too few files in one commit.

I then solved the problem by modifying the library code itself!

It turns out that multi-commit uploads a certain amount of files per commit (I don't remember the exact amount), so I changed that amount to 18000.

I then published the dataset here:

https://huggingface.co./datasets/Weyaxi/huggingface-spaces-codes

πŸ“Š Dataset Statistics

After all of this processing, let's dive deep into the statistics. Here is the overview of the statistics of the spaces that I scrabed.

Language File Extension File Counts File Size (MB) Line Counts
Python .py 141,560 1079.0 28,653,744
SQL .sql 21 523.6 645
JavaScript .js 6,790 369.8 2,137,054
Markdown .md 63,237 273.4 3,110,443
HTML .html 1,953 265.8 516,020
C .c 1,320 132.2 3,558,826
Go .go 429 46.3 6,331
CSS .css 3,097 25.6 386,334
C Header .h 2,824 20.4 570,948
C++ .cpp 1,117 15.3 494,939
TypeScript .ts 4,158 14.8 439,551
TSX .tsx 4,273 9.4 306,416
Shell .sh 3,294 5.5 171,943
Perl .pm 92 4.2 128,594
C# .cs 22 3.9 41,265

πŸ–₯️ Language

As we probably expected, Python is the most used language by a significant margin, while Markdown and JavaScript come after that. (Love Python ❀️)

image/png

πŸ“ Size

If we look at the file size chart, we can see that Python takes up more space among the other languages. SQL comes after that, and I think that's because of the database files hosted in Hugging Face Spaces. After SQL, we come across JavaScript again :)

image/png

πŸ“ Line Count

Python's line count is the highest among other languages, but we encounter another language that comes in second place. Our good old friend: C. C has the most line count in Hugging Face Spaces after Python.

image/png

Spaces Staticts

πŸ› οΈ Software Development Kit (SDK)

Huggingface offers many Software Development Kits (SDKs) to create our spaces. We can see from the chart that the awesome Gradio, which I personally love and use, is the most used SDK in Hugging Face Spaces, while Streamlit comes after that.

image/png

πŸ›οΈ License

As we can see from the chart I provided, the majority of spaces have no license, while the MIT and Apache licenses come in first place after that.

image/png