
nyuuzyou PRO

nyuuzyou

https://ducks.party/donate

AI & ML interests

None yet

Recent Activity

updated a dataset 2 days ago

nyuuzyou/fimfiction

published a dataset 2 days ago

nyuuzyou/fimfiction


nyuuzyou's activity

posted an update 6 days ago

Post

496

🌐 Public MediaWiki Collection Dataset - nyuuzyou/wikis

Collection of 1.66M+ articles from 930 public MediaWiki instances featuring:

- Full article content from diverse public wikis across the internet
- Complete metadata including templates, categories, and section structure
- Rich structural information preserving wiki organization and links
- Multilingual content across 35+ languages including English, Chinese, Spanish, and more
- Regional language variants including US/UK English, Brazilian Portuguese, and Traditional/Simplified Chinese

Key contents:
- 1,662,448 wiki articles with full text
- Extensive metadata including templates, categories, sections
- Internal wikilinks and external reference information
- Cross-domain knowledge spanning multiple topics and fields

posted an update 9 days ago

Post

2444

📚 Historical Russian Technical Journal Images Dataset - nyuuzyou/journals

Collection of digitized pages from vintage Russian technical journals featuring:

- 7.47k high-quality images
- Machine-generated descriptions in Russian
- Valuable historical technical content for image-to-text applications

Content descriptions are dedicated to the public domain under the CC0 1.0 license, allowing unrestricted use without attribution.

posted an update 10 days ago

Post

1972

🌐 Grustnogram Social Media Dataset - nyuuzyou/grustnogram

A comprehensive collection of 597K posts from Grustnogram.ru featuring:

- 597K social media posts with full text and image content (all images are black and white)
- Rich metadata including user IDs, post interactions (likes, comments)
- Content from anonymous text-only posts
- Approximately 278.9 GB of content

Content is dedicated to the public domain under the CC0 1.0 license, allowing unrestricted reuse without attribution or share-alike requirements.

reacted to ngxson's post with 🚀 10 days ago

Post

2898

A comprehensive matrix showing which format you should use.

Read more on my blog post: https://huggingface.co./blog/ngxson/common-ai-model-formats

| Hardware        | GGUF      | PyTorch                | Safetensors              | ONNX  |
|-----------------|-----------|------------------------|--------------------------|-------|
| CPU             | ✅ (best) | 🟡                      | 🟡                       | ✅    |
| GPU             | ✅        | ✅                      | ✅                       | ✅    |
| Mobile          | ✅        | 🟡 (via executorch)     | ❌                       | ✅    |
| Apple silicon   | ✅        | 🟡                      | ✅ (via MLX framework)   | ✅    |

1 reply

posted an update 12 days ago

Post

628

🛫 AEX.ru Aviation News Dataset - nyuuzyou/aex

Key contents:
- 249,149 aviation news articles with full text
- Metadata including tags, image captions, and attributions
- URL information for reference
- Russian language content focusing on aviation topics

reacted to stefan-it's post with 👍 13 days ago

Post

5061

She arrived 😍

[Expect more models soon...]

2 replies

reacted to fdaudens's post with ❤️ 17 days ago

Post

5802

🎯 Perplexity drops their FIRST open-weight model on Hugging Face: A decensored DeepSeek-R1 with full reasoning capabilities. Tested on 1000+ examples for unbiased responses.

Check it out: perplexity-ai/r1-1776
Blog post: https://perplexity.ai/hub/blog/open-sourcing-r1-1776

1 reply

posted an update 17 days ago

Post

1299

🌐 Fandom.com Community Dataset - nyuuzyou/fandom

A comprehensive collection of 7.04M wiki pages from Fandom.com communities featuring:
- Full article content and metadata from current pages
- Rich structural data including templates, categories, and links
- Multilingual content across 40+ languages
- Complete metadata including titles and section structure

Content is available under CC-BY-SA 3.0 license, allowing reuse with attribution and share-alike requirements.

Key contents:
- 7.04M wiki articles with full text
- Metadata including templates, categories, sections
- Internal and external link information
- Multi-language support including major world languages

The dataset provides a valuable resource for:
- Text generation and classification tasks
- Topic modeling and categorization
- Cross-language information retrieval
- Wiki structure analysis

All content comes from public Fandom.com community wikis as of February 2025 and maintains original CC-BY-SA 3.0 licensing.

reacted to jsulz's post with 🤗 25 days ago

Post

3058

Toward the end of last year, the Xet team provided an inside look into the foundations of how we plan to enable rapid experimentation and iteration for the AI builders on the Hub: https://huggingface.co./blog/from-files-to-chunks

But it turns out chunks aren't all you need!

Our goal is to bring:
🚀 Faster uploads
⏬ Speedy downloads
💪 All without sacrificing your workflow

To do that, we need the infrastructure and system design to back it up. As we prepare to roll out the first Xet-backed repositories on the Hub, we wrote up a post explaining the nitty-gritty details of the decisions that bring this to life: https://huggingface.co./blog/from-chunks-to-blocks

Complete with an interactive visualization that shows the power of deduplication in action - taking a 191GB repo to ~97GB and shaving a few hours off upload times.

The darker each block in the heatmap, the more we dedupe, the less we have to transfer. Clicking on a file's blocks shows all other files that share blocks.

Check it out and explore for yourself! xet-team/quantization-dedup
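To give a rough feel for why deduplication shrinks a repository, here is a minimal sketch that counts unique chunks across files. It assumes fixed-size chunks and SHA-256 hashes purely for illustration; the actual Xet chunking and block design is described in the linked posts and is not reproduced here. The file names and the `dedup_stats` helper are hypothetical.

```python
import hashlib


def dedup_stats(files, chunk_size=4):
    """Return (total_bytes, unique_bytes) when identical chunks are stored once.

    Assumption: fixed-size chunking, chosen only to keep the sketch short.
    """
    seen = set()
    total = unique = 0
    for data in files.values():
        for start in range(0, len(data), chunk_size):
            chunk = data[start:start + chunk_size]
            total += len(chunk)
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique += len(chunk)
    return total, unique


# Two toy "model files" that share three of their four chunks.
files = {
    "model-q4.bin": b"AAAABBBBCCCCDDDD",
    "model-q8.bin": b"AAAABBBBEEEEDDDD",
}
total, unique = dedup_stats(files)
saved = 1 - unique / total
print(f"{total} bytes -> {unique} bytes ({saved:.1%} saved)")

# The figures quoted in the post: 191GB down to roughly 97GB.
post_saving = 1 - 97 / 191
print(f"Saving quoted in the post: about {post_saving:.0%}")
```

Only chunks that have not been seen before need to be transferred, which is the effect the heatmap visualizes.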

posted an update 25 days ago

Post

1748

🎓 Educational Text Collection - nyuuzyou/edutexts

A collection of 1.38M educational texts featuring:
- 1.33M educational presentations with full slide content
- 47K academic documents with complete text
- Multilingual content (Russian, Ukrainian, English)
- Full metadata including titles and descriptions

All content is available under CC0 license, allowing unrestricted use including commercial applications.

reacted to Xenova's post with 🔥 29 days ago

Post

8883

We did it. Kokoro TTS (v1.0) can now run 100% locally in your browser w/ WebGPU acceleration. Real-time text-to-speech without a server. ⚡️

Generate 10 seconds of speech in ~1 second for $0.

What will you build? 🔥
webml-community/kokoro-webgpu

The most difficult part was getting the model running in the first place, but the next steps are simple:
✂️ Implement sentence splitting, allowing for streamed responses
🌍 Multilingual support (only phonemization left)

Who wants to help?
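The sentence-splitting step mentioned above could look roughly like the sketch below. It is written in Python only to illustrate the idea (the project itself runs in the browser), uses a naive punctuation-based rule, and the `synthesize` function is a hypothetical placeholder, not part of Kokoro.

```python
import re

# Naive boundary: sentence-ending punctuation followed by whitespace.
# Abbreviations such as "e.g." would be split incorrectly by this rule.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def stream_sentences(text):
    """Yield sentences one at a time so audio can start before the full text is processed."""
    for sentence in _SENTENCE_END.split(text.strip()):
        if sentence:
            yield sentence


def synthesize(sentence):
    """Hypothetical stand-in for a text-to-speech call."""
    return f"<audio for {sentence!r}>"


for sentence in stream_sentences("Hello there. How are you? Fine!"):
    print(synthesize(sentence))
```

Because `stream_sentences` is a generator, each sentence can be handed to the speech model as soon as it is found instead of waiting for the whole input.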

9 replies

reacted to m-ric's post with 🔥 about 1 month ago

Post

9716

Introducing 𝗼𝗽𝗲𝗻 𝗗𝗲𝗲𝗽-𝗥𝗲𝘀𝗲𝗮𝗿𝗰𝗵 by Hugging Face! 💥

OpenAI's latest agentic app Deep Research seems really good... But it's closed, as usual.

⏱️ So with a team of cracked colleagues, we set ourselves a 24-hour deadline to replicate and open-source Deep Research! ⏱️

➡️ We built open-Deep-Research, an entirely open agent that can: navigate the web autonomously, scroll and search through pages, download and manipulate files, run calculations on data...

We aimed for the best performance: are the agent's answers really rigorous?

On the GAIA benchmark, Deep Research had 67% accuracy on the validation set.
➡️ open Deep Research is at 55% (powered by o1), which makes it:
- the best pass@1 solution submitted
- the best open solution 💪💪

And it's only getting started! Please jump in, drop PRs, and let's bring it to the top!

Read the blog post 👉 https://huggingface.co./blog/open-deep-research

posted an update about 1 month ago

Post

2468

📱 UI Navigation Corpus - teleren/ui-navigation-corpus

A comprehensive collection of mobile and web UI elements created by a new member of the Hugging Face community, @teleren. I'm glad that I was able to provide a little help together with @its5Q to get this dataset published.

This dataset contains:
- Screenshots and recordings of mobile (iOS/Android) and web interfaces
- UI navigation annotations and metadata
- Screen categorization tags and text extractions
- Navigation paths and screen relationships
- Version control for UI imagery

Perfect for training UI navigation agents and understanding interface patterns. The dataset provides detailed annotations linking screens, sections, and navigation flows together.

reacted to nroggendorff's post with 👀 about 1 month ago

Post

1812

minor ui update, who dis?

1 reply

reacted to fdaudens's post with ❤️ about 1 month ago

Post

8746

Yes, DeepSeek R1's release is impressive. But the real story is what happened in just 7 days after:

- Original release: 8 models, 540K downloads. Just the beginning...

- The community turned those open-weight models into +550 NEW models on Hugging Face. Total downloads? 2.5M—nearly 5X the originals.

The reason? DeepSeek models are open-weight, letting anyone build on top of them. Interesting to note that the community focused on quantized versions for better efficiency & accessibility. They want models that use less memory, run faster, and are more energy-efficient.

When you empower builders, innovation explodes. For everyone. 🚀

The most popular community model? @bartowski's DeepSeek-R1-Distill-Qwen-32B-GGUF version, with 1M downloads alone.
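The "nearly 5X" figure can be checked with a few lines of arithmetic. This sketch uses only the rounded numbers quoted in the post and assumes the 1M downloads of the most popular model are counted within the 2.5M community total, which the post does not state explicitly.

```python
# Rounded figures quoted in the post.
original_downloads = 540_000      # the 8 original models
community_downloads = 2_500_000   # the 550+ community models
top_model_downloads = 1_000_000   # the most popular community model

ratio = community_downloads / original_downloads
# Assumption: the top model's downloads are part of the community total.
top_model_share = top_model_downloads / community_downloads

print(f"Community vs. original downloads: {ratio:.2f}x")
print(f"Share of the most popular community model: {top_model_share:.0%}")
```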

4 replies

reacted to clem's post with 🤗 about 1 month ago

Post

7224

AI is not a zero-sum game. Open-source AI is the tide that lifts all boats!

reacted to hexgrad's post with ❤️ about 1 month ago

Post

3937

IMHO, being able & willing to defeat CAPTCHA, hCaptcha, or any other reasoning puzzle is a must-have for any Web-Browsing / Computer-Using Agent (WB/CUA).

I realize it subverts the purpose of CAPTCHA, but I do not think you can claim to be building AGI/agents without smoothly passing humanity checks. It would be like getting in a self-driving car that requires human intervention over speed bumps. Claiming AGI or even "somewhat powerful AI" seems hollow if you are halted by a mere CAPTCHA.

I imagine OpenAI's Operator is *able* but *not willing* to defeat CAPTCHA. Like their non-profit status, I expect that policy to evolve over time—and if not, rival agent-builders will attack that opening to offer a better product.

2 replies

posted an update about 2 months ago

Post

461

🤗 Emojis Dataset - nyuuzyou/emojis

A collection of metadata for 3,264,372 AI-generated emoji images featuring:
- URLs to AI-generated emoji artwork images
- Links to both full-resolution transparent PNGs and compressed WebP formats
- Unique identifiers and slugs for each emoji entry
- Original prompts

posted an update about 2 months ago

Post

1495

🤖 Begemot.ai Dataset - nyuuzyou/begemot

A collection of 2,728,999 AI-generated educational projects featuring:
- Comprehensive Russian language educational content
- Complete project metadata including titles, descriptions and chapters
- Educational project descriptions and content
- Direct URLs to project pages
- Project titles and detailed descriptions

All content is available under CC0 license, allowing unrestricted use including commercial applications.

posted an update about 2 months ago

Post

1694

🎨 Artfol Dataset - nyuuzyou/artfol

A collection of 1,892,816 artwork posts featuring:
- High-quality art pieces with various styles and techniques
- Complete metadata including artist IDs, titles, and moderation flags
- Content from Artfol social media platform

The dataset contains:
- Public domain artwork posts
- Artist attribution and identifiers
- Direct image URLs and web page links
- Content safety flags (NSFW, gore)
- Post titles and descriptions

All content is available under CC0 license, allowing unrestricted use including commercial applications.