Dataset Tools

AI & ML interests

Tools for creating and exploring datasets

Dataset-Tools's activity

fdaudens posted an update about 6 hours ago
Honored to be named among the 12 pioneers and power players in the news industry in the 2025 Tech Trends Report from Future Today Strategy Group.

Incredible group to be part of - each person is doing groundbreaking work at the intersection of AI and journalism. Worth following them all: they're consistently sharing practical insights on building the future of news.

Take the time to read this report; it's packed with insights, as always. The news & information section's #1 insight hits hard: "The most substantive economic impact of AI to date has been licensing payouts for a handful of big publishers. The competition will start shifting in the year ahead to separate AI 'haves' that have positioned themselves to grow from the 'have-nots.'"

This AI-driven divide is something I've been really concerned about. Now is the time to build more than ever!

👉 Full report here: https://ftsg.com/wp-content/uploads/2025/03/FTSG_2025_TR_FINAL_LINKED.pdf
Tonic posted an update 3 days ago
🙋🏻‍♂️Hey there folks,

Did you know that you can use ModernBERT to detect model hallucinations?

Check out the demo: Tonic/hallucination-test

See here for a medical-context demo: MultiTransformer/tonic-discharge-guard

Check out the model from KRLabs: KRLabsOrg/lettucedect-large-modernbert-en-v1

And the library they kindly open-sourced for it: https://github.com/KRLabsOrg/LettuceDetect

👆🏻 If you like this topic, please contribute code upstream 🚀
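
Below is a minimal sketch of poking at the checkpoint directly with the standard transformers token-classification pipeline. The way the context, question, and answer are packed into one string here is purely an assumption for illustration; the LettuceDetect library linked above handles the model's real input formatting and span extraction for you.

```python
from transformers import pipeline

# Token-classification pipeline over the LettuceDetect ModernBERT checkpoint.
detector = pipeline(
    "token-classification",
    model="KRLabsOrg/lettucedect-large-modernbert-en-v1",
    aggregation_strategy="simple",  # merge adjacent tokens into labelled spans
)

context = "The Eiffel Tower was completed in 1889 and stands 330 metres tall."
question = "When was the Eiffel Tower completed, and how tall is it?"
answer = "The Eiffel Tower was completed in 1889 and stands 500 metres tall."

# ASSUMPTION: this naive concatenation is only for illustration; use the
# LettuceDetect library for the model's actual input format and span scores.
for span in detector(f"Context: {context} Question: {question} Answer: {answer}"):
    print(span["entity_group"], round(span["score"], 3), span["word"])
```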

fdaudens posted an update 3 days ago
AI will bring us "a country of yes-men on servers" instead of one of "Einsteins sitting in a data center" if we continue on current trends.

Must-read by @thomwolf deflating overblown AI promises and explaining what real scientific breakthroughs require.

https://thomwolf.io/blog/scientific-ai.html
davidberenstein1957 posted an update 4 days ago
Tonic posted an update 4 days ago
Powered by KRLabsOrg/lettucedect-large-modernbert-en-v1 from KRLabsOrg.

Detect hallucinations in answers based on context and questions using ModernBERT with 8192-token context support!

### Model Details
- **Model Name**: [lettucedect-large-modernbert-en-v1](https://huggingface.co./KRLabsOrg/lettucedect-large-modernbert-en-v1)
- **Organization**: [KRLabsOrg](https://huggingface.co./KRLabsOrg)
- **Github**: [https://github.com/KRLabsOrg/LettuceDetect](https://github.com/KRLabsOrg/LettuceDetect)
- **Architecture**: ModernBERT (Large) with extended context support up to 8192 tokens
- **Task**: Token Classification / Hallucination Detection
- **Training Dataset**: [RAGTruth](https://huggingface.co./datasets/wandb/RAGTruth-processed)
- **Language**: English
- **Capabilities**: Detects hallucinated spans in answers, provides confidence scores, and calculates average confidence across detected spans.

LettuceDetect excels at processing long documents to determine if an answer aligns with the provided context, making it a powerful tool for ensuring factual accuracy.
prithivMLmods posted an update 4 days ago
davidberenstein1957 posted an update 5 days ago
🥊 Epic Agent Framework Showdown! Available today!

🔵 In the blue corner, the versatile challenger with a proven track record of knowledge retrieval: LlamaIndex!

🛑 In the red corner, the defender, weighing in with lightweight efficiency: Hugging Face smolagents!

🔗 URL: https://huggingface.co./agents-course

We just published the LlamaIndex unit for the agents course, and it is set to offer a great contrast with the smolagents unit by looking at:

- What makes llama-index stand out
- How the LlamaHub is used for integrations
- Creating QueryEngine components (see the sketch below)
- Using agents and tools
- Agentic and multi-agent workflows

The team has been working flat out on this for a few weeks, supported by Logan Markewich and Laurie Voss over at LlamaIndex.

Who won? You decide!
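
For the QueryEngine item above, here is a minimal sketch of the pattern the unit covers, assuming the current llama-index-core API. The data directory and question are placeholders, and by default llama-index calls OpenAI for embeddings and the LLM unless you configure other models.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load documents from a local folder (placeholder path).
documents = SimpleDirectoryReader("data").load_data()

# Build an in-memory vector index over the documents.
index = VectorStoreIndex.from_documents(documents)

# Turn the index into a QueryEngine and ask a question (placeholder query).
query_engine = index.as_query_engine()
response = query_engine.query("What are the key findings in these documents?")
print(response)
```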
davidberenstein1957 posted an update 5 days ago
🫸 New release to push vector search to the Hub with vicinity and work with any serialisable objects.

🧑‍🏫 KNN, HNSW, USEARCH, ANNOY, PYNNDESCENT, FAISS, and VOYAGER.

🔗 Example Repo: minishlab/my-vicinity-repo
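
A rough sketch of the workflow, based on my reading of the vicinity README; the Hub push/load method names and the toy vectors are assumptions, so check the library docs before relying on them.

```python
import numpy as np
from vicinity import Backend, Vicinity

# Toy items and random vectors purely for illustration.
items = ["doc-1", "doc-2", "doc-3"]
vectors = np.random.rand(len(items), 128).astype(np.float32)

# Backend.BASIC is the dependency-free default; the backends listed in the
# post (HNSW, FAISS, USEARCH, ...) can be swapped in via their extras.
vicinity = Vicinity.from_vectors_and_items(
    vectors=vectors, items=items, backend_type=Backend.BASIC
)

# Query with a vector and get the nearest items back.
print(vicinity.query(vectors[0], k=2))

# ASSUMED API for the Hub round-trip described in the post.
vicinity.push_to_hub("minishlab/my-vicinity-repo")
restored = Vicinity.load_from_hub("minishlab/my-vicinity-repo")
```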
zamal posted an update 7 days ago
🚀 ftBoost is LIVE – Stop Struggling with Fine-Tuning Data!

Alright folks, if you’re tired of manually crafting fine-tuning datasets, ftBoost is here to do the heavy lifting. One-click, LangChain-Groq-powered data augmentation that scales your training data in OpenAI, Gemini, Mistral, and LLaMA formats—automatically.

🔥 What’s inside?
✅ Smart Augmentations – Paraphrasing, back translation, synonym swapping & synthetic noise.
✅ No more JSONL headaches – Auto-formats everything for OpenAI, Gemini, Mistral & LLaMA.
✅ Custom tuning – Adjust similarity, diversity, and fluency in real-time.
✅ Upload, generate, download – That’s it.

⚡ If you’re fine-tuning LLMs, this will save you hours.

🚀 Try it now: 👉 zamal/Finetune-Boost

🌟 Give us a star on GitHub!

Let me know what you think & how it boosts your workflow! 🔥
fdaudens posted an update 9 days ago
What if AI becomes as ubiquitous as the internet, but runs locally and transparently on our devices?

Fascinating TED talk by @thomwolf on open source AI and its future impact.

Imagine this for AI: instead of black box models running in distant data centers, we get transparent AI that runs locally on our phones and laptops, often without needing internet access. If the original team moves on? No problem - resilience is one of the beauties of open source. Anyone (companies, collectives, or individuals) can adapt and fix these models.

This is a compelling vision of AI's future that solves many of today's concerns around AI transparency and centralized control.

Watch the full talk here: https://www.ted.com/talks/thomas_wolf_what_if_ai_just_works
davanstrien posted an update 9 days ago
📊 Introducing "Hugging Face Dataset Spotlight" 📊

I'm excited to share the first episode of our AI-generated podcast series focusing on nice datasets from the Hugging Face Hub!

This first episode explores mathematical reasoning datasets:

- SynthLabsAI/Big-Math-RL-Verified: Over 250,000 rigorously verified problems spanning multiple difficulty levels and mathematical domains
- open-r1/OpenR1-Math-220k: 220,000 math problems with multiple reasoning traces, verified for accuracy using Math Verify and Llama-3.3-70B models.
- facebook/natural_reasoning: 1.1 million general reasoning questions carefully deduplicated and decontaminated from existing benchmarks, showing superior scaling effects when training models like Llama3.1-8B-Instruct.

Plus a bonus segment on bespokelabs/bespoke-manim!

https://www.youtube.com/watch?v=-TgmRq45tW4
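
If you want to poke at one of the datasets from the episode yourself, here is a minimal sketch with the datasets library; the split name is an assumption, so check the dataset card for the actual configuration.

```python
from datasets import load_dataset

# Stream the first few rows so nothing large is downloaded up front.
ds = load_dataset("open-r1/OpenR1-Math-220k", split="train", streaming=True)
for row in ds.take(3):
    print(row)
```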
davanstrien posted an update 10 days ago
Quick POC: Turn a Hugging Face dataset card into a short podcast introducing the dataset using all open models.

I think I'm the only weirdo who would enjoy listening to something like this though 😅

Here is an example for eth-nlped/stepverify
fdaudens posted an update 11 days ago
Is this the best tool to extract clean info from PDFs, handwriting and complex documents yet?

Open source olmOCR just dropped and the results are impressive.

Tested the free demo with various documents, including a handwritten Claes Oldenburg letter. The speed is impressive: 3000 tokens/second on your own GPU - that's 1/32 the cost of GPT-4o ($190/million pages). Game-changer for content extraction and digital archives.

To achieve this, Ai2 trained a 7B vision language model on 260K pages from 100K PDFs using "document anchoring" - combining PDF metadata with page images.

Best part: it actually understands document structure (columns, tables, equations) instead of just jumbling everything together like most OCR tools. Their human eval results back this up.

👉 Try the demo: https://olmocr.allenai.org

Going right into the AI toolkit: JournalistsonHF/ai-toolkit
prithivMLmods posted an update 12 days ago
Dropping some of the custom fine-tunes based on SigLIP2,
with a single-label classification problem type! 🌀🧤

- AI vs Deepfake vs Real : prithivMLmods/AI-vs-Deepfake-vs-Real-Siglip2
- Deepfake Detect : prithivMLmods/Deepfake-Detect-Siglip2
- Fire Detection : prithivMLmods/Fire-Detection-Siglip2
- Deepfake Quality Assess : prithivMLmods/Deepfake-Quality-Assess-Siglip2
- Guard Against Unsafe Content : prithivMLmods/Guard-Against-Unsafe-Content-Siglip2

🌠Collection : prithivMLmods/siglip2-custom-67bcdb2de8fe96b99fb4e19e
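
A minimal sketch of running one of these checkpoints, assuming it loads with the standard transformers image-classification pipeline (the image path is a placeholder):

```python
from PIL import Image
from transformers import pipeline

# Assumes the checkpoint is compatible with the image-classification pipeline.
classifier = pipeline(
    "image-classification",
    model="prithivMLmods/Deepfake-Detect-Siglip2",
)

image = Image.open("example.jpg")  # placeholder local image
for prediction in classifier(image):
    print(f"{prediction['label']}: {prediction['score']:.3f}")
```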
fdaudens posted an update 13 days ago
🚀 Just launched: A toolkit of 20 powerful AI tools that journalists can use right now - transcribe, analyze, create. 100% free & open-source.

Been testing all these tools myself and created a searchable collection of the most practical ones - from audio transcription to image generation to document analysis. No coding needed, no expensive subscriptions.

Some highlights I've tested personally:
- Private, on-device transcription with speaker ID in 100+ languages using Whisper
- Website scraping that just works - paste a URL, get structured data
- Local image editing with tools like Finegrain (impressive results)
- Document chat using Qwen 2.5 72B (handles technical papers well)

Sharing this early because the best tools come from the community. Drop your favorite tools in the comments or join the discussion on what to add next!

👉 JournalistsonHF/ai-toolkit
prithivMLmods posted an update 15 days ago
It's really interesting to see a new state of matter deployed in Majorana 1: the world's first quantum processor powered by topological qubits. If you missed this news this week, here are some links for you:

🅱️Topological qubit arrays: https://arxiv.org/pdf/2502.12252

⚛️ Quantum Blog: https://azure.microsoft.com/en-us/blog/quantum/2025/02/19/microsoft-unveils-majorana-1-the-worlds-first-quantum-processor-powered-by-topological-qubits/

📖 Read the story: https://news.microsoft.com/source/features/innovation/microsofts-majorana-1-chip-carves-new-path-for-quantum-computing/

📝 Majorana 1 Intro: https://youtu.be/Q4xCR20Dh1E?si=Z51DbEYnZFp_88Xp

🌀The Path to a Million Qubits: https://youtu.be/wSHmygPQukQ?si=TS80EhI62oWiMSHK
fdaudens posted an update 16 days ago
davanstrien posted an update 17 days ago
Hacked together a way to log trl GRPO training completions to a 🤗 dataset repo. This allows you to:

- Track rewards from multiple reward functions
- Treat the completions and rewards from training as a "proper" dataset and do EDA
- Share results for open science

The implementation is super hacky, but I'm curious if people would find this useful.

To push completions to the Hub, you just need two extra parameters:

log_completions=True
log_completions_hub_repo='your-username/repo-name'

Example dataset: davanstrien/test-logs
Colab: https://colab.research.google.com/drive/1wzBFPVthRYYTp-mEYlznLg_e_0Za1M3g
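
A rough sketch of where those two parameters would slot into a TRL GRPO setup. Both parameter names are copied from the post, and log_completions_hub_repo in particular belongs to the author's hacked branch rather than stock TRL, so treat the whole snippet as illustrative; the model, dataset, and reward function are placeholders.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Toy reward function: GRPO expects a callable that scores each completion.
def reward_len(completions, **kwargs):
    return [-abs(50 - len(completion)) for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

# log_completions / log_completions_hub_repo are the two parameters from the
# post; the Hub-repo one comes from the hacked branch, not stock TRL.
training_args = GRPOConfig(
    output_dir="grpo-completion-logging",
    log_completions=True,
    log_completions_hub_repo="your-username/repo-name",
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```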

alielfilali01 posted an update 18 days ago
🚨 Arabic LLM Evaluation 🚨

A few models joined the ranking of inceptionai/AraGen-Leaderboard today.

The new Mistral AI model, Saba, is quite impressive: Top 10! Well done @arthurmensch and team.

Sadly, Mistral did not follow its usual open-weights strategy this time; we hope this changes soon and we get the model under a permissive license.

We added other Mistral models, and apparently we have been sleeping on mistralai/Mistral-Large-Instruct-2411!

Another impressive model that joined the ranking today is ALLaM-AI/ALLaM-7B-Instruct-preview. After a long wait, ALLaM is finally here, and it is IMPRESSIVE given its size!

ALLaM is ranked on OALL/Open-Arabic-LLM-Leaderboard as well.
fdaudens posted an update 19 days ago