Chat as a way to query datasets with SQL! The Airtrain AI team is happy to share a new Hugging Face Space that lets you interact with Hugging Face Hub datasets through a natural language chatbot.
This Space is forked from davidberenstein1957/text-to-sql-hub-datasets by @davidberenstein1957 and adds chat capability with improved table naming. The tool works with Hugging Face's recently released in-browser DuckDB-based SQL query engine for datasets.
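That in-browser engine queries the Hub's auto-converted Parquet files with DuckDB running as WASM. As a hedged illustration of the same kind of query from Python rather than the browser engine itself (the dataset URL and column names below are placeholders, not a real file):

```python
import duckdb

# Placeholder URL: any Hub dataset's Parquet export would work here.
parquet_url = (
    "https://huggingface.co/datasets/some-user/some-dataset/"
    "resolve/main/data/train-00000-of-00001.parquet"
)

# httpfs lets DuckDB read remote Parquet over HTTPS (recent versions autoload it).
duckdb.sql("INSTALL httpfs")
duckdb.sql("LOAD httpfs")

# Aggregate directly over the remote file, no download step needed.
result = duckdb.sql(
    f"SELECT label, COUNT(*) AS n FROM '{parquet_url}' GROUP BY label ORDER BY n DESC"
)
print(result)
```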
From Salama1429's post:
Introducing the 101 Billion Arabic Words Dataset
An exciting milestone in Arabic language technology! #NLP #ArabicLLM #LanguageModels
Why It Matters:
1. Large Language Models (LLMs) have brought transformative changes, primarily in English. It's time for Arabic to shine!
2. This project addresses the critical challenge of bias in Arabic LLMs due to reliance on translated datasets.
Approach:
1. Undertook a massive data mining initiative focusing exclusively on Arabic from Common Crawl WET files.
2. Employed state-of-the-art cleaning and deduplication processes to maintain data quality and uniqueness (a rough deduplication sketch follows below).
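The post doesn't name the exact tooling, so purely as a loose illustration: exact deduplication over extracted WET text is often done by hashing normalized documents and keeping only the first occurrence. A minimal Python sketch, where the record layout and helper names are assumptions rather than the project's actual pipeline:

```python
import hashlib

def normalize(text: str) -> str:
    # Collapse whitespace so trivially different copies hash identically.
    return " ".join(text.split())

def deduplicate(records):
    """Yield records whose normalized text has not been seen before (exact dedup)."""
    seen = set()
    for record in records:
        digest = hashlib.sha256(normalize(record["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield record

# Illustrative usage with in-memory records; a real pipeline would stream WET files.
docs = [{"text": "مرحبا بالعالم"}, {"text": " مرحبا  بالعالم "}, {"text": "نص آخر"}]
print(len(list(deduplicate(docs))))  # 2
```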
Impact:
1. Created the largest Arabic dataset to date, with 101 billion words.
2. Enables the development of Arabic LLMs that are linguistically and culturally accurate.
3. Sets a global benchmark for future Arabic language research.
A few days ago we found out that people are actually using it! So I'll use this post to explain how I built it, in case it's useful for the community.
1. I used distilabel's SelfInstruct-inspired task to generate instructions about different math topics. I curated the instructions with Argilla (on Spaces!).
2. Then I used a distilabel Pipeline to build a preference dataset, with gpt-3.5 as generator and gpt-4 as labeller. If I recall correctly, I used our JudgeLM implementation (see https://distilabel.argilla.io/latest/technical-reference/tasks/#judgelmtask). (See the screenshot with the dataset in the Argilla UI.)
3. Then I just binarized the ratings into chosen/rejected pairs and voilà (a rough sketch of this step follows below).
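As an illustration of step 3 only (not the exact code used; the `instruction`/`generations`/`rating` column names are assumptions), binarization can be as simple as keeping the best- and worst-rated generation per instruction:

```python
def binarize(example):
    """Turn N rated generations for one instruction into a single (chosen, rejected) pair."""
    ranked = sorted(
        zip(example["generations"], example["rating"]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return {
        "prompt": example["instruction"],
        "chosen": ranked[0][0],     # highest-rated generation
        "rejected": ranked[-1][0],  # lowest-rated generation
    }

# Example record with two generations rated by the labeller model.
record = {
    "instruction": "What is the derivative of x^2?",
    "generations": ["2x", "x"],
    "rating": [9.0, 3.5],
}
print(binarize(record))
# {'prompt': 'What is the derivative of x^2?', 'chosen': '2x', 'rejected': 'x'}
```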
The funny thing is that I used this dataset to do a second DPO run over Notus-7B. I hoped to see an improvement in math/reasoning skills, but it actually improved on STEM and Humanities and did worse on Math.
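For context, a DPO run like that is typically done with trl's DPOTrainer on the binarized prompt/chosen/rejected columns. The sketch below is not the author's script: the dataset ID is a placeholder, and the argument names follow an older trl API that has since changed, so treat it as a shape rather than a recipe.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_id = "argilla/notus-7b-v1"  # Notus-7B checkpoint on the Hub
prefs = load_dataset("your-org/math-preference-pairs", split="train")  # hypothetical dataset ID

model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

trainer = DPOTrainer(
    model=model,
    ref_model=None,       # trl builds a frozen reference copy when None is passed
    beta=0.1,             # strength of the KL penalty toward the reference model
    args=TrainingArguments(output_dir="notus-7b-dpo-math", per_device_train_batch_size=1),
    train_dataset=prefs,  # expects "prompt", "chosen", "rejected" columns
    tokenizer=tokenizer,
)
trainer.train()
```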
In conclusion, this dataset was only a quick experiment. I'm happy to see the community found it useful. Data for DPO and fine-tuning is still a mystery; let's unveil these mysteries in 2024 together!
Follow me for the most exciting datasets for LLMs (and maybe some great, small, efficient models). I plan to announce all Argilla open-source work here!