Chat as a way to query datasets with SQL! The Airtrain AI team is happy to share a new Hugging Face Space that lets you interact with Hugging Face Hub datasets through a natural language chatbot.
This Space is forked from davidberenstein1957/text-to-sql-hub-datasets by @davidberenstein1957 and adds chat capability with improved table naming. The tool works with Hugging Face's recently released in-browser DuckDB-based SQL query engine for datasets.
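That in-browser engine queries the Hub's auto-converted Parquet files with DuckDB running as WASM. As a hedged illustration of the same kind of query from Python rather than the browser engine itself (the dataset URL and column names below are placeholders, not a real file):

```python
import duckdb

# Placeholder URL: any Hub dataset's Parquet export would work here.
parquet_url = (
    "https://huggingface.co/datasets/some-user/some-dataset/"
    "resolve/main/data/train-00000-of-00001.parquet"
)

# httpfs lets DuckDB read remote Parquet over HTTPS (recent versions autoload it).
duckdb.sql("INSTALL httpfs")
duckdb.sql("LOAD httpfs")

# Aggregate directly over the remote file, no download step needed.
result = duckdb.sql(
    f"SELECT label, COUNT(*) AS n FROM '{parquet_url}' GROUP BY label ORDER BY n DESC"
)
print(result)
```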
From Salama1429's post:
Introducing the 101 Billion Arabic Words Dataset
An exciting milestone in Arabic language technology! #NLP #ArabicLLM #LanguageModels
Why It Matters:
1. Large Language Models (LLMs) have brought transformative changes, primarily in English. It's time for Arabic to shine!
2. This project addresses the critical challenge of bias in Arabic LLMs due to reliance on translated datasets.
Approach:
1. Undertook a massive data mining initiative focusing exclusively on Arabic from Common Crawl WET files.
2. Employed state-of-the-art cleaning and deduplication processes to maintain data quality and uniqueness (a rough deduplication sketch follows below).
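The post doesn't name the exact tooling, so purely as a loose illustration: exact deduplication over extracted WET text is often done by hashing normalized documents and keeping only the first occurrence. A minimal Python sketch, where the record layout and helper names are assumptions rather than the project's actual pipeline:

```python
import hashlib

def normalize(text: str) -> str:
    # Collapse whitespace so trivially different copies hash identically.
    return " ".join(text.split())

def deduplicate(records):
    """Yield records whose normalized text has not been seen before (exact dedup)."""
    seen = set()
    for record in records:
        digest = hashlib.sha256(normalize(record["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield record

# Illustrative usage with in-memory records; a real pipeline would stream WET files.
docs = [{"text": "مرحبا بالعالم"}, {"text": " مرحبا  بالعالم "}, {"text": "نص آخر"}]
print(len(list(deduplicate(docs))))  # 2
```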
Impact:
1. Created the largest Arabic dataset to date, with 101 billion words.
2. Enables the development of Arabic LLMs that are linguistically and culturally accurate.
3. Sets a global benchmark for future Arabic language research.
A few days ago we found out that people are actually using it! So I'll use this post to explain how I built it, in case it's useful for the community.
1. I used distilabel's SelfInstruct-inspired task to generate instructions about different math topics. I curated the instructions with Argilla (on Spaces!).
2. Then I used a distilabel Pipeline to build a preference dataset, with gpt-3.5 as generator and gpt-4 as labeller. If I recall correctly, I used our JudgeLM implementation (see https://distilabel.argilla.io/latest/technical-reference/tasks/#judgelmtask). (See the screenshot with the dataset in the Argilla UI.)
3. Then I just binarized the ratings into chosen/rejected pairs and voilà (a rough sketch of this step follows below).
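As an illustration of step 3 only (not the exact code used; the `instruction`/`generations`/`rating` column names are assumptions), binarization can be as simple as keeping the best- and worst-rated generation per instruction:

```python
def binarize(example):
    """Turn N rated generations for one instruction into a single (chosen, rejected) pair."""
    ranked = sorted(
        zip(example["generations"], example["rating"]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return {
        "prompt": example["instruction"],
        "chosen": ranked[0][0],     # highest-rated generation
        "rejected": ranked[-1][0],  # lowest-rated generation
    }

# Example record with two generations rated by the labeller model.
record = {
    "instruction": "What is the derivative of x^2?",
    "generations": ["2x", "x"],
    "rating": [9.0, 3.5],
}
print(binarize(record))
# {'prompt': 'What is the derivative of x^2?', 'chosen': '2x', 'rejected': 'x'}
```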
The funny thing is that I used this dataset to do a second DPO run over Notus-7B. I hoped to see an improvement in math/reasoning skills, but it actually improved on STEM and Humanities and did worse on Math.
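For context, a DPO run like that is typically done with trl's DPOTrainer on the binarized prompt/chosen/rejected columns. The sketch below is not the author's script: the dataset ID is a placeholder, and the argument names follow an older trl API that has since changed, so treat it as a shape rather than a recipe.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_id = "argilla/notus-7b-v1"  # Notus-7B checkpoint on the Hub
prefs = load_dataset("your-org/math-preference-pairs", split="train")  # hypothetical dataset ID

model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

trainer = DPOTrainer(
    model=model,
    ref_model=None,       # trl builds a frozen reference copy when None is passed
    beta=0.1,             # strength of the KL penalty toward the reference model
    args=TrainingArguments(output_dir="notus-7b-dpo-math", per_device_train_batch_size=1),
    train_dataset=prefs,  # expects "prompt", "chosen", "rejected" columns
    tokenizer=tokenizer,
)
trainer.train()
```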
In conclusion, this dataset was only a quick experiment. I'm happy to see the community found it useful. Data for DPO and fine-tuning is still a mystery; let's unveil these mysteries in 2024 together!
Follow me for the most exciting datasets for LLMs (and maybe some great, small, efficient models). I plan to announce all Argilla open-source work here!