Amir Hossein Kargaran

kargaranamir

AI & ML interests

#NLP, check out https://huggingface.co./cis-lmu

Recent Activity

liked a dataset 2 days ago
PaDaS-Lab/webfaq
liked a dataset 2 days ago
pszemraj/local-emoji-search-gte
liked a dataset 2 days ago
davanstrien/fineweb-c-all

Organizations

Hugging Face
CIS, LMU Munich
DH and NLP Lab
Blog-explorers
Balochi Machine Learning
Social Post Explorers
Hugging Face Discord Community
Nerdy Face
SIG on Iranian languages

Posts 2

Introducing GlotCC: a new 2TB corpus based on an early 2024 CommonCrawl snapshot with data for 1000+ languages.

🤗 corpus v1: cis-lmu/GlotCC-V1
🐱 pipeline v3: https://github.com/cisnlp/GlotCC

More details? Stay tuned for our upcoming paper.
More data? In the next version, we plan to include additional snapshots of CommonCrawl.

Limitation: Because low-resource languages appear far less frequently than other languages, some very low-resource languages have only a few sentences available. However, the English data in this version amounts to 750GB, and the top 200 languages still have a strong presence in our data (see plot attached; the index is labeled once every 20 languages, so the 10th index corresponds to the 200th language).
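
For reading the plot, the index convention can be sketched in a couple of lines of Python. This is only an illustration of the arithmetic stated in the post (one index label per 20 languages, so index 10 is the 200th language). The names `LANGUAGES_PER_INDEX` and `index_to_language_rank` are hypothetical and are not part of GlotCC or its pipeline.

```python
# One index label per 20 languages, as stated in the post.
LANGUAGES_PER_INDEX = 20  # hypothetical constant name, for illustration only


def index_to_language_rank(index: int) -> int:
    """Map a 1-based plot index label to the language rank it refers to."""
    if index < 1:
        raise ValueError("index must be >= 1")
    return index * LANGUAGES_PER_INDEX


if __name__ == "__main__":
    # The post's own example: the 10th index is the 200th language.
    print(index_to_language_rank(10))
```

So, under this convention, the 5th index label would mark the 100th language by data volume.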