luciusssss/mc2_corpus

[ACL'24] MC^2: A Multilingual Corpus of Minority Languages in China (Tibetan, Uyghur, Kazakh, and Mongolian)

Score: 39 / 100 (Emerging)

This project offers MC^2, a large-scale text corpus for four minority languages in China: Tibetan, Uyghur, Kazakh (in Arabic script), and Mongolian (in traditional script). It provides high-quality, culturally aware data intended to improve how language models understand and perform in these underrepresented languages. It is aimed at researchers, linguists, and developers building natural language processing applications for these languages.

Use this if you are developing or evaluating language technologies and need extensive, high-quality text data for Tibetan, Uyghur, Kazakh (Arabic script), or Mongolian (traditional script).

Not ideal if your focus is on well-resourced languages or if you need data for minority languages outside these four.

minority-language-research NLP-development linguistic-studies text-corpus cultural-linguistics
No Package, No Dependents
Maintenance 10 / 25
Adoption 7 / 25
Maturity 16 / 25
Community 6 / 25

Stars: 31
Forks: 2
Language: Python
License: CC0-1.0
Last pushed: Jan 17, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/luciusssss/mc2_corpus"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
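
The same request can be made from Python. This is a minimal sketch using only the standard library. It assumes the endpoint returns a JSON body, which the page does not state, and the helper names `build_quality_url` and `fetch_quality` are hypothetical, introduced here for illustration. The URL pattern (category, owner, repository) is inferred from the single curl example above.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL, following the pattern of the curl example."""
    parts = [quote(p, safe="") for p in (category, owner, repo)]
    return f"{BASE_URL}/{'/'.join(parts)}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for one repository.

    Assumption: the response body is JSON. Adjust the decoding step
    if the service returns something else.
    """
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Prints the URL only; call fetch_quality(...) to perform the request.
    print(build_quality_url("nlp", "luciusssss", "mc2_corpus"))
```

Calling `fetch_quality("nlp", "luciusssss", "mc2_corpus")` performs the network request; keyless requests count against the 100 requests/day limit.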