luciusssss/mc2_corpus

[ACL'24] MC^2: A Multilingual Corpus of Minority Languages in China (Tibetan, Uyghur, Kazakh, and Mongolian)

Emerging

This project offers MC^2, a large-scale text corpus for four minority languages in China: Tibetan, Uyghur, Kazakh (in Arabic script), and Mongolian (in traditional script). It aims to provide high-quality, culturally aware data for improving how well language models understand and perform in these underrepresented languages. It is intended for researchers, linguists, and developers building natural language processing applications for these languages.

Use this if you are developing or evaluating language technologies and need extensive, high-quality text data for Tibetan, Uyghur, Kazakh (Arabic script), or Mongolian (traditional script).

Not ideal if you work on well-resourced languages or need data for minority languages outside these four.
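
Because the corpus covers three distinct writing systems (Tibetan, Arabic for both Uyghur and Kazakh, and traditional Mongolian), a quick script check can help confirm that a text sample is in the expected script before it is used. The following is a minimal sketch, not part of the project: the function name `dominant_script` and the labels are hypothetical, and only the Unicode block ranges are taken from the Unicode Standard. It cannot tell Uyghur from Kazakh, since both use the Arabic block.

```python
from collections import Counter

# Unicode block ranges from the Unicode Standard; the labels are illustrative.
SCRIPT_BLOCKS = {
    "tibetan": (0x0F00, 0x0FFF),
    "arabic": (0x0600, 0x06FF),     # shared by Uyghur and Kazakh (Arabic script)
    "mongolian": (0x1800, 0x18AF),  # traditional Mongolian script
}


def dominant_script(text):
    """Return the label of the block with the most characters in `text`,
    or None if no character falls in any of the three blocks."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_BLOCKS.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Characters outside the three blocks (digits, Latin letters, punctuation, whitespace) are ignored, so a line with some mixed-in Latin text is still assigned to its majority script.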

minority-language-research NLP-development linguistic-studies text-corpus cultural-linguistics

No Package No Dependents

Maintenance 10 / 25

Adoption 7 / 25

Maturity 16 / 25

Community 6 / 25

Language: Python

License: CC0-1.0

Related tools

Scagin/CCTC

A corpus dataset for translating Classical Chinese (wenyanwen) and ancient Chinese texts
