AsoSoft/AsoSoft-Text-Corpus

AsoSoft Text Corpus is the first large-scale text corpus for the Kurdish language.


Experimental

AsoSoft Text Corpus is the first large-scale collection of Kurdish text, covering the Central Kurdish (Sorani) dialect. Raw text gathered from various sources is cleaned and standardized through a detailed normalization process, producing a large, organized corpus ready for analysis. It is aimed at linguists, lexicographers, and natural language processing (NLP) researchers working with Kurdish.
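To illustrate what a cleaning and normalization step of this kind might involve, here is a minimal Python sketch. It is not the corpus authors' pipeline. The character mapping, the function name `normalize_line`, and the choice of steps are assumptions, based on common practice for Arabic-script text, where Arabic Yeh and Kaf are often replaced with the forms conventionally used in Central Kurdish and the tatweel (elongation mark) is removed.

```python
import re
import unicodedata

# Hypothetical mapping for illustration only; the real AsoSoft
# normalization rules are not described in this listing.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u0640": None,      # ARABIC TATWEEL (elongation mark) is deleted
})


def normalize_line(text: str) -> str:
    """Apply a few generic cleanup steps to one line of text."""
    # Compose characters into canonical (NFC) form first.
    text = unicodedata.normalize("NFC", text)
    # Unify look-alike letters and drop the tatweel.
    text = text.translate(CHAR_MAP)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = "  \u0643\u0640\u064A   abc "
    print(repr(normalize_line(sample)))
```

A real pipeline would likely handle many more cases (punctuation, digits, zero-width characters, encoding conversion), so this sketch should be read only as an example of the general shape of such a step.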

No commits in the last 6 months.

Use this if you need a large, pre-processed dataset of Central Kurdish text for linguistic analysis, dictionary creation, or building applications that understand or generate Kurdish text.

Not ideal if your project involves a different dialect of Kurdish or requires data for commercial purposes, as this corpus is strictly for non-commercial research.

Kurdish-language-research linguistics lexicography NLP-data speech-processing

No License · Stale 6m · No Package · No Dependents

Maintenance 0 / 25

Adoption 7 / 25

Maturity 8 / 25

Community 7 / 25

Higher-rated alternatives

Helsinki-NLP/OpusFilter

OpusFilter - Parallel corpus processing toolkit

natasha/corus

Links to Russian corpora + Python functions for loading and parsing

darija-open-dataset/dataset

darija <-> english dataset

omicsNLP/Auto-CORPus

Auto-CORPus pipeline developed by a University of Nottingham and Imperial College London...

SergeyShk/ruTS

A library for extracting statistics from Russian-language texts.
