NLP Corpus Datasets NLP Tools

Curated collections, loaders, and databases of text corpora for NLP research and training. Includes corpus compilation tools, domain-specific annotated datasets, and corpus management systems. Does NOT include tools for corpus analysis, linguistic annotation frameworks, or applications built on top of corpora.

There are 79 NLP corpus dataset tools tracked. Six score 50 or higher (Established tier). The highest-rated is Helsinki-NLP/OpusFilter at 65/100 with 115 stars.

Get all 79 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=nlp-corpus-datasets&limit=20"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
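The same request can be issued from a script. The following is a minimal sketch using only the Python standard library; the helper names `build_url` and `fetch_projects` are illustrative, and the shape of the JSON response is not documented on this page, so the sketch decodes it without assuming any field names. Whether the endpoint accepts a larger `limit` to return all 79 projects in one call is also not stated here, so the default mirrors the curl example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="nlp-corpus-datasets", limit=20):
    """Build the request URL with the same query parameters as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Send a GET request and decode the body as JSON.

    The response schema is an unknown here, so the decoded object is
    returned as-is for the caller to inspect.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example usage (performs a network request, counted against the daily quota):
# data = fetch_projects(build_url())
# print(json.dumps(data, indent=2)[:1000])
```

Keeping URL construction separate from the network call makes it easy to check the query string without spending any of the 100 keyless requests per day.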

# Tool Score Tier
1 Helsinki-NLP/OpusFilter

OpusFilter - Parallel corpus processing toolkit

65
Established
2 natasha/corus

Links to Russian corpora + Python functions for loading and parsing

57
Established
3 SergeyShk/ruTS

Library for extracting statistics from Russian-language texts.

52
Established
4 darija-open-dataset/dataset

Darija <-> English dataset

52
Established
5 omicsNLP/Auto-CORPus

Auto-CORPus pipeline developed by a University of Nottingham and Imperial...

52
Established
6 texttechnologylab/GerParCor

German Parliamentary Corpus (GerParCor)

50
Established
7 texttechnologylab/UCE

The Unified Corpus Explorer (UCE) for UIMA-annotated Corpora.

49
Emerging
8 natasha/nerus

Large silver-standard Russian corpus with NER, morphology and syntax markup

48
Emerging
9 Koziev/NLP_Datasets

My NLP datasets for the Russian language

45
Emerging
10 fido-ai/ua-datasets

A collection of datasets for the Ukrainian language

44
Emerging
11 bureaucratic-labs/dostoevsky

Sentiment analysis library for the Russian language

43
Emerging
12 JuliaText/CorpusLoaders.jl

A variety of loaders for various NLP corpora.

41
Emerging
13 M4t1ss/parallel-corpora-tools

Tools for filtering and cleaning parallel and monolingual corpora for...

41
Emerging
14 notesjor/corpusexplorer2.0

Corpus linguistics has never been so easy...

40
Emerging
15 JonathanReeve/corpus-db

A textual corpus database for the digital humanities.

38
Emerging
16 ericleasemorgan/reader

Distant Reader, a tool for using & understanding a corpus

38
Emerging
17 microsoft/Clandestino

Repository for the Clandestino corpus

37
Emerging
18 adbar/German-NLP

Curated list of open-access/open-source/off-the-shelf resources and tools...

37
Emerging
19 josecannete/spanish-corpora

Unannotated Spanish 3 Billion Words Corpora

37
Emerging
20 t-systems-on-site-services-gmbh/german-wikipedia-text-corpus

This is a German text corpus from Wikipedia. It is cleaned, preprocessed and...

36
Emerging
21 KurdishBLARK/InterdialectCorpus

A parallel corpus of Sorani, Kurmanji and English

36
Emerging
22 yutkin/Lenta.Ru-News-Dataset

Corpus of Russian news articles collected from Lenta.Ru

36
Emerging
23 maxoodf/russian_news_corpus

Russian mass media stemmed texts corpus / Corpus of lemmatized...

36
Emerging
24 practikpharma/PGxCorpus

PGxCorpus, a manually annotated corpus, designed for the extraction of...

35
Emerging
25 ilinguistics/corpus_similarity

Measure the similarity of text corpora for 74 languages

35
Emerging
26 velkadamban/Tamil-Corpus

This Tamil project aims to create a comprehensive and high-quality...

33
Emerging
27 ajithalbus/TamilCorpus

Open Source Tamil Corpus of 58M words

33
Emerging
28 rashiedomar/somali-wikipedia-corpus

Cleaned Somali Wikipedia corpus (~9,500 articles) for NLP, LLM training, and...

33
Emerging
29 madhav1k/OpenCorpus

A multilingual compilation of open-source textual corpora across major &...

32
Emerging
30 microsoft/BrevE-CLaro

Repository for the BrevE and CLaro datasets.

31
Emerging
31 somosnlp/corpus-es

List of NLP corpora in Spanish ✨ #Somos600M: Help develop AI...

30
Emerging
32 notesjor/CorpusExplorer.Terminal.Console

Allows other programs/programming languages to access...

30
Emerging
33 ilinguistics/earthLings

Corpus-based language and dialect mapping

30
Emerging
34 SpydazWebAI-NLP/BasicCorpus2023

A basic corpus object, giving positional encoding/decoding. A fully...

30
Emerging
35 Digital-Pushkin-Lab/RuAdapt_Word_Lists

Word alignments from Russian-Simple Russian parallel data

30
Emerging
36 juwiragiye/ikirundi

The Ikirundi Corpus Project aims to create a comprehensive collection of...

29
Experimental
37 stdlib-js/datasets-moby-dick

The text of Moby Dick by Herman Melville.

29
Experimental
38 davide-ghidelli-business/OpenCorpus

OpenCorpus is a collection of open-source textual corpora from various...

29
Experimental
39 kateryna-bobrovnyk/ukr-twi-corpus

A corpus of Ukrainian Twitter texts + instructions for downloading and...

28
Experimental
40 kscanne/chichewa

NLP resources for Chichewa

28
Experimental
41 d0rj/RusLit

📚 A small collection of Russian literature 📚

28
Experimental
42 DFKI-NLP/product-corpus

This repository contains the DFKI Product Corpus, a dataset of 174 documents...

27
Experimental
43 Kartikaggarwal98/Indian_ParallelCorpus

Curated list of publicly available parallel corpora for Indian languages

27
Experimental
44 AlexKly/Detailed-NER-Dataset-RU

Labeled Russian text token-by-token for training models for NER task based...

26
Experimental
45 gambolputty/textstelle

Textstelle is a collection of corpora for the creation of bots and other...

25
Experimental
46 Digital-Pushkin-Lab/RuAdapt

A Parallel Russian-Simple Russian Dataset

24
Experimental
47 SaiedAlshahrani/performance-implications

Performance Implications of Using Unrepresentative Corpora in Arabic Natural...

23
Experimental
48 karen-pal/borges

Datasets of short-story texts by various Latin American authors....

23
Experimental
49 derintelligence/en-az-parallel-corpus

English-Azerbaijani parallel language corpus

22
Experimental
50 AsoSoft/AsoSoft-Text-Corpus

AsoSoft Text Corpus is the first large scale text corpus for the Kurdish language.

22
Experimental
51 DOLMA-NLP/PARME

Parallel corpora for Middle Eastern languages - ACL2025

22
Experimental
52 mannefedov/ru_kw_eval_datasets

Datasets for evaluation of keyword extraction in Russian

21
Experimental
53 mideind/GreynirCorpus

A large treebank of parsed Icelandic text

20
Experimental
54 KurdishBLARK/KTC

Kurdish Textbooks Corpus

20
Experimental
55 madrugado/gia-corpus

Corpus of exam tests for 9-graders in Russia for NLP/ML purposes

20
Experimental
56 steventan0110/align-filter

Repository for "Bitext Mining for Low-Resource Languages via Contrastive Learning"

20
Experimental
57 lirondos/coalas

COrpus of AngLicisms in the SpAnish PresS (COALAS) 🐨

19
Experimental
58 ixa-ehu/cometa

Website of the CoMeta, a Corpus for Metaphor Detection in Spanish

19
Experimental
59 NLP-UMUTeam/Spanish-MisoCorpus-2020

Spanish MisoCorpus 2020

19
Experimental
60 Ofis-publik-ar-brezhoneg/breton-french-corpus

Bilingual Breton-French corpus

19
Experimental
61 SaiedAlshahrani/Wikipedia-Corpora-Report

Wikipedia Corpora Meta Report: A Metadata Report of How Wikipedia Editions...

18
Experimental
62 TianciGao/RussScholar-Seeker

RussScholar-Seeker: A Python package for predicting whether a name is Russian...

18
Experimental
63 CoffeBank/Ru-hard-detection-dataset

Ru AI-text detection dataset / Russian-language dataset for evaluating detection...

18
Experimental
64 hassanzadehmahdi/BioPersianWikiAnalyzer

Persian Wikipedia Bioinformatics Page Crawler and Text Preprocessor

17
Experimental
65 mosesab/Corpus-based-synonym-finder

Finds the synonym of words in a language using a language corpus

17
Experimental
66 progmatix21/Chilka

A corpus server library with a document database backend.

17
Experimental
67 mmarmonier/ACReFOSC

A companion repository to the French OLDI Seed Corpus. This repository...

14
Experimental
68 josealzugaray/cayetana-corpus

NLP analysis of 107 political speech transcripts (TEI-XML corpus) — topic...

14
Experimental
69 SemiringInc/Mueller-Report-Corpus

The Mueller Report Corpus V 0.1

13
Experimental
70 juletx/corpus-linguistics

Corpus Linguistics slides, labs, assignments and data

12
Experimental
71 bdar-lab/heb_architecture_corpus

Cleaned, parsed, and analyzed Hebrew textual corpus of documents pertaining...

12
Experimental
72 BrightXiaoHan/Yitextor

Parallel corpus processing toolkit forked from...

11
Experimental
73 Uyghur-Corpus/Uyghur-Corpus

Large-scale Uyghur corpus optimized for Large Language Models (LLMs) and NLP...

11
Experimental
74 peghaz/corpora-intersector

High-performance tool to find corpus words missing from large texts using...

11
Experimental
75 Tentakl3/Spanish-NLP-preprocessing

Customized tokenization and preprocessing of Natural Language in Spanish -...

11
Experimental
76 techiaith/corpws-meincnodi-rhannau-ymadrodd

A corpus for benchmarking Welsh part-of-speech taggers

11
Experimental
77 DusunDictionary/dusun-english-malay-corpus

A linguistic corpus of the Dusun language for NMT and LLM training

11
Experimental
78 CesarJNP/Depression-corpus-spanish

Spanish depression-labeled corpus (0/1)

11
Experimental
79 mirfan899/SpaCy3Urdu

Build Urdu SpaCy Model

11
Experimental
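
The tier labels in the listing above appear consistent with fixed score cutoffs. The sketch below encodes that reading as a small function. The cutoffs of 50 and 30 are inferred from the listed scores (50 is the lowest Established score, 30 the lowest Emerging one, 29 the highest Experimental one) and are an assumption, not a rule stated by the source; `tier_for_score` is a hypothetical helper name.

```python
def tier_for_score(score: int) -> str:
    """Map a 0-100 score to a tier label.

    Cutoffs are inferred from the listing, not documented by the source.
    """
    if score >= 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"


# Boundary rows copied from the listing: (score, tier shown on the page).
boundary_rows = {
    "Helsinki-NLP/OpusFilter": (65, "Established"),
    "texttechnologylab/GerParCor": (50, "Established"),
    "texttechnologylab/UCE": (49, "Emerging"),
    "Digital-Pushkin-Lab/RuAdapt_Word_Lists": (30, "Emerging"),
    "juwiragiye/ikirundi": (29, "Experimental"),
}

# Collect any row where the inferred rule disagrees with the page.
mismatches = [
    name
    for name, (score, tier) in boundary_rows.items()
    if tier_for_score(score) != tier
]
print(mismatches)
```

An empty list means the inferred cutoffs reproduce the labels for these boundary rows; it does not confirm how the site actually assigns tiers.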