NLP Corpus Datasets NLP Tools
Curated collections, loaders, and databases of text corpora for NLP research and training. Includes corpus compilation tools, domain-specific annotated datasets, and corpus management systems. Does NOT include tools for corpus analysis, linguistic annotation frameworks, or applications built on top of corpora.
79 NLP corpus dataset tools are tracked. Six score above 50 (Established tier). The highest-rated is Helsinki-NLP/OpusFilter at 65/100 with 115 stars.
Get all 79 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=nlp-corpus-datasets&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
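The same request can be made from Python. This is a minimal sketch using only the standard library; the endpoint and query parameters are copied from the curl command, while the helper names `build_url` and `fetch_projects` are illustrative and not part of any published client. The curl example passes `limit=20`; whether a larger `limit` returns all 79 projects in one call is an assumption that has not been checked here, and the shape of the JSON response is not documented on this page.

```python
import json
import urllib.parse
import urllib.request

# Endpoint copied from the curl command; everything else is illustrative.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="nlp-corpus-datasets", limit=20):
    """Return the request URL with the same query string as the curl example."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(limit=20, timeout=30):
    """Fetch and decode the JSON body.

    The response schema is not documented here, so the decoded object
    is returned as-is without assuming any field names.
    """
    with urllib.request.urlopen(build_url(limit=limit), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_projects()` performs one unauthenticated request, which counts against the 100 requests/day allowance mentioned above.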
| # | Tool | Score | Tier |
|---|---|---|---|
| 1 | Helsinki-NLP/OpusFilter: OpusFilter - Parallel corpus processing toolkit | 65 | Established |
| 2 | natasha/corus: Links to Russian corpora + Python functions for loading and parsing | | Established |
| 3 | SergeyShk/ruTS: Library for extracting statistics from Russian-language texts. | | Established |
| 4 | darija-open-dataset/dataset: Darija <-> English dataset | | Established |
| 5 | omicsNLP/Auto-CORPus: Auto-CORPus pipeline developed by a University of Nottingham and Imperial... | | Established |
| 6 | texttechnologylab/GerParCor: German Parliamentary Corpus (GerParCor) | | Established |
| 7 | texttechnologylab/UCE: The Unified Corpus Explorer (UCE) for UIMA-annotated corpora. | | Emerging |
| 8 | natasha/nerus: Large silver-standard Russian corpus with NER, morphology and syntax markup | | Emerging |
| 9 | Koziev/NLP_Datasets: My NLP datasets for Russian language | | Emerging |
| 10 | fido-ai/ua-datasets: A collection of datasets for Ukrainian language | | Emerging |
| 11 | bureaucratic-labs/dostoevsky: Sentiment analysis library for Russian language | | Emerging |
| 12 | JuliaText/CorpusLoaders.jl: A variety of loaders for various NLP corpora. | | Emerging |
| 13 | M4t1ss/parallel-corpora-tools: Tools for filtering and cleaning parallel and monolingual corpora for... | | Emerging |
| 14 | notesjor/corpusexplorer2.0: Corpus linguistics has never been so easy... | | Emerging |
| 15 | JonathanReeve/corpus-db: A textual corpus database for the digital humanities. | | Emerging |
| 16 | ericleasemorgan/reader: Distant Reader, a tool for using & understanding a corpus | | Emerging |
| 17 | microsoft/Clandestino: Repository for the Clandestino corpus | | Emerging |
| 18 | adbar/German-NLP: Curated list of open-access/open-source/off-the-shelf resources and tools... | | Emerging |
| 19 | josecannete/spanish-corpora: Unannotated Spanish 3 Billion Words Corpora | | Emerging |
| 20 | t-systems-on-site-services-gmbh/german-wikipedia-text-corpus: This is a German text corpus from Wikipedia. It is cleaned, preprocessed and... | | Emerging |
| 21 | KurdishBLARK/InterdialectCorpus: A parallel corpus of Sorani, Kurmanji and English | | Emerging |
| 22 | yutkin/Lenta.Ru-News-Dataset: Corpus of Russian news articles collected from Lenta.Ru | | Emerging |
| 23 | maxoodf/russian_news_corpus: Russian mass media stemmed texts corpus / Corpus of lemmatized... | | Emerging |
| 24 | practikpharma/PGxCorpus: PGxCorpus, a manually annotated corpus, designed for the extraction of... | | Emerging |
| 25 | ilinguistics/corpus_similarity: Measure the similarity of text corpora for 74 languages | | Emerging |
| 26 | velkadamban/Tamil-Corpus: This Tamil project aims to create a comprehensive and high-quality... | | Emerging |
| 27 | ajithalbus/TamilCorpus: Open Source Tamil Corpus of 58M words | | Emerging |
| 28 | rashiedomar/somali-wikipedia-corpus: Cleaned Somali Wikipedia corpus (~9,500 articles) for NLP, LLM training, and... | | Emerging |
| 29 | madhav1k/OpenCorpus: A multilingual compilation of open-source textual corpora across major &... | | Emerging |
| 30 | microsoft/BrevE-CLaro: Repository for the BrevE and CLaro datasets. | | Emerging |
| 31 | somosnlp/corpus-es: List of NLP corpora in Spanish ✨ #Somos600M: Help develop AI... | | Emerging |
| 32 | notesjor/CorpusExplorer.Terminal.Console: Allows other programs/programming languages to access... | | Emerging |
| 33 | ilinguistics/earthLings: Corpus-based language and dialect mapping | | Emerging |
| 34 | SpydazWebAI-NLP/BasicCorpus2023: A basic corpus object, giving positional encoding/decoding. A fully... | | Emerging |
| 35 | Digital-Pushkin-Lab/RuAdapt_Word_Lists: Word alignments from Russian-Simple Russian parallel data | | Emerging |
| 36 | juwiragiye/ikirundi: The Ikirundi Corpus Project aims to create a comprehensive collection of... | | Experimental |
| 37 | stdlib-js/datasets-moby-dick: The text of Moby Dick by Herman Melville. | | Experimental |
| 38 | davide-ghidelli-business/OpenCorpus: OpenCorpus is a collection of open-source textual corpora from various... | | Experimental |
| 39 | kateryna-bobrovnyk/ukr-twi-corpus: A corpus of Ukrainian Twitter texts + instructions for downloading and... | | Experimental |
| 40 | kscanne/chichewa: NLP resources for Chichewa | | Experimental |
| 41 | d0rj/RusLit: 📚 A small collection of Russian literature 📚 | | Experimental |
| 42 | DFKI-NLP/product-corpus: This repository contains the DFKI Product Corpus, a dataset of 174 documents... | | Experimental |
| 43 | Kartikaggarwal98/Indian_ParallelCorpus: Curated list of publicly available parallel corpora for Indian languages | | Experimental |
| 44 | AlexKly/Detailed-NER-Dataset-RU: Labeled Russian text token-by-token for training models for NER task based... | | Experimental |
| 45 | gambolputty/textstelle: Textstelle is a collection of corpora for the creation of bots and other... | | Experimental |
| 46 | Digital-Pushkin-Lab/RuAdapt: A Parallel Russian-Simple Russian Dataset | | Experimental |
| 47 | SaiedAlshahrani/performance-implications: Performance Implications of Using Unrepresentative Corpora in Arabic Natural... | | Experimental |
| 48 | karen-pal/borges: Datasets of short-story texts by various Latin American authors.... | | Experimental |
| 49 | derintelligence/en-az-parallel-corpus: English-Azerbaijani parallel language corpus | | Experimental |
| 50 | AsoSoft/AsoSoft-Text-Corpus: AsoSoft Text Corpus is the first large scale text corpus for the Kurdish language. | | Experimental |
| 51 | DOLMA-NLP/PARME: Parallel corpora for Middle Eastern languages - ACL2025 | | Experimental |
| 52 | mannefedov/ru_kw_eval_datasets: Datasets for evaluation of keyword extraction in Russian | | Experimental |
| 53 | mideind/GreynirCorpus: A large treebank of parsed Icelandic text | | Experimental |
| 54 | KurdishBLARK/KTC: Kurdish Textbooks Corpus | | Experimental |
| 55 | madrugado/gia-corpus: Corpus of exam tests for 9-graders in Russia for NLP/ML purposes | | Experimental |
| 56 | steventan0110/align-filter: Repository for "Bitext Mining for Low-Resource Languages via Contrastive Learning" | | Experimental |
| 57 | lirondos/coalas: COrpus of AngLicisms in the SpAnish PresS (COALAS) 🐨 | | Experimental |
| 58 | ixa-ehu/cometa: Website of the CoMeta, a Corpus for Metaphor Detection in Spanish | | Experimental |
| 59 | NLP-UMUTeam/Spanish-MisoCorpus-2020: Spanish MisoCorpus 2020 | | Experimental |
| 60 | Ofis-publik-ar-brezhoneg/breton-french-corpus: Bilingual Breton-French corpus | | Experimental |
| 61 | SaiedAlshahrani/Wikipedia-Corpora-Report: Wikipedia Corpora Meta Report: A Metadata Report of How Wikipedia Editions... | | Experimental |
| 62 | TianciGao/RussScholar-Seeker: RussScholar-Seeker: A Python package for predicting whether a name is Russian... | | Experimental |
| 63 | CoffeBank/Ru-hard-detection-dataset: Ru AI-text detection dataset / Russian-language dataset for evaluating detection... | | Experimental |
| 64 | hassanzadehmahdi/BioPersianWikiAnalyzer: Persian Wikipedia Bioinformatics Page Crawler and Text Preprocessor | | Experimental |
| 65 | mosesab/Corpus-based-synonym-finder: Finds the synonym of words in a language using a language corpus | | Experimental |
| 66 | progmatix21/Chilka: A corpus server library with a document database backend. | | Experimental |
| 67 | mmarmonier/ACReFOSC: A companion repository to the French OLDI Seed Corpus. This repository... | | Experimental |
| 68 | josealzugaray/cayetana-corpus: NLP analysis of 107 political speech transcripts (TEI-XML corpus) - topic... | | Experimental |
| 69 | SemiringInc/Mueller-Report-Corpus: The Mueller Report Corpus V 0.1 | | Experimental |
| 70 | juletx/corpus-linguistics: Corpus Linguistics slides, labs, assignments and data | | Experimental |
| 71 | bdar-lab/heb_architecture_corpus: Cleaned, parsed, and analyzed Hebrew textual corpus of documents pertaining... | | Experimental |
| 72 | BrightXiaoHan/Yitextor: Parallel corpus processing toolkit forked from... | | Experimental |
| 73 | Uyghur-Corpus/Uyghur-Corpus: Large-scale Uyghur corpus optimized for Large Language Models (LLMs) and NLP... | | Experimental |
| 74 | peghaz/corpora-intersector: High-performance tool to find corpus words missing from large texts using... | | Experimental |
| 75 | Tentakl3/Spanish-NLP-preprocessing: Customized tokenization and preprocessing of Natural Language in Spanish -... | | Experimental |
| 76 | techiaith/corpws-meincnodi-rhannau-ymadrodd: A corpus for benchmarking Welsh part-of-speech taggers | | Experimental |
| 77 | DusunDictionary/dusun-english-malay-corpus: A linguistic corpus of the Dusun language for NMT and LLM training | | Experimental |
| 78 | CesarJNP/Depression-corpus-spanish: Spanish depression-labeled corpus (0/1) | | Experimental |
| 79 | mirfan899/SpaCy3Urdu: Build Urdu SpaCy Model | | Experimental |