NLP Corpus Datasets NLP Tools

Curated collections, loaders, and databases of text corpora for NLP research and training. Includes corpus compilation tools, domain-specific annotated datasets, and corpus management systems. Does NOT include tools for corpus analysis, linguistic annotation frameworks, or applications built on top of corpora.

79 NLP corpus dataset tools are tracked. 6 score 50 or higher (Established tier). The highest-rated is Helsinki-NLP/OpusFilter at 65/100 with 115 stars.

Get all 79 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=nlp-corpus-datasets&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
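The same request can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the endpoint and query parameters are copied from the curl command above, while the function names `build_url` and `fetch_projects` are hypothetical. The response schema is not documented here, so the sketch returns the parsed JSON without assuming its structure. Whether a larger `limit` returns all 79 projects in one call is also an assumption to verify.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit=20):
    """Build the query URL; parameter names mirror the curl example."""
    params = {
        "domain": "nlp",
        "subcategory": "nlp-corpus-datasets",
        "limit": limit,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_projects(limit=20, timeout=30):
    """Fetch and parse the JSON response (hypothetical helper).

    The response structure is not assumed; the caller inspects it.
    """
    with urlopen(build_url(limit), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    data = fetch_projects()
    # Pretty-print a prefix of the payload to see what came back.
    print(json.dumps(data, indent=2, ensure_ascii=False)[:2000])
```

Keeping URL construction separate from the network call makes it easy to check the query string without spending any of the 100 daily requests.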

#	Tool	Score	Tier	Stars	Language
1	Helsinki-NLP/OpusFilter OpusFilter - Parallel corpus processing toolkit	65	Established	115	Python
2	natasha/corus Links to Russian corpora + Python functions for loading and parsing	57	Established	310	Jupyter Notebook
3	SergeyShk/ruTS Library for extracting statistics from Russian-language texts.	52	Established	125	Python
4	darija-open-dataset/dataset Darija <-> English dataset	52	Established	363	—
5	omicsNLP/Auto-CORPus Auto-CORPus pipeline developed by a University of Nottingham and Imperial...	52	Established	22	HTML
6	texttechnologylab/GerParCor German Parliamentary Corpus (GerParCor)	50	Established	30	Java
7	texttechnologylab/UCE The Unified Corpus Explorer (UCE) for UIMA-annotated Corpora.	49	Emerging	7	Java
8	natasha/nerus Large silver standard Russian corpus with NER, morphology and syntax markup	48	Emerging	73	Python
9	Koziev/NLP_Datasets My NLP datasets for Russian language	45	Emerging	386	C#
10	fido-ai/ua-datasets A collection of datasets for Ukrainian language	44	Emerging	56	Python
11	bureaucratic-labs/dostoevsky Sentiment analysis library for Russian language	43	Emerging	320	Python
12	JuliaText/CorpusLoaders.jl A variety of loaders for various NLP corpora.	41	Emerging	32	Julia
13	M4t1ss/parallel-corpora-tools Tools for filtering and cleaning parallel and monolingual corpora for...	41	Emerging	41	PHP
14	notesjor/corpusexplorer2.0 Corpus linguistics has never been so easy...	40	Emerging	25	C#
15	JonathanReeve/corpus-db A textual corpus database for the digital humanities.	38	Emerging	63	Jupyter Notebook
16	ericleasemorgan/reader Distant Reader, a tool for using & understanding a corpus	38	Emerging	20	Shell
17	microsoft/Clandestino Repository for the Clandestino corpus	37	Emerging	10	—
18	adbar/German-NLP Curated list of open-access/open-source/off-the-shelf resources and tools...	37	Emerging	518	—
19	josecannete/spanish-corpora Unannotated Spanish 3 Billion Words Corpora	37	Emerging	104	Python
20	t-systems-on-site-services-gmbh/german-wikipedia-text-corpus This is a German text corpus from Wikipedia. It is cleaned, preprocessed and...	36	Emerging	23	—
21	KurdishBLARK/InterdialectCorpus A parallel corpus of Sorani, Kurmanji and English	36	Emerging	15	—
22	yutkin/Lenta.Ru-News-Dataset Corpus of Russian news articles collected from Lenta.Ru	36	Emerging	145	Python
23	maxoodf/russian_news_corpus Russian mass media stemmed texts corpus / Corpus of lemmatized...	36	Emerging	93	—
24	practikpharma/PGxCorpus PGxCorpus, a manually annotated corpus, designed for the extraction of...	35	Emerging	8	Lua
25	ilinguistics/corpus_similarity Measure the similarity of text corpora for 74 languages	35	Emerging	14	Python
26	velkadamban/Tamil-Corpus This nTamil project aims to create a comprehensive and high-quality...	33	Emerging	5	Roff
27	ajithalbus/TamilCorpus Open Source Tamil Corpus of 58M words	33	Emerging	11	Shell
28	rashiedomar/somali-wikipedia-corpus Cleaned Somali Wikipedia corpus (~9,500 articles) for NLP, LLM training, and...	33	Emerging	5	—
29	madhav1k/OpenCorpus A multilingual compilation of open-source textual corpora across major &...	32	Emerging	4	—
30	microsoft/BrevE-CLaro Repository for the BrevE and CLaro datasets.	31	Emerging	4	—
31	somosnlp/corpus-es List of NLP corpora in Spanish ✨ #Somos600M: Help develop AI...	30	Emerging	25	Python
32	notesjor/CorpusExplorer.Terminal.Console Allows other programs/programming languages to access...	30	Emerging	7	C#
33	ilinguistics/earthLings Corpus-based language and dialect mapping	30	Emerging	7	—
34	SpydazWebAI-NLP/BasicCorpus2023 A Basic Corpus Object, Giving Positional Encoding / Decoding. A Fully...	30	Emerging	1	Visual Basic .NET
35	Digital-Pushkin-Lab/RuAdapt_Word_Lists Word alignments from Russian-Simple Russian parallel data	30	Emerging	6	—
36	juwiragiye/ikirundi The Ikirundi Corpus Project aims to create a comprehensive collection of...	29	Experimental	1	Python
37	stdlib-js/datasets-moby-dick The text of Moby Dick by Herman Melville.	29	Experimental	4	JavaScript
38	davide-ghidelli-business/OpenCorpus OpenCorpus is a collection of open-source textual corpora from various...	29	Experimental	1	—
39	kateryna-bobrovnyk/ukr-twi-corpus A corpus of Ukrainian Twitter texts + instructions for downloading and...	28	Experimental	15	Jupyter Notebook
40	kscanne/chichewa NLP resources for Chichewa	28	Experimental	10	Makefile
41	d0rj/RusLit 📚 A small collection of Russian literature 📚	28	Experimental	13	—
42	DFKI-NLP/product-corpus This repository contains the DFKI Product Corpus, a dataset of 174 documents...	27	Experimental	12	—
43	Kartikaggarwal98/Indian_ParallelCorpus Curated list of publicly available parallel corpus for Indian Languages	27	Experimental	37	—
44	AlexKly/Detailed-NER-Dataset-RU Labeled Russian text token-by-token for training models for NER task based...	26	Experimental	10	Python
45	gambolputty/textstelle Textstelle is a collection of corpora for the creation of bots and other...	25	Experimental	21	—
46	Digital-Pushkin-Lab/RuAdapt A Parallel Russian-Simple Russian Dataset	24	Experimental	15	—
47	SaiedAlshahrani/performance-implications Performance Implications of Using Unrepresentative Corpora in Arabic Natural...	23	Experimental	3	Jupyter Notebook
48	karen-pal/borges Datasets of short-story texts by various Latin American authors....	23	Experimental	16	Jupyter Notebook
49	derintelligence/en-az-parallel-corpus English-Azerbaijani parallel language corpus	22	Experimental	20	—
50	AsoSoft/AsoSoft-Text-Corpus AsoSoft Text Corpus is the first large scale text corpus for the Kurdish language.	22	Experimental	27	—
51	DOLMA-NLP/PARME Parallel corpora for Middle Eastern languages - ACL2025	22	Experimental	8	Python
52	mannefedov/ru_kw_eval_datasets Datasets for evaluation of keyword extraction in Russian	21	Experimental	31	—
53	mideind/GreynirCorpus A large treebank of parsed Icelandic text	20	Experimental	8	—
54	KurdishBLARK/KTC Kurdish Textbooks Corpus	20	Experimental	8	—
55	madrugado/gia-corpus Corpus of exam tests for 9-graders in Russia for NLP/ML purposes	20	Experimental	8	—
56	steventan0110/align-filter Repository for "Bitext Mining for Low-Resource Languages via Contrastive Learning"	20	Experimental	5	Python
57	lirondos/coalas COrpus of AngLicisms in the SpAnish PresS (COALAS) 🐨	19	Experimental	4	—
58	ixa-ehu/cometa Website of the CoMeta, a Corpus for Metaphor Detection in Spanish	19	Experimental	4	Python
59	NLP-UMUTeam/Spanish-MisoCorpus-2020 Spanish MisoCorpus 2020	19	Experimental	—	—
60	Ofis-publik-ar-brezhoneg/breton-french-corpus Bilingual Breton-French corpus	19	Experimental	1	—
61	SaiedAlshahrani/Wikipedia-Corpora-Report Wikipedia Corpora Meta Report: A Metadata Report of How Wikipedia Editions...	18	Experimental	2	Python
62	TianciGao/RussScholar-Seeker RussScholar-Seeker: A Python package for predicting whether a name is Russian...	18	Experimental	2	Python
63	CoffeBank/Ru-hard-detection-dataset Ru AI-text detection dataset / Russian-language dataset for evaluating detection...	18	Experimental	1	—
64	hassanzadehmahdi/BioPersianWikiAnalyzer Persian Wikipedia Bioinformatics Page Crawler and Text Preprocessor	17	Experimental	1	Jupyter Notebook
65	mosesab/Corpus-based-synonym-finder Finds the synonym of words in a language using a language corpus	17	Experimental	1	Python
66	progmatix21/Chilka A corpus server library with a document database backend.	17	Experimental	1	Python
67	mmarmonier/ACReFOSC A companion repository to the French OLDI Seed Corpus. This repository...	14	Experimental	1	Jupyter Notebook
68	josealzugaray/cayetana-corpus NLP analysis of 107 political speech transcripts (TEI-XML corpus) — topic...	14	Experimental	—	HTML
69	SemiringInc/Mueller-Report-Corpus The Mueller Report Corpus V 0.1	13	Experimental	11	—
70	juletx/corpus-linguistics Corpus Linguistics slides, labs, assignments and data	12	Experimental	7	R
71	bdar-lab/heb_architecture_corpus Cleaned, parsed, and analyzed Hebrew textual corpus of documents pertaining...	12	Experimental	6	—
72	BrightXiaoHan/Yitextor Parallel corpus processing toolkit forked from...	11	Experimental	2	Python
73	Uyghur-Corpus/Uyghur-Corpus Large-scale Uyghur corpus optimized for Large Language Models (LLMs) and NLP...	11	Experimental	—	—
74	peghaz/corpora-intersector High-performance tool to find corpus words missing from large texts using...	11	Experimental	—	OCaml
75	Tentakl3/Spanish-NLP-preprocessing Customized tokenization and preprocessing of Natural Language in Spanish -...	11	Experimental	2	Python
76	techiaith/corpws-meincnodi-rhannau-ymadrodd Corpus for benchmarking Welsh part-of-speech taggers | A corpus for...	11	Experimental	—	—
77	DusunDictionary/dusun-english-malay-corpus A linguistic corpus of the Dusun language for NMT and LLM training	11	Experimental	—	—
78	CesarJNP/Depression-corpus-spanish Spanish depression-labeled corpus (0/1)	11	Experimental	—	Python
79	mirfan899/SpaCy3Urdu Build Urdu SpaCy Model	11	Experimental	3	Python