NLP Dataset Collections NLP Tools

Curated lists, catalogs, and repositories of NLP datasets organized by language, task, or domain. Does NOT include individual datasets, dataset creation tools, or data annotation platforms.

There are 101 NLP dataset collection tools tracked. One scores above 70 (Verified tier). The highest-rated is acl-org/acl-anthology at 76/100 with 693 stars. One of the top 10 is actively maintained.

Get all 101 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=nlp-dataset-collections&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
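The same request can be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_url` and `fetch_projects` are hypothetical, and the shape of the returned JSON is not documented on this page, so the code returns the decoded payload without assuming any fields. The default `limit=20` mirrors the curl example above; whether larger values are accepted is an assumption to verify against the API.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit=20, domain="nlp", subcategory="nlp-dataset-collections"):
    """Build the query URL with the same parameters as the curl example."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(limit=20, timeout=30):
    """Fetch and decode the JSON response (hypothetical helper).

    The response schema is an unknown here, so the decoded object is
    returned as-is for the caller to inspect.
    """
    with urllib.request.urlopen(build_url(limit=limit), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_projects()` performs one network request and counts against the daily quota of the keyless tier.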

#	Tool	Score	Tier	Stars	Language
1	acl-org/acl-anthology Data and software for building the ACL Anthology.	76	Verified	693	Python
2	anoopkunchukuttan/indic_nlp_library Resources and tools for Indian language Natural Language Processing	64	Established	630	Python
3	CLUEbenchmark/CLUECorpus2020 Large-scale Pre-training Corpus for Chinese (100G)	53	Established	1,002	—
4	KennethEnevoldsen/scandinavian-embedding-benchmark A Scandinavian Benchmark for sentence embeddings	47	Emerging	46	Python
5	Separius/awesome-sentence-embedding A curated list of pretrained sentence and word embedding models	47	Emerging	2,290	Python
6	SudhirGadhvi/open-vernacular-ai-kit Clean Indian code-mixed text before it reaches your LLM.	46	Emerging	5	Python
7	AndyTheFactory/romanian-nlp-datasets A list of Romanian NLP Datasets	46	Emerging	56	—
8	banglakit/awesome-bangla A collection of tools, datasets and resources on Bangla computing	45	Emerging	564	—
9	AI4Bharat/Indic-BERT-v1 Indic-BERT-v1: BERT-based Multilingual Model for 11 Indic Languages and...	44	Emerging	291	Python
10	masakhane-io/masakhane-community All our community docs! Start here! Let's put Africa on the NLP Map	44	Emerging	67	—
11	mirfan899/Urdu Collection of Urdu datasets for POS, NER, Sentiment, Summarization and NLP tasks.	44	Emerging	73	—
12	knadh/indic.page A directory of Indic (Indian) language computing resources.	42	Emerging	65	HTML
13	dsfsi/masakhane-web Masakhane Web is a translation web application solely for African languages.	41	Emerging	37	Jupyter Notebook
14	Smat26/Roman-Urdu-Dataset Compilation of Manually Tagged Roman Urdu Dataset (Urdu written in...	41	Emerging	34	—
15	shjwudp/c4-dataset-script Inspired by google c4, here is a series of colossal clean data cleaning...	41	Emerging	135	Python
16	praatibhsurana/Hinglish_Hindi_WSD A pipeline for transliteration, spell correction, POS tagging and word sense...	40	Emerging	37	Python
17	yisaienkov/tinysets The project aims to collect various datasets for tasks such as...	39	Emerging	6	Python
18	amir9ume/urdu_ghazals_rekhta Dataset for Urdu Ghazals	39	Emerging	20	Jupyter Notebook
19	jcblaisecruz02/Filipino-Text-Benchmarks Open-source benchmark datasets and pretrained transformer models in the...	38	Emerging	64	Python
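20	CLUEbenchmark/CLUEPretrainedModels A collection of high-quality Chinese pre-trained models: state-of-the-art large models, fastest small models, and similarity-specialized models	38	Emerging	816	Python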
21	computerclubkec/constitution-of-nepal-dataset A structured and organized dataset of the Constitution of Nepal in...	38	Emerging	7	—
22	uma-pi1/OPIEC Reading the data from OPIEC - an Open Information Extraction corpus	37	Emerging	38	Java
23	Vikhram-S/IndianConstitution A Python library for exploring the Constitution of India.	37	Emerging	2	Python
24	csebuetnlp/banglabert This repository contains the official release of the model "BanglaBERT" and...	36	Emerging	248	Python
25	cambridgeltl/cometa Corpus of Online Medical EnTities: the cometA corpus	36	Emerging	51	Jupyter Notebook
26	federicarollo/Italian-Crime-News A dataset from the Gazzetta di Modena newspaper about crime events in the...	36	Emerging	7	Java
27	banglanlp/bnlp-resources Awesome datasets for Bangla language computing.	35	Emerging	64	Python
28	zhanlaoban/NLP_PEMDC NLP Pretrained Embeddings, Models and Datasets Collections (NLP_PEMDC). The...	34	Emerging	65	—
29	UsmanNiazi/DUC-2004-Dataset This Repo Contains the DUC 2004 Dataset	34	Emerging	5	—
30	jacklanda/CCAE [NLPCC 2023] CCAE: A Corpus of Chinese-based Asian Englishes	33	Emerging	57	Python
31	MuhammadYaseenKhan/Urdu-Sentiment-Corpus Labelled Dataset for Urdu Sentiment Analysis	32	Emerging	9	—
32	anoopkunchukuttan/meteor_indic METEOR for Indian languages (originally forked from METEOR 1.4)	31	Emerging	3	Java
33	mussacharles60/swahili-dictionary Swahili dictionary for implementing in your projects	31	Emerging	4	JavaScript
34	Sueza-project/Sueza_project Linguistic database collection for the revitalization of Cameroonian local...	31	Emerging	2	HTML
35	s-bose/Walks-into-a-bar-dataset A dataset containing 1000+ walks-into-a-bar jokes scraped from the internet.	31	Emerging	2	Jupyter Notebook
36	crux82/huric HuRIC 2.0 - the Human Robot Interaction Corpus	31	Emerging	17	—
37	lanwuwei/Twitter-URL-Corpus Large scale sentential paraphrases collection and annotation	30	Emerging	46	HTML
38	Riccorl/nlp-dataset-readers Readers for NLP Datasets	30	Emerging	3	Python
39	EthioNLP/Resource This repository contains research papers and datasets for different NLP...	30	Emerging	1	—
40	Andrews2017/africanlp-public-datasets A repository for publicly/freely available Natural Language Processing (NLP)...	30	Emerging	114	—
41	hrgupta/indian-scriptures This repository contains various Indian scriptures 📜 in a structured .csv...	29	Experimental	3	Jupyter Notebook
42	UKPLab/useb Heterogenous, Task- and Domain-Specific Benchmark for Unsupervised Sentence...	29	Experimental	29	Python
43	mrpeerat/Thai-Sentence-Vector-Benchmark Benchmark for Thai sentence representation	29	Experimental	133	Jupyter Notebook
44	kili-technology/awesome-datasets A comprehensive list of annotated training datasets classified by use case.	29	Experimental	38	—
45	t-systems-on-site-services-gmbh/german-elmo-model This is a German ELMo deep contextualized word representation. It is trained...	27	Experimental	28	—
46	Hironsan/wiki-article-dataset Wikipedia article dataset	27	Experimental	12	Jupyter Notebook
47	mapmeld/hindi-bert Hindi NLP work	27	Experimental	14	Jupyter Notebook
48	COS301-SE-2025/Mafoko Mafoko is a progressive web app (PWA) that provides access to multilingual...	27	Experimental	2	TypeScript
49	maxent-ai/Datasets datasets with text data for use in NLP, Text analysis, information...	27	Experimental	16	Jupyter Notebook
50	kassemsabeh/open-brand The dataset contains over 250k product brand-value annotations with more...	27	Experimental	14	Python
51	massanishi/hackernews-post-datasets Datasets for hackernews posts	27	Experimental	16	—
52	reem-codes/ArMATH ArMATH: The Arabic Math Word Problem dataset. Accepted in LREC2022	27	Experimental	10	Python
53	SuzanaK/language_datasets Language Datasets for NLP, Machine Learning, and Map Creation	26	Experimental	6	—
54	hyunwoongko/nlp-datasets Curation note of NLP datasets	26	Experimental	98	—
55	Pogayo/Luo-News-Dataset This repo contains LUO corpus for Named Entity Recognition. The text comes...	26	Experimental	7	—
56	aalok-sathe/sentspace a module to obtain diverse real-world-grounded features for sentences for...	26	Experimental	5	Python
57	quality-attributes/datasets Official data sources for the Quality Attributes project	25	Experimental	6	Jupyter Notebook
58	OumaimaHourrane/MA_Open_Datasets Moroccan NLP Datasets and Corpora	23	Experimental	3	Jupyter Notebook
59	nlp-waseda/comet-atomic-ja COMET-ATOMIC ja	23	Experimental	31	Python
60	filbench/filbench-eval Experiments and Analyses for FilBench: An Open LLM Leaderboard for Filipino...	22	Experimental	9	Python
61	VLa-Labs/Danish-Language-Dataset-List A curated metadata collection of 31 publicly available Danish language datasets.	22	Experimental	3	—
62	megagonlabs/ebe-dataset Evidence-based Explanation Dataset (AACL-IJCNLP 2020)	22	Experimental	18	PLSQL
63	aviaefrat/cryptonite The Official Repository of the Cryptonite Dataset	21	Experimental	23	Python
64	pln-fing-udelar/humor HUMOR dataset for humor research	21	Experimental	7	HTML
65	ART-Group-it/GASP GASP! Dataset - Generating Abstracts of Scientific Papers from Abstracts of...	21	Experimental	9	—
66	mohansaidinesh/Datasets Datasets for Machine Learning	21	Experimental	4	Python
67	sniperx-19/awesome-sentence-embedding A curated list of pretrained sentence and word embedding models	20	Experimental	5	Python
68	Niger-Volta-LTI/urhobo-text Urhobo language training text for NLP, ASR and TTS tasks	20	Experimental	6	—
69	kaisugi/datasets-for-sequential-sentence-classification Curated list of public datasets which focus on sentence classification in...	20	Experimental	5	—
70	ICPSR/dataset-references NER pipeline to detect dataset references for ASIST 2022 paper	19	Experimental	3	Jupyter Notebook
71	dsfsi/PuoData Curated corpora for Setswana. Used to train PuoBERTa.	19	Experimental	3	—
72	OpenCENIA/SRN Spanish Resources and Evaluation	19	Experimental	3	—
73	KushtrimVisoka/Kosovo-Parliament-Transcriptions NOTE: The dataset is maintained exclusively on HuggingFace Datasets. The...	19	Experimental	3	Jupyter Notebook
74	dsfsi/project-state-capture Zondo Commission or State Capture Commission Transcripts	19	Experimental	3	—
75	jonas-becker/pd-human-vs-machine-content The official repository for the paper "Paraphrase Detection: Human vs....	19	Experimental	3	HTML
76	slvnwhrl/sigmorphon2022-models This repository contains the models used by the CLUZH team for the...	19	Experimental	3	Python
77	bluechoochoo/retired_comedy_phrases A Casual Spreadsheets resource	19	Experimental	13	—
78	Archaeocomputers/Bessarion A text and imaging dataset of Byzantine-era Medieval Greek inscriptions.	19	Experimental	4	Python
79	createmomo/supporting-comedy-writers Predicting Audience’s Response from Sketch Comedy and Crosstalk Scripts (A...	19	Experimental	3	—
80	mzmmoazam/kashmiri_dataset Data and tool to fetch kashmiri text	19	Experimental	16	HTML
81	NetworkTheoryAppliedResearchInstitute/anthropology- Comprehensive AI training corpus for anthropology education: 580K tokens...	19	Experimental	—	—
82	CyberAgentAILab/AdParaphrase This repository contains data for our paper "AdParaphrase: Paraphrase...	19	Experimental	1	—
83	radi-cho/noisy-sentences-dataset 550K sentences in 5 European languages augmented with noise for training and...	18	Experimental	2	—
84	NoelShallum/all-indian-acts Repository containing all Indian Acts and statutes in the PDF and txt...	18	Experimental	2	—
85	rmdodhia/dataset-detection Detects datasets used in journal papers	18	Experimental	2	Python
86	dsfsi/zabantu-beta ZaBantu is a fleet of light-weight Masked Language Models for Southern Bantu...	18	Experimental	2	Python
87	metriccoders/metriccoders_datasets This is the Metric Coders repository containing all the datasets for machine...	17	Experimental	1	—
88	felixgiov/public-meeting Dataset from the paper "Information Extraction from Public Meeting Articles"	17	Experimental	1	—
89	jahidulzaid/BanglaNostalgia A benchmark and training pipeline for detecting nostalgia in Bangla text....	17	Experimental	2	Python
90	BrianMsane/siSwati-Datasets Repository for siSwati NLP datasets which I have worked on in my research....	17	Experimental	1	—
91	davidwarrior22/machine-translation-for-african-languages This repository focuses on developing machine translation and NLP tools...	14	Experimental	—	TeX
92	sanjanalreddy/NLP-Datasets List of NLP Datasets	13	Experimental	10	—
93	stefan-it/gc4lm GC4LM: A Colossal (Biased) language model for German	13	Experimental	13	—
94	mikesdatawork/ai-ml-datasets-hub Curated collection of high-quality datasets optimized for AI/ML pipelines,...	12	Experimental	1	Python
95	zyuanlim/singlish-manglish-nlp Resources for Singlish and Manglish NLP.	12	Experimental	8	Jupyter Notebook
96	Mufassir-Chowdhury/BnPC This is the official repository of the paper titled "BnPC: A Gold Standard...	11	Experimental	4	Jupyter Notebook
97	MusfiqDehan/bn-en-aligner Tool to easily align Bangla and English words from sentences	11	Experimental	—	JavaScript
98	Unipisa/admin-It Dataset for automatic readability assessment and text simplification of...	11	Experimental	3	—
99	cvjena/chiasmus-annotations German Chiasmus Dataset	11	Experimental	3	Python
100	Aman-byte1/amharic-conversation-and-math-dataset Amharic word problems with numerical answers, and conversational exchanges in English and Amharic...	10	Experimental	1	—
101	shercostiniano/filipino-stoytelling-ner Open-source repository for our paper in Thesis 1	10	Experimental	2	Jupyter Notebook

Comparisons in this category

CLUECorpus2020 and CLUEPretrainedModels (53 vs 38)