NLP Dataset Collections NLP Tools
Curated lists, catalogs, and repositories of NLP datasets organized by language, task, or domain. Does NOT include individual datasets, dataset creation tools, or data annotation platforms.
There are 101 NLP dataset collection tools tracked. One scores above 70 (Verified tier). The highest-rated is acl-org/acl-anthology at 76/100 with 693 stars. One of the top 10 is actively maintained.
Get all 101 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=nlp-dataset-collections&limit=20"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
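The same request can be issued from Python. The following is a minimal sketch using only the standard library. The helper names `build_url` and `fetch_projects` are hypothetical, the only query parameters used are the ones visible in the curl command above, and the shape of the returned JSON is not documented here, so the parsed payload is returned without interpretation.

```python
import json
import urllib.parse
import urllib.request
from typing import Any

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain: str = "nlp",
              subcategory: str = "nlp-dataset-collections",
              limit: int = 20) -> str:
    """Build the request URL from the three query parameters shown in the curl example."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(limit: int = 20, timeout: float = 10.0) -> Any:
    """Fetch and decode the JSON response.

    The response schema is an unknown here, so the decoded payload is
    returned unchanged rather than mapped onto assumed field names.
    """
    with urllib.request.urlopen(build_url(limit=limit), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_projects()` would send the same request as the curl command. Whether a larger `limit` returns all 101 projects in one response is not stated on this page.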
| # | Tool | Score | Tier |
|---|---|---|---|
| 1 | acl-org/acl-anthology: Data and software for building the ACL Anthology. | | Verified |
| 2 | anoopkunchukuttan/indic_nlp_library: Resources and tools for Indian language Natural Language Processing | | Established |
| 3 | CLUEbenchmark/CLUECorpus2020: Large-scale Pre-training Corpus for Chinese (100 GB Chinese pre-training corpus) | | Established |
| 4 | KennethEnevoldsen/scandinavian-embedding-benchmark: A Scandinavian Benchmark for sentence embeddings | | Emerging |
| 5 | Separius/awesome-sentence-embedding: A curated list of pretrained sentence and word embedding models | | Emerging |
| 6 | SudhirGadhvi/open-vernacular-ai-kit: Clean Indian code-mixed text before it reaches your LLM. | | Emerging |
| 7 | AndyTheFactory/romanian-nlp-datasets: A list of Romanian NLP Datasets | | Emerging |
| 8 | banglakit/awesome-bangla: A collection of tools, datasets and resources on Bangla computing | | Emerging |
| 9 | AI4Bharat/Indic-BERT-v1: Indic-BERT-v1: BERT-based Multilingual Model for 11 Indic Languages and... | | Emerging |
| 10 | masakhane-io/masakhane-community: All our community docs! Start here! Let's put Africa on the NLP Map | | Emerging |
| 11 | mirfan899/Urdu: Collection of Urdu datasets for POS, NER, Sentiment, Summarization and NLP tasks. | | Emerging |
| 12 | knadh/indic.page: A directory of Indic (Indian) language computing resources. | | Emerging |
| 13 | dsfsi/masakhane-web: Masakhane Web is a translation web application solely for African languages. | | Emerging |
| 14 | Smat26/Roman-Urdu-Dataset: Compilation of Manually Tagged Roman Urdu Dataset (Urdu written in... | | Emerging |
| 15 | shjwudp/c4-dataset-script: Inspired by Google C4, here is a series of colossal clean data cleaning... | | Emerging |
| 16 | praatibhsurana/Hinglish_Hindi_WSD: A pipeline for transliteration, spell correction, POS tagging and word sense... | | Emerging |
| 17 | yisaienkov/tinysets: The project aims to collect various datasets for tasks such as... | | Emerging |
| 18 | amir9ume/urdu_ghazals_rekhta: Dataset for Urdu Ghazals | | Emerging |
| 19 | jcblaisecruz02/Filipino-Text-Benchmarks: Open-source benchmark datasets and pretrained transformer models in the... | | Emerging |
| 20 | CLUEbenchmark/CLUEPretrainedModels: A collection of high-quality Chinese pre-trained models: state-of-the-art large models, fastest small models, and similarity-specialized models | | Emerging |
| 21 | computerclubkec/constitution-of-nepal-dataset: A structured and organized dataset of the Constitution of Nepal in... | | Emerging |
| 22 | uma-pi1/OPIEC: Reading the data from OPIEC - an Open Information Extraction corpus | | Emerging |
| 23 | Vikhram-S/IndianConstitution: A Python library for exploring the Constitution of India. | | Emerging |
| 24 | csebuetnlp/banglabert: This repository contains the official release of the model "BanglaBERT" and... | | Emerging |
| 25 | cambridgeltl/cometa: Corpus of Online Medical EnTities: the cometA corpus | | Emerging |
| 26 | federicarollo/Italian-Crime-News: A dataset from the Gazzetta di Modena newspaper about crime events in the... | | Emerging |
| 27 | banglanlp/bnlp-resources: Awesome datasets for Bangla language computing. | | Emerging |
| 28 | zhanlaoban/NLP_PEMDC: NLP Pretrained Embeddings, Models and Datasets Collections (NLP_PEMDC). The... | | Emerging |
| 29 | UsmanNiazi/DUC-2004-Dataset: This repo contains the DUC 2004 Dataset | | Emerging |
| 30 | jacklanda/CCAE: [NLPCC 2023] CCAE: A Corpus of Chinese-based Asian Englishes | | Emerging |
| 31 | MuhammadYaseenKhan/Urdu-Sentiment-Corpus: Labelled Dataset for Urdu Sentiment Analysis | | Emerging |
| 32 | anoopkunchukuttan/meteor_indic: METEOR for Indian languages (originally forked from METEOR 1.4) | | Emerging |
| 33 | mussacharles60/swahili-dictionary: Swahili dictionary for implementing in your projects | | Emerging |
| 34 | Sueza-project/Sueza_project: Linguistic database collection for the revitalization of Cameroonian local... | | Emerging |
| 35 | s-bose/Walks-into-a-bar-dataset: A dataset containing 1000+ walks-into-a-bar jokes scraped from the internet. | | Emerging |
| 36 | crux82/huric: HuRIC 2.0 - the Human Robot Interaction Corpus | | Emerging |
| 37 | lanwuwei/Twitter-URL-Corpus: Large scale sentential paraphrases collection and annotation | | Emerging |
| 38 | Riccorl/nlp-dataset-readers: Readers for NLP Datasets | | Emerging |
| 39 | EthioNLP/Resource: This repository contains research papers and datasets for different NLP... | | Emerging |
| 40 | Andrews2017/africanlp-public-datasets: A repository for publicly/freely available Natural Language Processing (NLP)... | | Emerging |
| 41 | hrgupta/indian-scriptures: This repository contains various Indian scriptures 📜 in a structured .csv... | | Experimental |
| 42 | UKPLab/useb: Heterogenous, Task- and Domain-Specific Benchmark for Unsupervised Sentence... | | Experimental |
| 43 | mrpeerat/Thai-Sentence-Vector-Benchmark: Benchmark for Thai sentence representation | | Experimental |
| 44 | kili-technology/awesome-datasets: A comprehensive list of annotated training datasets classified by use case. | | Experimental |
| 45 | t-systems-on-site-services-gmbh/german-elmo-model: This is a German ELMo deep contextualized word representation. It is trained... | | Experimental |
| 46 | Hironsan/wiki-article-dataset: Wikipedia article dataset | | Experimental |
| 47 | mapmeld/hindi-bert: Hindi NLP work | | Experimental |
| 48 | COS301-SE-2025/Mafoko: Mafoko is a progressive web app (PWA) that provides access to multilingual... | | Experimental |
| 49 | maxent-ai/Datasets: Datasets with text data for use in NLP, text analysis, information... | | Experimental |
| 50 | kassemsabeh/open-brand: The dataset contains over 250k product brand-value annotations with more... | | Experimental |
| 51 | massanishi/hackernews-post-datasets: Datasets for Hacker News posts | | Experimental |
| 52 | reem-codes/ArMATH: ArMATH: The Arabic Math Word Problem dataset. Accepted in LREC2022 | | Experimental |
| 53 | SuzanaK/language_datasets: Language Datasets for NLP, Machine Learning, and Map Creation | | Experimental |
| 54 | hyunwoongko/nlp-datasets: Curation note of NLP datasets | | Experimental |
| 55 | Pogayo/Luo-News-Dataset: This repo contains LUO corpus for Named Entity Recognition. The text comes... | | Experimental |
| 56 | aalok-sathe/sentspace: A module to obtain diverse real-world-grounded features for sentences for... | | Experimental |
| 57 | quality-attributes/datasets: Official data sources for the Quality Attributes project | | Experimental |
| 58 | OumaimaHourrane/MA_Open_Datasets: Moroccan NLP Datasets and Corpora | | Experimental |
| 59 | nlp-waseda/comet-atomic-ja: COMET-ATOMIC ja | | Experimental |
| 60 | filbench/filbench-eval: Experiments and Analyses for FilBench: An Open LLM Leaderboard for Filipino... | | Experimental |
| 61 | VLa-Labs/Danish-Language-Dataset-List: A curated metadata collection of 31 publicly available Danish language datasets. | | Experimental |
| 62 | megagonlabs/ebe-dataset: Evidence-based Explanation Dataset (AACL-IJCNLP 2020) | | Experimental |
| 63 | aviaefrat/cryptonite: The Official Repository of the Cryptonite Dataset | | Experimental |
| 64 | pln-fing-udelar/humor: HUMOR dataset for humor research | | Experimental |
| 65 | ART-Group-it/GASP: GASP! Dataset - Generating Abstracts of Scientific Papers from Abstracts of... | | Experimental |
| 66 | mohansaidinesh/Datasets: Datasets for Machine Learning | | Experimental |
| 67 | sniperx-19/awesome-sentence-embedding: A curated list of pretrained sentence and word embedding models | | Experimental |
| 68 | Niger-Volta-LTI/urhobo-text: Urhobo language training text for NLP, ASR and TTS tasks | | Experimental |
| 69 | kaisugi/datasets-for-sequential-sentence-classification: Curated list of public datasets which focus on sentence classification in... | | Experimental |
| 70 | ICPSR/dataset-references: NER pipeline to detect dataset references for ASIST 2022 paper | | Experimental |
| 71 | dsfsi/PuoData: Curated corpora for Setswana. Used to train PuoBERTa. | | Experimental |
| 72 | OpenCENIA/SRN: Spanish Resources and Evaluation | | Experimental |
| 73 | KushtrimVisoka/Kosovo-Parliament-Transcriptions: NOTE: The dataset is maintained exclusively on HuggingFace Datasets. The... | | Experimental |
| 74 | dsfsi/project-state-capture: Zondo Commission or State Capture Commission Transcripts | | Experimental |
| 75 | jonas-becker/pd-human-vs-machine-content: The official repository for the paper "Paraphrase Detection: Human vs.... | | Experimental |
| 76 | slvnwhrl/sigmorphon2022-models: This repository contains the models used by the CLUZH team for the... | | Experimental |
| 77 | bluechoochoo/retired_comedy_phrases: A Casual Spreadsheets resource | | Experimental |
| 78 | Archaeocomputers/Bessarion: A text and imaging dataset of Byzantine-era Medieval Greek inscriptions. | | Experimental |
| 79 | createmomo/supporting-comedy-writers: Predicting Audience’s Response from Sketch Comedy and Crosstalk Scripts (A... | | Experimental |
| 80 | mzmmoazam/kashmiri_dataset: Data and tool to fetch Kashmiri text | | Experimental |
| 81 | NetworkTheoryAppliedResearchInstitute/anthropology-: Comprehensive AI training corpus for anthropology education: 580K tokens... | | Experimental |
| 82 | CyberAgentAILab/AdParaphrase: This repository contains data for our paper "AdParaphrase: Paraphrase... | | Experimental |
| 83 | radi-cho/noisy-sentences-dataset: 550K sentences in 5 European languages augmented with noise for training and... | | Experimental |
| 84 | NoelShallum/all-indian-acts: Repository containing all Indian Acts and statutes in the PDF and txt... | | Experimental |
| 85 | rmdodhia/dataset-detection: Detects datasets used in journal papers | | Experimental |
| 86 | dsfsi/zabantu-beta: ZaBantu is a fleet of light-weight Masked Language Models for Southern Bantu... | | Experimental |
| 87 | metriccoders/metriccoders_datasets: This is the Metric Coders repository containing all the datasets for machine... | | Experimental |
| 88 | felixgiov/public-meeting: Dataset from the paper "Information Extraction from Public Meeting Articles" | | Experimental |
| 89 | jahidulzaid/BanglaNostalgia: A benchmark and training pipeline for detecting nostalgia in Bangla text.... | | Experimental |
| 90 | BrianMsane/siSwati-Datasets: Repository for siSwati NLP datasets which I have worked on in my research.... | | Experimental |
| 91 | davidwarrior22/machine-translation-for-african-languages: This repository focuses on developing machine translation and NLP tools... | | Experimental |
| 92 | sanjanalreddy/NLP-Datasets: List of NLP Datasets | | Experimental |
| 93 | stefan-it/gc4lm: GC4LM: A Colossal (Biased) language model for German | | Experimental |
| 94 | mikesdatawork/ai-ml-datasets-hub: Curated collection of high-quality datasets optimized for AI/ML pipelines,... | | Experimental |
| 95 | zyuanlim/singlish-manglish-nlp: Resources for Singlish and Manglish NLP. | | Experimental |
| 96 | Mufassir-Chowdhury/BnPC: This is the official repository of the paper titled "BnPC: A Gold Standard... | | Experimental |
| 97 | MusfiqDehan/bn-en-aligner: Tool to easily align Bangla and English words from sentences | | Experimental |
| 98 | Unipisa/admin-It: Dataset for automatic readability assessment and text simplification of... | | Experimental |
| 99 | cvjena/chiasmus-annotations: German Chiasmus Dataset | | Experimental |
| 100 | Aman-byte1/amharic-conversation-and-math-dataset: Amharic word problems with numerical answers, and conversational exchanges conducted in English and Amharic... | | Experimental |
| 101 | shercostiniano/filipino-stoytelling-ner: Open-source repository for our paper in Thesis 1 | | Experimental |