NLP Dataset Collections NLP Tools

Curated lists, catalogs, and repositories of NLP datasets organized by language, task, or domain. Does NOT include individual datasets, dataset creation tools, or data annotation platforms.

There are 101 NLP dataset collection tools tracked. One scores above 70 (Verified tier). The highest-rated is acl-org/acl-anthology at 76/100 with 693 stars. One of the top 10 is actively maintained.

Get all 101 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=nlp-dataset-collections&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
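For scripted access, the same request can be made from Python. The following is a minimal sketch using only the standard library; the function names `build_url` and `fetch_projects` are illustrative and not part of the API. The response schema is not documented on this page, so the sketch returns the decoded JSON without assuming any field names. The example request uses `limit=20`; whether larger values or paging are supported is not stated here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint and query parameters copied from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="nlp-dataset-collections", limit=20):
    """Build the request URL with the same query parameters as the curl call."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(limit=20, timeout=10):
    """Fetch one response and return the decoded JSON as-is.

    The shape of the payload is an unknown here, so nothing is
    assumed about its keys or nesting.
    """
    with urlopen(build_url(limit=limit), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_projects()` performs a single unauthenticated request, which counts against the 100 requests/day allowance mentioned above.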

# Tool Score Tier
1 acl-org/acl-anthology

Data and software for building the ACL Anthology.

76
Verified
2 anoopkunchukuttan/indic_nlp_library

Resources and tools for Indian language Natural Language Processing

64
Established
3 CLUEbenchmark/CLUECorpus2020

Large-scale pre-training corpus for Chinese (100 GB)

53
Established
4 KennethEnevoldsen/scandinavian-embedding-benchmark

A Scandinavian Benchmark for sentence embeddings

47
Emerging
5 Separius/awesome-sentence-embedding

A curated list of pretrained sentence and word embedding models

47
Emerging
6 SudhirGadhvi/open-vernacular-ai-kit

Clean Indian code-mixed text before it reaches your LLM.

46
Emerging
7 AndyTheFactory/romanian-nlp-datasets

A list of Romanian NLP Datasets

46
Emerging
8 banglakit/awesome-bangla

A collection of tools, datasets and resources on Bangla computing

45
Emerging
9 AI4Bharat/Indic-BERT-v1

Indic-BERT-v1: BERT-based Multilingual Model for 11 Indic Languages and...

44
Emerging
10 masakhane-io/masakhane-community

All our community docs! Start here! Let's put Africa on the NLP Map

44
Emerging
11 mirfan899/Urdu

Collection of Urdu datasets for POS, NER, Sentiment, Summarization and NLP tasks.

44
Emerging
12 knadh/indic.page

A directory of Indic (Indian) language computing resources.

42
Emerging
13 dsfsi/masakhane-web

Masakhane Web is a translation web application solely for African languages.

41
Emerging
14 Smat26/Roman-Urdu-Dataset

Compilation of Manually Tagged Roman Urdu Dataset (Urdu written in...

41
Emerging
15 shjwudp/c4-dataset-script

Inspired by Google C4, here is a series of colossal clean data cleaning...

41
Emerging
16 praatibhsurana/Hinglish_Hindi_WSD

A pipeline for transliteration, spell correction, POS tagging and word sense...

40
Emerging
17 yisaienkov/tinysets

The project aims to collect various datasets for tasks such as...

39
Emerging
18 amir9ume/urdu_ghazals_rekhta

Dataset for Urdu Ghazals

39
Emerging
19 jcblaisecruz02/Filipino-Text-Benchmarks

Open-source benchmark datasets and pretrained transformer models in the...

38
Emerging
20 CLUEbenchmark/CLUEPretrainedModels

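A collection of high-quality Chinese pre-trained models: state-of-the-art large models, fastest small models, and models specialized for similarity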

38
Emerging
21 computerclubkec/constitution-of-nepal-dataset

A structured and organized dataset of the Constitution of Nepal in...

38
Emerging
22 uma-pi1/OPIEC

Reading the data from OPIEC - an Open Information Extraction corpus

37
Emerging
23 Vikhram-S/IndianConstitution

A Python library for exploring the Constitution of India.

37
Emerging
24 csebuetnlp/banglabert

This repository contains the official release of the model "BanglaBERT" and...

36
Emerging
25 cambridgeltl/cometa

Corpus of Online Medical EnTities: the cometA corpus

36
Emerging
26 federicarollo/Italian-Crime-News

A dataset from the Gazzetta di Modena newspaper about crime events in the...

36
Emerging
27 banglanlp/bnlp-resources

Awesome datasets for Bangla language computing.

35
Emerging
28 zhanlaoban/NLP_PEMDC

NLP Pretrained Embeddings, Models and Datasets Collections (NLP_PEMDC). The...

34
Emerging
29 UsmanNiazi/DUC-2004-Dataset

This Repo Contains the DUC 2004 Dataset

34
Emerging
30 jacklanda/CCAE

[NLPCC 2023] CCAE: A Corpus of Chinese-based Asian Englishes

33
Emerging
31 MuhammadYaseenKhan/Urdu-Sentiment-Corpus

Labelled Dataset for Urdu Sentiment Analysis

32
Emerging
32 anoopkunchukuttan/meteor_indic

METEOR for Indian languages (originally forked from METEOR 1.4)

31
Emerging
33 mussacharles60/swahili-dictionary

Swahili dictionary for implementing in your projects

31
Emerging
34 Sueza-project/Sueza_project

Linguistic database collection for the revitalization of Cameroonian local...

31
Emerging
35 s-bose/Walks-into-a-bar-dataset

A dataset containing 1000+ walks-into-a-bar jokes scraped from the internet.

31
Emerging
36 crux82/huric

HuRIC 2.0 - the Human Robot Interaction Corpus

31
Emerging
37 lanwuwei/Twitter-URL-Corpus

Large scale sentential paraphrases collection and annotation

30
Emerging
38 Riccorl/nlp-dataset-readers

Readers for NLP Datasets

30
Emerging
39 EthioNLP/Resource

This repository contains research papers and datasets for different NLP...

30
Emerging
40 Andrews2017/africanlp-public-datasets

A repository for publicly/freely available Natural Language Processing (NLP)...

30
Emerging
41 hrgupta/indian-scriptures

This repository contains various Indian scriptures 📜 in a structured .csv...

29
Experimental
42 UKPLab/useb

Heterogeneous, Task- and Domain-Specific Benchmark for Unsupervised Sentence...

29
Experimental
43 mrpeerat/Thai-Sentence-Vector-Benchmark

Benchmark for Thai sentence representation

29
Experimental
44 kili-technology/awesome-datasets

A comprehensive list of annotated training datasets classified by use case.

29
Experimental
45 t-systems-on-site-services-gmbh/german-elmo-model

This is a German ELMo deep contextualized word representation. It is trained...

27
Experimental
46 Hironsan/wiki-article-dataset

Wikipedia article dataset

27
Experimental
47 mapmeld/hindi-bert

Hindi NLP work

27
Experimental
48 COS301-SE-2025/Mafoko

Mafoko is a progressive web app (PWA) that provides access to multilingual...

27
Experimental
49 maxent-ai/Datasets

datasets with text data for use in NLP, Text analysis, information...

27
Experimental
50 kassemsabeh/open-brand

The dataset contains over 250k product brand-value annotations with more...

27
Experimental
51 massanishi/hackernews-post-datasets

Datasets for hackernews posts

27
Experimental
52 reem-codes/ArMATH

ArMATH: The Arabic Math Word Problem dataset. Accepted in LREC2022

27
Experimental
53 SuzanaK/language_datasets

Language Datasets for NLP, Machine Learning, and Map Creation

26
Experimental
54 hyunwoongko/nlp-datasets

Curation note of NLP datasets

26
Experimental
55 Pogayo/Luo-News-Dataset

This repo contains LUO corpus for Named Entity Recognition. The text comes...

26
Experimental
56 aalok-sathe/sentspace

a module to obtain diverse real-world-grounded features for sentences for...

26
Experimental
57 quality-attributes/datasets

Official data sources for the Quality Attributes project

25
Experimental
58 OumaimaHourrane/MA_Open_Datasets

Moroccan NLP Datasets and Corpora

23
Experimental
59 nlp-waseda/comet-atomic-ja

COMET-ATOMIC ja

23
Experimental
60 filbench/filbench-eval

Experiments and Analyses for FilBench: An Open LLM Leaderboard for Filipino...

22
Experimental
61 VLa-Labs/Danish-Language-Dataset-List

A curated metadata collection of 31 publicly available Danish language datasets.

22
Experimental
62 megagonlabs/ebe-dataset

Evidence-based Explanation Dataset (AACL-IJCNLP 2020)

22
Experimental
63 aviaefrat/cryptonite

The Official Repository of the Cryptonite Dataset

21
Experimental
64 pln-fing-udelar/humor

HUMOR dataset for humor research

21
Experimental
65 ART-Group-it/GASP

GASP! Dataset - Generating Abstracts of Scientific Papers from Abstracts of...

21
Experimental
66 mohansaidinesh/Datasets

Datasets for Machine Learning

21
Experimental
67 sniperx-19/awesome-sentence-embedding

A curated list of pretrained sentence and word embedding models

20
Experimental
68 Niger-Volta-LTI/urhobo-text

Urhobo language training text for NLP, ASR and TTS tasks

20
Experimental
69 kaisugi/datasets-for-sequential-sentence-classification

Curated list of public datasets which focus on sentence classification in...

20
Experimental
70 ICPSR/dataset-references

NER pipeline to detect dataset references for ASIST 2022 paper

19
Experimental
71 dsfsi/PuoData

Curated corpora for Setswana. Used to train PuoBERTa.

19
Experimental
72 OpenCENIA/SRN

Spanish Resources and Evaluation

19
Experimental
73 KushtrimVisoka/Kosovo-Parliament-Transcriptions

NOTE: The dataset is maintained exclusively on HuggingFace Datasets. The...

19
Experimental
74 dsfsi/project-state-capture

Zondo Commission or State Capture Commission Transcripts

19
Experimental
75 jonas-becker/pd-human-vs-machine-content

The official repository for the paper "Paraphrase Detection: Human vs....

19
Experimental
76 slvnwhrl/sigmorphon2022-models

This repository contains the models used by the CLUZH team for the...

19
Experimental
77 bluechoochoo/retired_comedy_phrases

A Casual Spreadsheets resource

19
Experimental
78 Archaeocomputers/Bessarion

A text and imaging dataset of Byzantine-era Medieval Greek inscriptions.

19
Experimental
79 createmomo/supporting-comedy-writers

Predicting Audience’s Response from Sketch Comedy and Crosstalk Scripts (A...

19
Experimental
80 mzmmoazam/kashmiri_dataset

Data and tool to fetch Kashmiri text

19
Experimental
81 NetworkTheoryAppliedResearchInstitute/anthropology-

Comprehensive AI training corpus for anthropology education: 580K tokens...

19
Experimental
82 CyberAgentAILab/AdParaphrase

This repository contains data for our paper "AdParaphrase: Paraphrase...

19
Experimental
83 radi-cho/noisy-sentences-dataset

550K sentences in 5 European languages augmented with noise for training and...

18
Experimental
84 NoelShallum/all-indian-acts

Repository containing all Indian Acts and statutes in the PDF and txt...

18
Experimental
85 rmdodhia/dataset-detection

Detects datasets used in journal papers

18
Experimental
86 dsfsi/zabantu-beta

ZaBantu is a fleet of light-weight Masked Language Models for Southern Bantu...

18
Experimental
87 metriccoders/metriccoders_datasets

This is the Metric Coders repository containing all the datasets for machine...

17
Experimental
88 felixgiov/public-meeting

Dataset from the paper "Information Extraction from Public Meeting Articles"

17
Experimental
89 jahidulzaid/BanglaNostalgia

A benchmark and training pipeline for detecting nostalgia in Bangla text....

17
Experimental
90 BrianMsane/siSwati-Datasets

Repository for siSwati NLP datasets which I have worked on in my research....

17
Experimental
91 davidwarrior22/machine-translation-for-african-languages

This repository focuses on developing machine translation and NLP tools...

14
Experimental
92 sanjanalreddy/NLP-Datasets

List of NLP Datasets

13
Experimental
93 stefan-it/gc4lm

GC4LM: A Colossal (Biased) language model for German

13
Experimental
94 mikesdatawork/ai-ml-datasets-hub

Curated collection of high-quality datasets optimized for AI/ML pipelines,...

12
Experimental
95 zyuanlim/singlish-manglish-nlp

Resources for Singlish and Manglish NLP.

12
Experimental
96 Mufassir-Chowdhury/BnPC

This is the official repository of the paper titled "BnPC: A Gold Standard...

11
Experimental
97 MusfiqDehan/bn-en-aligner

Tool to easily align Bangla and English words from sentences

11
Experimental
98 Unipisa/admin-It

Dataset for automatic readability assessment and text simplification of...

11
Experimental
99 cvjena/chiasmus-annotations

German Chiasmus Dataset

11
Experimental
100 Aman-byte1/amharic-conversation-and-math-dataset

Amharic word questions with numeric answers, and conversation exchanges in English and Amharic...

10
Experimental
101 shercostiniano/filipino-stoytelling-ner

Open-source repository for our paper in Thesis 1

10
Experimental
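The tier labels in the list appear to follow score bands. The sketch below maps a score to a tier using cutoffs inferred from the entries above, not from any published rule: the page states that Verified means a score above 70, the lowest Emerging score shown is 30, and the highest Experimental score shown is 29. The Established cutoff of 50 is an assumption, since the listing only shows that it falls somewhere between 48 and 53. The names `TIER_FLOORS` and `tier_for` are illustrative.

```python
# Score floors inferred from the listing; these are assumptions, not
# documented thresholds. The Established floor (50) in particular is a
# guess: the data only constrains it to lie between 48 and 53.
TIER_FLOORS = [
    (70, "Verified"),
    (50, "Established"),
    (30, "Emerging"),
]


def tier_for(score):
    """Return the tier name for a 0-100 score under the inferred floors."""
    for floor, name in TIER_FLOORS:
        if score >= floor:
            return name
    return "Experimental"
```

Under these assumed floors, `tier_for(76)` gives "Verified" and `tier_for(29)` gives "Experimental", which matches the corresponding entries in the list.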