hyunwoongko/nlp-datasets

Curation note of NLP datasets

Experimental

This project helps data scientists and NLP researchers find appropriate text datasets for training machine translation and question-answering models. It provides a curated list of publicly available datasets, specifying their languages, types (e.g., multi-lingual, bi-lingual), and estimated volume. The list serves as a practical guide for quickly identifying and accessing relevant text data for these NLP tasks.

No commits in the last 6 months.

Use this if you need to find diverse, multi-lingual, or bi-lingual text datasets for developing machine translation systems or training models to answer questions from text.

Not ideal if you are looking for datasets beyond machine translation or question-answering, or if you need a tool to directly process and prepare the data rather than just discover it.

natural-language-processing machine-translation question-answering text-data-curation ai-dataset-discovery

No License, Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 9 / 25

Maturity 8 / 25

Community 9 / 25

Higher-rated alternatives

acl-org/acl-anthology

Data and software for building the ACL Anthology.

anoopkunchukuttan/indic_nlp_library

Resources and tools for Indian language Natural Language Processing

CLUEbenchmark/CLUECorpus2020

Large-scale Pre-training Corpus for Chinese 100G 中文预训练语料

KennethEnevoldsen/scandinavian-embedding-benchmark

A Scandinavian Benchmark for sentence embeddings

Separius/awesome-sentence-embedding

A curated list of pretrained sentence and word embedding models
