kili-technology/awesome-datasets
A comprehensive list of annotated training datasets classified by use case.
This resource provides a curated list of high-quality, pre-annotated datasets for real-world AI applications. It helps data scientists and AI practitioners find suitable data for tasks like speech recognition, document processing (e.g., classifying invoices, extracting information from contracts), and image analysis (e.g., medical image segmentation). Datasets are grouped by use case, so you can look up your problem area and find the relevant entries, often with previews and links to the data.
No commits in the last 6 months.
Use this if you are an AI practitioner, data scientist, or researcher looking for readily available, annotated datasets to train or evaluate machine learning models for document processing, speech recognition, or image analysis.
Not ideal if you need to create custom annotations for your own unique data, or if you are looking for general-purpose, unannotated raw data.
Stars: 38
Forks: 6
Language: not listed
License: not listed
Category: not listed
Last pushed: Jul 08, 2022
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/kili-technology/awesome-datasets"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
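The same request can be made from a script. The following is a minimal Python sketch using only the standard library, not an official client. It assumes the endpoint returns JSON, which this page does not confirm. The names `quality_url` and `fetch_quality` and the parameter name `domain` (for the `nlp` path segment) are hypothetical and chosen for illustration. How an API key is passed is not documented here, so the sketch makes only keyless requests.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(domain: str, repo: str) -> str:
    """Build the endpoint URL for a repository given as 'owner/name'.

    'domain' is a hypothetical name for the path segment that is 'nlp'
    in the curl example.
    """
    owner, name = repo.split("/", 1)
    parts = [quote(part, safe="") for part in (domain, owner, name)]
    return f"{BASE_URL}/{'/'.join(parts)}"


def fetch_quality(domain: str, repo: str, timeout: float = 10.0) -> dict:
    """Fetch the record for one repository.

    Assumption: the response body is JSON. If it is not, json.loads
    raises a ValueError.
    """
    request = urllib.request.Request(
        quality_url(domain, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Keyless access is limited to 100 requests/day, so avoid tight loops.
    record = fetch_quality("nlp", "kili-technology/awesome-datasets")
    print(json.dumps(record, indent=2))
```

Keeping URL construction in its own function makes it easy to check the request target without touching the network or spending any of the daily quota.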
Higher-rated alternatives
acl-org/acl-anthology
Data and software for building the ACL Anthology.
anoopkunchukuttan/indic_nlp_library
Resources and tools for Indian language Natural Language Processing
CLUEbenchmark/CLUECorpus2020
Large-scale pre-training corpus for Chinese (100 GB)
KennethEnevoldsen/scandinavian-embedding-benchmark
A Scandinavian Benchmark for sentence embeddings
Separius/awesome-sentence-embedding
A curated list of pretrained sentence and word embedding models