Open Dataset Collections ML Frameworks
Curated repositories and directories that aggregate, catalog, or provide access to multiple datasets across various domains. Does NOT include individual datasets, dataset generation tools, or domain-specific dataset papers.
This category tracks 70 open dataset collection projects. Two score above 70, which places them in the Verified tier. The highest-rated is open-edge-platform/datumaro at 80/100 with 661 stars. Three of the top 10 are actively maintained.
Get the projects as JSON (the example request below sets `limit=20`, so it asks for up to 20 of the 70):
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=open-dataset-collections&limit=20"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
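The same request can be built from a script. The sketch below is a minimal Python example that reproduces the query string from the curl command and tallies projects per tier. The response schema is not documented on this page, so the `tier` key used in `count_tiers` is an assumption, as is the idea that the parsed JSON yields a list of project records. Whether the endpoint accepts a `limit` above 20 is also not confirmed here.

```python
import json
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint and query parameters copied from the curl example.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit=20):
    """Return the request URL with the same parameters as the curl example."""
    params = {
        "domain": "ml-frameworks",
        "subcategory": "open-dataset-collections",
        "limit": limit,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch(limit=20, timeout=10):
    """Fetch and parse the JSON response (its shape is not documented here)."""
    with urlopen(build_url(limit), timeout=timeout) as resp:
        return json.load(resp)


def count_tiers(projects):
    """Tally projects per tier.

    Assumes each project is a dict with a 'tier' key (hypothetical field name).
    Records without that key are counted as 'unknown'.
    """
    return Counter(p.get("tier", "unknown") for p in projects)


if __name__ == "__main__":
    print(build_url(limit=20))
```

With the keyless quota of 100 requests/day, it is worth caching the parsed response locally instead of calling `fetch` repeatedly.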
| # | Framework | Score | Tier |
|---|---|---|---|
| 1 | open-edge-platform/datumaro: Dataset Management Framework, a Python library and a CLI tool to build,... | 80 | Verified |
| 2 | explosion/ml-datasets: 🌊 Machine learning dataset loaders for testing and example scripts | | Verified |
| 3 | webdataset/webdataset: A high-performance Python-based I/O system for large (and small) deep... | | Established |
| 4 | tensorflow/datasets: TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... | | Established |
| 5 | mlcommons/croissant: Croissant is a high-level format for machine learning datasets that brings... | | Established |
| 6 | alan-turing-institute/CleverCSV: CleverCSV is a Python package for handling messy CSV files. It provides a... | | Established |
| 7 | JovianHQ/opendatasets: A Python library for downloading datasets from Kaggle, Google Drive, and... | | Established |
| 8 | benedekrozemberczki/datasets: A repository of pretty cool datasets that I collected for network science... | | Established |
| 9 | src-d/datasets: source{d} datasets ("big code") for source code analysis and machine... | | Emerging |
| 10 | opengeos/aws-open-data: A list of open datasets on AWS | | Emerging |
| 11 | foorilla/ai-jobs-net-salaries: A dataset of global salaries in AI/ML and Big Data. | | Emerging |
| 12 | packing-box/python-dsff: DataSet File Format (DSFF) | | Emerging |
| 13 | Pinak-Datta/wiz-craft: A CLI-based dataset preprocessing tool for machine learning tasks. Features... | | Emerging |
| 14 | jbrownlee/Datasets: Machine learning datasets used in tutorials on MachineLearningMastery.com | | Emerging |
| 15 | CYang828/datasetstation: Quickly download Chinese datasets, process datasets, run data analysis and visual analysis; a one-stop solution for data problems | | Emerging |
| 16 | cleanlab/label-errors: 🛠️ Corrected Test Sets for ImageNet, MNIST, CIFAR, Caltech-256, QuickDraw,... | | Emerging |
| 17 | osdg-ai/osdg-data: The OSDG Community Dataset (OSDG-CD) is a public dataset of thousands of... | | Emerging |
| 18 | samz5320/Data4ALL: A spot for all the datasets you need. | | Emerging |
| 19 | BaumSebastian/DDACS: Python interface for the DDACS dataset: 32K+ deep drawing simulations with... | | Emerging |
| 20 | unsplash/datasets: 🎁 6,500,000+ Unsplash images made available for research and machine learning | | Emerging |
| 21 | SigmaJahan/Textual-Dissimilarity-Analysis-for-Duplicate-Bug-Report-Detection: We conduct a large-scale empirical study to understand better the impacts of... | | Emerging |
| 22 | Vatshayan/Data-sets: Different Data-set on various Important topic on Real-world Problems | | Emerging |
| 23 | Intelligent-CAT-Lab/SEER: Artifact repository for the paper "Perfect Is the Enemy of Test Oracle", In... | | Emerging |
| 24 | shreyashankar/datasets-for-good: List of datasets to apply stats/machine learning/technology to the world of... | | Emerging |
| 25 | yongfanbeta/Open-Access-Medical-Data: A list of Open-Access-Medical-Data(OAMD) commonly used in medical research | | Emerging |
| 26 | seart-group/DL4SE: Building Training Datasets for Deep Learning Models in Software Engineering... | | Emerging |
| 27 | AdaptInfer/CompBioDatasetsForMachineLearning: A Curated List of Computational Biology Datasets Suitable for Machine Learning | | Emerging |
| 28 | fossology/Minerva-Dataset-Generation: Validated dataset generation using regex along with NLP Algorithms. | | Emerging |
| 29 | simula/datasets.simula.no: Public datasets published by Simula. | | Emerging |
| 30 | modelset/modelset-dataset: ModelSet is a labelled dataset of Ecore and UML models | | Emerging |
| 31 | anwielts/datasheet-for-dataset: Automatically create standardized documentation for the dataset used in your... | | Emerging |
| 32 | DagsHub/3D-model-datasets: Open-source 3D Model datasets | | Emerging |
| 33 | asampat3090/open-datasets: Running list of Open Datasets | | Emerging |
| 34 | DagsHub/open-source-ml-datasets: This repository holds open source datasets for various machine learning... | | Emerging |
| 35 | salesforce/iSEA: Official code repository for "iSEA: An Interactive Pipeline of Semantic... | | Experimental |
| 36 | AhmedBella/World-Dataset-Library: A Django/React website for sharing datasets - It seems that we got beaten to... | | Experimental |
| 37 | skforecast/skforecast-datasets: This repository contains datasets used in the skforecast library. It also... | | Experimental |
| 38 | autonlab/aqua: AQuA: A Benchmarking Tool for Label Quality Assessment, NeurIPS'23 D&B | | Experimental |
| 39 | QQQHY/Medical-Datasets-for-Machine-Learning: Medical Datasets for Machine Learning | | Experimental |
| 40 | MainakVerse/Datasets: List of ready to use datasets for your projects | | Experimental |
| 41 | incubrain/awesome-maharashtra-data: A collection of datasets specific to Maharashtra, India. WIP | | Experimental |
| 42 | ZamAI-ORG/training-spaces: Reusable training and experiment spaces for ZamAI Labs (templates, scripts,... | | Experimental |
| 43 | ZamAI-ORG/pashto-datasets: Curated and processed Pashto datasets for ZamAI Labs (with source... | | Experimental |
| 44 | ZamAI-ORG/mt5-pashto: Pashto-focused work with mT5 (experiments, fine-tuning, references) in ZamAI Labs. | | Experimental |
| 45 | mdrmdmau/datasets: 📊 Gather and share open datasets for the Indonesian physics community,... | | Experimental |
| 46 | samuelmcnair33/Samuel-McNair-Dataset: Personal dataset released under CC0 license | | Experimental |
| 47 | aoerecinfo/aoe2dataset: Age of Empires II Definitive Edition rec analysis dataset | | Experimental |
| 48 | lucien1011/MoonBoard-Route: Dataset for Moonboard routes with 2016 and 2017 setting, scraped in 2018. | | Experimental |
| 49 | mlnjsh/EIF-Training-Datasets: 📁 Curated Training Datasets - Clean, labeled datasets for ML/AI coursework... | | Experimental |
| 50 | ZamAI-ORG/zamai-models: Model artifacts and experiments published by ZamAI Labs (training results... | | Experimental |
| 51 | ZamAI-ORG/labs: ZamAI Labs - datasets, Pashto processing, models, and training pipelines... | | Experimental |
| 52 | coneco-lab/open-lab-toolkit: A collection of CoN&Co Lab's main software tools | | Experimental |
| 53 | lkkhwhb/TrainingData: This repository stores and organizes training data folders for machine... | | Experimental |
| 54 | rrsmart8/Product-Deduplication: Machine Learning in Data Preprocessing & Deduplication | | Experimental |
| 55 | muhammadibrahim313/kaggle-databank: Collection of curated datasets from various APIs and sources for data... | | Experimental |
| 56 | komal11lamba/dataset_komal: my dataset collection | | Experimental |
| 57 | vivesweb/csv_pair_file: Manage csv pair files for Machine Learning | | Experimental |
| 58 | serval-uni-lu/The_dataset_of_large_case_studies_on_mutants_similarity_with_bugs: The dataset of large case studies on mutants similarity, measured both... | | Experimental |
| 59 | Knodl-LLC/KnoDL-Match: Service for automatic matching two data sets without mapping | | Experimental |
| 60 | sarfraspc/Famous-Regression-Datasets: A curated collection of famous and widely used datasets for regression... | | Experimental |
| 61 | JeremGamingYT/TrainAIDatasets: This is an AI dataset project with over 10,000-100,000 pieces of data! | | Experimental |
| 62 | GDIAMEL/DATASETS: Unemployment rate datasets to be used in a project | | Experimental |
| 63 | CrispenGari/datasets: 📅 This repository contains some datasets for machine learning task. | | Experimental |
| 64 | TheProjectsGuy/DataLines: A project for loading and manipulating datasets with minimal effort | | Experimental |
| 65 | droyed/datatools: Data preparation tools for deep-learning | | Experimental |
| 66 | ZhengLinLei/ZDMP-datasets: Internal datasets for training ZDMP project. | | Experimental |
| 67 | akaakselabhijeet/Data-Science-Datasets-Ver.01: Datasets for data science noobs. | | Experimental |
| 68 | ThalesGroup/oss-datasets: Regroups all Thales Open Source datasets | | Experimental |
| 69 | Csengupta1101/Datasets: This Repository will contain all locally stored datasets in my system | | Experimental |
| 70 | PrAsAnNaRePo/models: get free classifiers / models | | Experimental |