Open Dataset Collections ML Frameworks

Curated repositories and directories that aggregate, catalog, or provide access to multiple datasets across various domains. Does NOT include individual datasets, dataset generation tools, or domain-specific dataset papers.

70 open dataset collection frameworks are tracked. Two score above 70 (Verified tier). The highest-rated is open-edge-platform/datumaro at 80/100 with 661 stars. Three of the top 10 are actively maintained.

Get the projects as JSON (the example request below sets limit=20, out of 70 tracked)

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=open-dataset-collections&limit=20"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
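The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_url` and `fetch_projects` are hypothetical, and the shape of the JSON response is not documented on this page, so the code returns the parsed payload as-is instead of assuming any field names. Whether a larger `limit` returns all 70 projects is also an assumption that the page does not confirm.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint and query parameters are copied from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="ml-frameworks",
              subcategory="open-dataset-collections",
              limit=20):
    """Build the request URL (hypothetical helper, not part of any official client)."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(limit=20, timeout=30):
    """Fetch and parse the JSON payload.

    The response structure is an unknown here, so the parsed object is
    returned unchanged for the caller to inspect.
    """
    with urlopen(build_url(limit=limit), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_projects()` performs one request, which counts against the 100 requests/day keyless limit stated above. Inspecting the returned object (for example with `type(...)` or `json.dumps(..., indent=2)`) is a reasonable first step before relying on any particular field.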

| # | Framework | Description | Score | Tier |
|---|-----------|-------------|-------|------|
| 1 | open-edge-platform/datumaro | Dataset Management Framework, a Python library and a CLI tool to build,... | 80 | Verified |
| 2 | explosion/ml-datasets | 🌊 Machine learning dataset loaders for testing and example scripts | 74 | Verified |
| 3 | webdataset/webdataset | A high-performance Python-based I/O system for large (and small) deep... | 69 | Established |
| 4 | tensorflow/datasets | TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... | 64 | Established |
| 5 | mlcommons/croissant | Croissant is a high-level format for machine learning datasets that brings... | 59 | Established |
| 6 | alan-turing-institute/CleverCSV | CleverCSV is a Python package for handling messy CSV files. It provides a... | 57 | Established |
| 7 | JovianHQ/opendatasets | A Python library for downloading datasets from Kaggle, Google Drive, and... | 57 | Established |
| 8 | benedekrozemberczki/datasets | A repository of pretty cool datasets that I collected for network science... | 51 | Established |
| 9 | src-d/datasets | source{d} datasets ("big code") for source code analysis and machine... | 49 | Emerging |
| 10 | opengeos/aws-open-data | A list of open datasets on AWS | 49 | Emerging |
| 11 | foorilla/ai-jobs-net-salaries | A dataset of global salaries in AI/ML and Big Data. | 48 | Emerging |
| 12 | packing-box/python-dsff | DataSet File Format (DSFF) | 47 | Emerging |
| 13 | Pinak-Datta/wiz-craft | A CLI-based dataset preprocessing tool for machine learning tasks. Features... | 46 | Emerging |
| 14 | jbrownlee/Datasets | Machine learning datasets used in tutorials on MachineLearningMastery.com | 43 | Emerging |
| 15 | CYang828/datasetstation | Quickly download Chinese datasets, process datasets, run data analysis and visual analysis: a one-stop solution for data problems | 42 | Emerging |
| 16 | cleanlab/label-errors | 🛠️ Corrected Test Sets for ImageNet, MNIST, CIFAR, Caltech-256, QuickDraw,... | 42 | Emerging |
| 17 | osdg-ai/osdg-data | The OSDG Community Dataset (OSDG-CD) is a public dataset of thousands of... | 40 | Emerging |
| 18 | samz5320/Data4ALL | A spot for all the datasets you need. | 38 | Emerging |
| 19 | BaumSebastian/DDACS | Python interface for the DDACS dataset: 32K+ deep drawing simulations with... | 38 | Emerging |
| 20 | unsplash/datasets | 🎁 6,500,000+ Unsplash images made available for research and machine learning | 38 | Emerging |
| 21 | SigmaJahan/Textual-Dissimilarity-Analysis-for-Duplicate-Bug-Report-Detection | We conduct a large-scale empirical study to understand better the impacts of... | 37 | Emerging |
| 22 | Vatshayan/Data-sets | Different Data-set on various Important topic on Real-world Problems | 37 | Emerging |
| 23 | Intelligent-CAT-Lab/SEER | Artifact repository for the paper "Perfect Is the Enemy of Test Oracle", In... | 36 | Emerging |
| 24 | shreyashankar/datasets-for-good | List of datasets to apply stats/machine learning/technology to the world of... | 36 | Emerging |
| 25 | yongfanbeta/Open-Access-Medical-Data | A list of Open-Access-Medical-Data(OAMD) commonly used in medical research | 36 | Emerging |
| 26 | seart-group/DL4SE | Building Training Datasets for Deep Learning Models in Software Engineering... | 36 | Emerging |
| 27 | AdaptInfer/CompBioDatasetsForMachineLearning | A Curated List of Computational Biology Datasets Suitable for Machine Learning | 35 | Emerging |
| 28 | fossology/Minerva-Dataset-Generation | Validated dataset generation using regex along with NLP Algorithms. | 35 | Emerging |
| 29 | simula/datasets.simula.no | Public datasets published by Simula. | 35 | Emerging |
| 30 | modelset/modelset-dataset | ModelSet is a labelled dataset of Ecore and UML models | 33 | Emerging |
| 31 | anwielts/datasheet-for-dataset | Automatically create standardized documentation for the dataset used in your... | 32 | Emerging |
| 32 | DagsHub/3D-model-datasets | Open-source 3D Model datasets | 31 | Emerging |
| 33 | asampat3090/open-datasets | Running list of Open Datasets | 30 | Emerging |
| 34 | DagsHub/open-source-ml-datasets | This repository holds open source datasets for various machine learning... | 30 | Emerging |
| 35 | salesforce/iSEA | Official code repository for "iSEA: An Interactive Pipeline of Semantic... | 29 | Experimental |
| 36 | AhmedBella/World-Dataset-Library | A Django/React website for sharing datasets - It seems that we got beaten to... | 29 | Experimental |
| 37 | skforecast/skforecast-datasets | This repository contains datasets used in the skforecast library. It also... | 27 | Experimental |
| 38 | autonlab/aqua | AQuA: A Benchmarking Tool for Label Quality Assessment, NeurIPS'23 D&B | 26 | Experimental |
| 39 | QQQHY/Medical-Datasets-for-Machine-Learning | Medical Datasets for Machine Learning (machine learning medical data) | 25 | Experimental |
| 40 | MainakVerse/Datasets | List of ready to use datasets for your projects | 24 | Experimental |
| 41 | incubrain/awesome-maharashtra-data | A collection of datasets specific to Maharashtra, India. WIP | 22 | Experimental |
| 42 | ZamAI-ORG/training-spaces | Reusable training and experiment spaces for ZamAI Labs (templates, scripts,... | 22 | Experimental |
| 43 | ZamAI-ORG/pashto-datasets | Curated and processed Pashto datasets for ZamAI Labs (with source... | 22 | Experimental |
| 44 | ZamAI-ORG/mt5-pashto | Pashto-focused work with mT5 (experiments, fine-tuning, references) in ZamAI Labs. | 22 | Experimental |
| 45 | mdrmdmau/datasets | 📊 Gather and share open datasets for the Indonesian physics community,... | 22 | Experimental |
| 46 | samuelmcnair33/Samuel-McNair-Dataset | Personal dataset released under CC0 license | 22 | Experimental |
| 47 | aoerecinfo/aoe2dataset | Age of Empires II Definitive Edition rec analysis dataset | 20 | Experimental |
| 48 | lucien1011/MoonBoard-Route | Dataset for Moonboard routes with 2016 and 2017 setting, scraped in 2018. | 19 | Experimental |
| 49 | mlnjsh/EIF-Training-Datasets | 📁 Curated Training Datasets: Clean, labeled datasets for ML/AI coursework... | 14 | Experimental |
| 50 | ZamAI-ORG/zamai-models | Model artifacts and experiments published by ZamAI Labs (training results... | 14 | Experimental |
| 51 | ZamAI-ORG/labs | ZamAI Labs: datasets, Pashto processing, models, and training pipelines... | 14 | Experimental |
| 52 | coneco-lab/open-lab-toolkit | A collection of CoN&Co Lab's main software tools | 13 | Experimental |
| 53 | lkkhwhb/TrainingData | This repository stores and organizes training data folders for machine... | 13 | Experimental |
| 54 | rrsmart8/Product-Deduplication | Machine Learning in Data Preprocessing & Deduplication | 12 | Experimental |
| 55 | muhammadibrahim313/kaggle-databank | Collection of curated datasets from various APIs and sources for data... | 12 | Experimental |
| 56 | komal11lamba/dataset_komal | my dataset collection | 12 | Experimental |
| 57 | vivesweb/csv_pair_file | Manage csv pair files for Machine Learning | 12 | Experimental |
| 58 | serval-uni-lu/The_dataset_of_large_case_studies_on_mutants_similarity_with_bugs | The dataset of large case studies on mutants similarity, measured both... | 12 | Experimental |
| 59 | Knodl-LLC/KnoDL-Match | Service for automatic matching two data sets without mapping | 12 | Experimental |
| 60 | sarfraspc/Famous-Regression-Datasets | A curated collection of famous and widely used datasets for regression... | 12 | Experimental |
| 61 | JeremGamingYT/TrainAIDatasets | This is an AI dataset project with over 10,000-100,000 pieces of data! | 12 | Experimental |
| 62 | GDIAMEL/DATASETS | Unemployment rate datasets to be used in a project | 11 | Experimental |
| 63 | CrispenGari/datasets | 📅 This repository contains some datasets for machine learning task. | 11 | Experimental |
| 64 | TheProjectsGuy/DataLines | A project for loading and manipulating datasets with minimal effort | 11 | Experimental |
| 65 | droyed/datatools | Data preparation tools for deep-learning | 11 | Experimental |
| 66 | ZhengLinLei/ZDMP-datasets | Internal datasets for training ZDMP project. | 11 | Experimental |
| 67 | akaakselabhijeet/Data-Science-Datasets-Ver.01 | Datasets for data science noobs. | 11 | Experimental |
| 68 | ThalesGroup/oss-datasets | Regroups all Thales Open Source datasets | 11 | Experimental |
| 69 | Csengupta1101/Datasets | This Repository will contain all locally stored datasets in my system | 11 | Experimental |
| 70 | PrAsAnNaRePo/models | get free classifiers / models | 10 | Experimental |
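
The scores and tiers in the listing are consistent with fixed score cutoffs. The sketch below maps a score to a tier under assumed cutoffs of 70, 50, and 30. Only the Verified boundary ("above 70") is stated on this page; the other two are inferred from the listed rows (51 is Established while 49 is Emerging; 30 is Emerging while 29 is Experimental), and how a score of exactly 70 or 50 is treated is unknown. The names `ASSUMED_CUTOFFS` and `tier_for` are hypothetical.

```python
# Assumed (minimum score, tier) pairs, highest first.
# Inferred from the listing; not an official scoring rule.
ASSUMED_CUTOFFS = [
    (70, "Verified"),
    (50, "Established"),
    (30, "Emerging"),
]


def tier_for(score):
    """Return the tier name for a 0-100 score under the assumed cutoffs."""
    for minimum, tier in ASSUMED_CUTOFFS:
        if score >= minimum:
            return tier
    # Anything below the lowest cutoff falls into the last tier.
    return "Experimental"
```

Under these assumptions, `tier_for(69)` gives "Established" and `tier_for(29)` gives "Experimental", matching rows 3 and 35 of the listing.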