Data Quality Preprocessing ML Frameworks

Tools and techniques for assessing, cleaning, and preparing datasets for machine learning. Includes data validation, outlier detection, missing value handling, and dataset quality frameworks. Does NOT include domain-specific cleaning (e.g., text-only or image-only), general data science tutorials without code frameworks, or downstream ML modeling tasks.

This category tracks 100 data quality preprocessing frameworks. Three score above 70 (Verified tier). The highest-rated is skrub-data/skrub at 78/100 with 1,578 stars. Five of the top 10 are actively maintained.

Get all 100 projects as JSON (the example request below sets limit=20)

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=data-quality-preprocessing&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
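The same request can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the helper names `build_url` and `fetch_projects` are hypothetical, the query parameters are copied from the curl example above, and the shape of the JSON response is not documented here, so the code returns whatever the server sends without assuming any fields.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Endpoint taken from the curl example; everything else is illustrative.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="ml-frameworks",
              subcategory="data-quality-preprocessing",
              limit=20):
    """Build the query URL with the same parameters as the curl example."""
    query = urlencode({
        "domain": domain,
        "subcategory": subcategory,
        "limit": limit,
    })
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10):
    """Fetch the URL and decode the body as JSON.

    The response schema is an unknown here, so the parsed value is
    returned as-is (it may be a list or a dict).
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example usage (performs a network request, so it is left commented out):
# projects = fetch_projects(build_url())
```

Keeping URL construction separate from the network call makes it easy to check the request locally and to stay within the 100 requests/day limit while experimenting.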

# Framework Score Tier
1 skrub-data/skrub

Machine learning with dataframes

78
Verified
2 biolab/orange3

🍊 📊 💡 Orange: Interactive data analysis

77
Verified
3 root-project/root

The official repository for ROOT: analyzing, storing and visualizing big...

73
Verified
4 cleanlab/cleanlab

Cleanlab's open-source library is the standard data-centric AI package for...

66
Established
5 drivendataorg/deon

A command line tool to easily add an ethics checklist to your data science projects.

66
Established
6 deepnote/deepnote

Deepnote is a drop-in replacement for Jupyter with an AI-first design, sleek...

60
Established
7 rhiever/datacleaner

A Python tool that automatically cleans data sets and readies them for analysis.

60
Established
8 Renumics/spotlight

Interactively explore unstructured datasets from your dataframe.

59
Established
9 JasonKessler/scattertext

Beautiful visualizations of how language differs among document types.

58
Established
10 fbdesignpro/sweetviz

Visualize and compare datasets, target values and associations, with one...

58
Established
11 Data-Centric-AI-Community/ydata-quality

Data Quality assessment with one line of code

54
Established
12 bodo-ai/PyDough

Analytics DSL for Python

54
Established
13 MPEDS/mpeds

Machine-learning Protest Event Data System

53
Established
14 IRT-SystemX/dqm-ml

A library to compute data quality metrics

53
Established
15 COM6012/ScalableML

COM6012 Scalable Machine Learning - University of Sheffield. Enjoy our...

53
Established
16 ShimantoRahman/empulse

Value-driven and cost-sensitive analysis for scikit-learn

52
Established
17 deepnote/deepnote-toolkit

Essential Python toolkit for Deepnote environments

52
Established
18 altermarkive/shrubbery

Numerai Experiments

51
Established
19 AutoViML/pandas_dq

Find data quality issues and clean your data in a single line of code with a...

50
Established
20 msamogh/nonechucks

Deal with bad samples in your dataset dynamically, use Transforms as...

49
Emerging
21 buabaj/xplore

A Python package built for data scientists/analysts and AI/ML engineers for...

48
Emerging
22 Olow304/Data-Science-Machine-Learning

The overall objective of this toolkit is to provide and offer a free...

47
Emerging
23 scienxlab/redflag

Safety net for machine learning pipelines. Plays nice with sklearn and pandas.

47
Emerging
24 gretl-project/gretl

Official mirror of the actively maintained repo on sourceforge

47
Emerging
25 JacksonBurns/astartes

Better Data Splits for Machine Learning

47
Emerging
26 SERG-Delft/dslinter

`dslinter` is a pylint plugin for linting data science and machine learning...

47
Emerging
27 PAIR-code/facets

Visualizations for machine learning datasets

47
Emerging
28 ml-tooling/ml-workspace

🛠 All-in-one web-based IDE specialized for machine learning and data science.

47
Emerging
29 matthewfeickert-talks/reproducible-ml-for-scientists-with-pixi-scipy-2025

SciPy 2025 tutorial on "Reproducible Machine Learning Workflows for...

46
Emerging
30 fusion-jena/MLProvLab

Provenance Management for Data Science Notebooks

44
Emerging
31 Livingston-k/cleanPyData

cleanPyData is a Python package for data cleaning and preprocessing. It...

43
Emerging
32 PKNU-PR-ML-Lab/orange

Learn Machine Learning and Data Analysis Easily with Orange (Orange3)

43
Emerging
33 France-Travail/gabarit

Gabarit: kickstart your data science project from scratch

42
Emerging
34 HelikarLab/candis

🎀 A data mining suite for gene expression data.

42
Emerging
35 microsoft/Data-Discovery-Toolkit

A data discovery and manipulation toolset for unstructured data

42
Emerging
36 cssr-tools/ML_near_well

Runfiles for an ML near-well model and to reproduce results from the article...

42
Emerging
37 pierpaolo28/Data-Visualization

Collection of interactive Jupyter Notebook widgets and graphs.

42
Emerging
38 genular/pandora

PANDORA 💻

42
Emerging
39 Safe-DS/Stub-Generator

Automated generation of Safe-DS stubs for Python libraries.

41
Emerging
40 Digital-Dermatology/SelfClean

[NeurIPS 2024] 🧼🔎 A holistic self-supervised data cleaning strategy to...

40
Emerging
41 HazyResearch/meerkat

Explore and understand your training and validation data.

40
Emerging
42 Renumics/sliceguard

A library for detecting problematic data segments in structured and...

40
Emerging
43 cdr-book/cdr-book.github.io

Repository for the website of the book (github hosting support)

40
Emerging
44 khuyentran1401/reproducible-data-science

Tutorials on creating a reproducible and maintainable data science project

39
Emerging
45 ThomasWong2022/numerai-benchmark

Python Code used in publications, for archival purposes only

39
Emerging
46 pmaji/data-science-toolkit

Collection of stats, modeling, and data science tools in Python and R.

38
Emerging
47 seedatnabeel/Data-IQ

Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular...

38
Emerging
48 dirty-data-science/python

Tutorial material on machine learning with dirty data in Python

38
Emerging
49 BinaryResearch/centrifuge-toolkit

Tool for visualizing and empirically analyzing information encoded in binary files

38
Emerging
50 synapticore-io/marimo-flow

Interactive ML notebooks with reactive updates, AI assistance, and MLflow tracking

37
Emerging
51 CharlesAverill/satyrn

A Notebook alternative that supports branching code and local collaboration.

37
Emerging
52 councilofelders/numereval

A small library to locally calculate the scores on numer.ai tournament's...

35
Emerging
53 seedatnabeel/Data-SUITE

Data-SUITE: Data-centric identification of in-distribution incongruous...

35
Emerging
54 fuseml/examples

A collection of machine learning projects serving as sample applications...

34
Emerging
55 gianlucatruda/numerai

Quant. trading with ML on Numerai

34
Emerging
56 awojinrin/ML-Workflow-for-the-Determination-of-Hole-Cleaning-Conditions

A repo containing Jupyter notebooks where ensemble algorithms are...

34
Emerging
57 sumanthprabhu/DQC-Toolkit

Quality Checks for Training Data in Machine Learning

33
Emerging
58 adipolak/scaling-machine-learning-course

"Scaling Machine Learning in Three Weeks" course, in collaboration with...

33
Emerging
59 ahmedshahriar/PulsePoint-Data-Analytics

EDA, data processing, cleaning and extensive geospatial analysis on a...

33
Emerging
60 numerai/signals-example-scripts

The official example scripts for the Numerai Signals Data Science Tournament

33
Emerging
61 NatanMish/data_validation

Tutorial for implementing data validation in data science pipelines

32
Emerging
62 sultanul-ovi/GPU-Cluster-Spot-Resource-Dataset-Analysis

Detailed Analysis Traces for AI jobs leveraging spot GPU resources

32
Emerging
63 s-kav/ds_tools

Library consisting of additional & helpful functions for data science research stages

30
Emerging
64 sbettid/GPSClean

An application to correct a GPS trace using machine learning techniques. To...

30
Emerging
65 galafis/awesome-data-science-toolkit

🚀 Comprehensive toolkit for data scientists with Python utilities, ML...

29
Experimental
66 KaziAmitHasan/data-inspector

Data Inspector is an open-source Python library that brings 15+ types of...

29
Experimental
67 iterative/example-gto

Get Started GTO Project

29
Experimental
68 ELHoussineT/AutoDataCleaner

Simple and automatic data cleaning in one line of code! It performs one-hot...

29
Experimental
69 sarwarbeing-ai/Scaler

Scaler: Study Materials for Data Science and Machine Learning

29
Experimental
70 LEL-A/GerAlpacaDataCleaned

German Alpaca Dataset (Cleaned + Translated)

27
Experimental
71 pawlyk/dsml-tools

Set of Data Science and Machine Learning tools

27
Experimental
72 chiphuyen/metaflow-transformers-tutorials

Metaflow tutorials for ODSC West 2021

27
Experimental
73 sturlese/numerai_signals_pipeline

Downloads data from Yahoo Finance, generates features, trains a model and...

27
Experimental
74 Diogolsn10/statistical-analysis

Provide well-documented statistical analysis tools in Python, R, and Stata...

26
Experimental
75 Vis4Sense/ml-prov-binder

The code for running our Jupyter Lab extension on https://mybinder.org/

25
Experimental
76 berkaygediz/SolidSheets

📊 A modern spreadsheet editor with ML integration, supporting real-time...

24
Experimental
77 akashmi/ai-data-engineering-ecosystem-guide

A comprehensive reference guide mapping the entire AI, Machine Learning,...

23
Experimental
78 NimoKwarkye/stats_tool_repo

XploreML is a node-based application built with dearpygui. This application...

22
Experimental
79 AliAmini93/Data-Distribution-Finder

Developed a Windows-based app for analyzing data distributions and...

22
Experimental
80 yuliu625/Yu-Data-Science-Toolkit

A modular data science toolkit for scientific research, featuring...

22
Experimental
81 virbahu/dmaic-toolkit

Lean Six Sigma DMAIC toolkit statistical tests

22
Experimental
82 RezaMoammadi/Book-Data-Science-R

If you're eager to explore data science, data analysis, and machine...

21
Experimental
83 SakuraPuare/AlibabaTrace

Design and implementation of an analysis and visualization system for the Alibaba cluster dataset cluster-trace-v2018

21
Experimental
84 kjd-dktech/ml-data-analysis-pipeline

Exploratory Data Analysis and Modeling (Academic Setting)

21
Experimental
85 sultanul-ovi/Alibaba-GPU-Cluster-Dataset-2025-Analysis

Detailed Analysis Traces for GPU-Disaggregated Deep Learning Recommendation Models

20
Experimental
86 rimonim/ds4psych

Data Science for Psychology: Natural Language

19
Experimental
87 fhswf/paper-mlwa-mlpro-2.0

Paper ScienceDirect MLWA - Arend e.a. - "MLPro 2.0 - Online machine learning...

19
Experimental
88 GZ30eee/DataVerse

DataVerse is an innovative platform that empowers users with advanced data...

18
Experimental
89 NERC-CEH/DSFP-PyExplorer

A Python package for doing exploratory data analysis of collections on the...

17
Experimental
90 FixML/FixML_Paper

A repository for developing a paper focused on the FixML system.

17
Experimental
91 lareadeola/CleanTweet

CleanTweet is a Python library created for cleaning textual data fetched from an API.

15
Experimental
92 LTxYan/Data-Reliability-Noisy-Input-Handling-in-ML-Models

🔍 Analyze how noisy and incomplete data impacts machine learning model...

14
Experimental
93 lamastex/ScaDaMaLe

Scalable Data Science and Distributed Machine Learning Course Book written...

14
Experimental
94 garimamittal13/SMAI-M25

Data analysis, statistical modeling, clustering, forecasting, and deep...

14
Experimental
95 jyhuang201900/Orange-Engine

Integrate the Orange Engine with ease using our free library and sample...

14
Experimental
96 nguyencongtri/data12

🚀 Build scalable enterprise applications with a robust architecture that...

14
Experimental
97 tauseef1234/Python_Starter_Toolkit

Python toolkit to get started with data science and machine learning projects

13
Experimental
98 mentoratechnologies/PurifyFactory-Beta

PurifyFactory v9.1.6: Beta Tester Program

11
Experimental
99 rampal-punia/data-science-toolkit

Your Go-To Resource for Essential Data Science Related Commands, Concepts,...

11
Experimental
100 burning-issues/burn_iss-website

The repository supporting our webpage

11
Experimental
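The tier labels in the table appear to follow score bands. The sketch below encodes the cut-offs that can be read off the rows above: scores 73 to 78 are Verified, 50 to 66 Established, 30 to 49 Emerging, and 29 or lower Experimental. The exact boundaries are an assumption inferred from this listing, not a published rule. In particular, no project here scores exactly 70, so treating 70 itself as Verified is a guess. The function name `tier_for` is hypothetical.

```python
def tier_for(score):
    """Map a 0-100 quality score to a tier label.

    Thresholds are inferred from the listing, not from official
    documentation; the handling of a score of exactly 70 is assumed.
    """
    if score >= 70:
        return "Verified"
    if score >= 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"


# A few (project, score) pairs copied from the table above.
sample = [
    ("skrub-data/skrub", 78),
    ("AutoViML/pandas_dq", 50),
    ("msamogh/nonechucks", 49),
    ("sbettid/GPSClean", 30),
    ("galafis/awesome-data-science-toolkit", 29),
]

for name, score in sample:
    print(f"{name}: {score} -> {tier_for(score)}")
```

The sample rows sit on either side of each boundary, which makes them a convenient check that the inferred bands reproduce the labels shown in the table.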

Comparisons in this category