Data Quality Preprocessing ML Frameworks
Tools and techniques for assessing, cleaning, and preparing datasets for machine learning. Includes data validation, outlier detection, missing value handling, and dataset quality frameworks. Does NOT include domain-specific cleaning (e.g., text-only or image-only), general data science tutorials without code frameworks, or downstream ML modeling tasks.
This category tracks 100 data quality preprocessing frameworks. Of these, 3 score above 70 (the Verified tier). The highest-rated is skrub-data/skrub at 78/100 with 1,578 stars. 5 of the top 10 are actively maintained.
Get all 100 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=data-quality-preprocessing&limit=20"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
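The same request can be made from Python. The following is a minimal sketch using only the standard library; it reuses the endpoint and query parameters from the curl command above. The helper names `build_url` and `fetch_projects` are illustrative, not part of any official client, and the shape of the JSON response is not documented here, so the sketch returns the decoded payload without assuming its structure.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="ml-frameworks",
              subcategory="data-quality-preprocessing",
              limit=20):
    """Build the request URL with the same query parameters as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Send a GET request and return the decoded JSON payload.

    The response schema is an unknown here, so the payload is returned
    as-is (whatever json.loads produces) rather than unpacked.
    """
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_projects(build_url())` performs one request, which counts against the daily quota. Inspect the returned payload before relying on any particular field names.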
| # | Framework | Description | Score | Tier |
|---|---|---|---|---|
| 1 | skrub-data/skrub | Machine learning with dataframes | 78 | Verified |
| 2 | biolab/orange3 | 🍊 📊 💡 Orange: Interactive data analysis | | Verified |
| 3 | root-project/root | The official repository for ROOT: analyzing, storing and visualizing big... | | Verified |
| 4 | cleanlab/cleanlab | Cleanlab's open-source library is the standard data-centric AI package for... | | Established |
| 5 | drivendataorg/deon | A command line tool to easily add an ethics checklist to your data science projects. | | Established |
| 6 | deepnote/deepnote | Deepnote is a drop-in replacement for Jupyter with an AI-first design, sleek... | | Established |
| 7 | rhiever/datacleaner | A Python tool that automatically cleans data sets and readies them for analysis. | | Established |
| 8 | Renumics/spotlight | Interactively explore unstructured datasets from your dataframe. | | Established |
| 9 | JasonKessler/scattertext | Beautiful visualizations of how language differs among document types. | | Established |
| 10 | fbdesignpro/sweetviz | Visualize and compare datasets, target values and associations, with one... | | Established |
| 11 | Data-Centric-AI-Community/ydata-quality | Data Quality assessment with one line of code | | Established |
| 12 | bodo-ai/PyDough | Analytics DSL for Python | | Established |
| 13 | MPEDS/mpeds | Machine-learning Protest Event Data System | | Established |
| 14 | IRT-SystemX/dqm-ml | A library to compute data quality metrics | | Established |
| 15 | COM6012/ScalableML | COM6012 Scalable Machine Learning - University of Sheffield. Enjoy our... | | Established |
| 16 | ShimantoRahman/empulse | Value-driven and cost-sensitive analysis for scikit-learn | | Established |
| 17 | deepnote/deepnote-toolkit | Essential Python toolkit for Deepnote environments | | Established |
| 18 | altermarkive/shrubbery | Numerai Experiments | | Established |
| 19 | AutoViML/pandas_dq | Find data quality issues and clean your data in a single line of code with a... | | Established |
| 20 | msamogh/nonechucks | Deal with bad samples in your dataset dynamically, use Transforms as... | | Emerging |
| 21 | buabaj/xplore | A python package built for data scientist/analysts, AI/ML engineers for... | | Emerging |
| 22 | Olow304/Data-Science-Machine-Learning | The overall objective of this toolkit is to provide and offer a free... | | Emerging |
| 23 | scienxlab/redflag | Safety net for machine learning pipelines. Plays nice with sklearn and pandas. | | Emerging |
| 24 | gretl-project/gretl | Official mirror of the actively maintained repo on sourceforge | | Emerging |
| 25 | JacksonBurns/astartes | Better Data Splits for Machine Learning | | Emerging |
| 26 | SERG-Delft/dslinter | `dslinter` is a pylint plugin for linting data science and machine learning... | | Emerging |
| 27 | PAIR-code/facets | Visualizations for machine learning datasets | | Emerging |
| 28 | ml-tooling/ml-workspace | 🛠 All-in-one web-based IDE specialized for machine learning and data science. | | Emerging |
| 29 | matthewfeickert-talks/reproducible-ml-for-scientists-with-pixi-scipy-2025 | SciPy 2025 tutorial on "Reproducible Machine Learning Workflows for... | | Emerging |
| 30 | fusion-jena/MLProvLab | Provenance Management for Data Science Notebooks | | Emerging |
| 31 | Livingston-k/cleanPyData | cleanPyData is a Python package for data cleaning and preprocessing. It... | | Emerging |
| 32 | PKNU-PR-ML-Lab/orange | Machine learning and data analysis made easy with Orange (Orange3) | | Emerging |
| 33 | France-Travail/gabarit | Gabarit: kickstart your data science project from scratch | | Emerging |
| 34 | HelikarLab/candis | 🎀 A data mining suite for gene expression data. | | Emerging |
| 35 | microsoft/Data-Discovery-Toolkit | A data discovery and manipulation toolset for unstructured data | | Emerging |
| 36 | cssr-tools/ML_near_well | Runfiles for an ML near-well model and to reproduce results from the article... | | Emerging |
| 37 | pierpaolo28/Data-Visualization | Collection of interactive Jupyter Notebook widgets and graphs. | | Emerging |
| 38 | genular/pandora | PANDORA 💻 | | Emerging |
| 39 | Safe-DS/Stub-Generator | Automated generation of Safe-DS stubs for Python libraries. | | Emerging |
| 40 | Digital-Dermatology/SelfClean | [NeurIPS 2024] 🧼🔎 A holistic self-supervised data cleaning strategy to... | | Emerging |
| 41 | HazyResearch/meerkat | Explore and understand your training and validation data. | | Emerging |
| 42 | Renumics/sliceguard | A library for detecting problematic data segments in structured and... | | Emerging |
| 43 | cdr-book/cdr-book.github.io | Repository for the website of the book (github hosting support) | | Emerging |
| 44 | khuyentran1401/reproducible-data-science | Tutorials on creating a reproducible and maintainable data science project | | Emerging |
| 45 | ThomasWong2022/numerai-benchmark | Python Code used in publications, for archival purposes only | | Emerging |
| 46 | pmaji/data-science-toolkit | Collection of stats, modeling, and data science tools in Python and R. | | Emerging |
| 47 | seedatnabeel/Data-IQ | Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular... | | Emerging |
| 48 | dirty-data-science/python | Tutorial material on machine learning with dirty data in Python | | Emerging |
| 49 | BinaryResearch/centrifuge-toolkit | Tool for visualizing and empirically analyzing information encoded in binary files | | Emerging |
| 50 | synapticore-io/marimo-flow | Interactive ML notebooks with reactive updates, AI assistance, and MLflow tracking | | Emerging |
| 51 | CharlesAverill/satyrn | A Notebook alternative that supports branching code and local collaboration. | | Emerging |
| 52 | councilofelders/numereval | A small library to locally calculate the scores on numer.ai tournament's... | | Emerging |
| 53 | seedatnabeel/Data-SUITE | Data-SUITE: Data-centric identification of in-distribution incongruous... | | Emerging |
| 54 | fuseml/examples | A collection of machine learning projects serving as sample applications... | | Emerging |
| 55 | gianlucatruda/numerai | Quant. trading with ML on Numerai | | Emerging |
| 56 | awojinrin/ML-Workflow-for-the-Determination-of-Hole-Cleaning-Conditions | A repo containing Jupyter notebooks where ensemble algorithms are... | | Emerging |
| 57 | sumanthprabhu/DQC-Toolkit | Quality Checks for Training Data in Machine Learning | | Emerging |
| 58 | adipolak/scaling-machine-learning-course | Scaling Machine Learning in Three Week course in a collaboration with... | | Emerging |
| 59 | ahmedshahriar/PulsePoint-Data-Analytics | EDA, data processing, cleaning and extensive geospatial analysis on a... | | Emerging |
| 60 | numerai/signals-example-scripts | The official example scripts for the Numerai Signals Data Science Tournament | | Emerging |
| 61 | NatanMish/data_validation | Tutorial for implementing data validation in data science pipelines | | Emerging |
| 62 | sultanul-ovi/GPU-Cluster-Spot-Resource-Dataset-Analysis | Detailed Analysis Traces for AI jobs leveraging spot GPU resources | | Emerging |
| 63 | s-kav/ds_tools | Library consisting of additional & helpful functions for data science research stages | | Emerging |
| 64 | sbettid/GPSClean | An application to correct a GPS trace using machine learning techniques. To... | | Emerging |
| 65 | galafis/awesome-data-science-toolkit | 🚀 Comprehensive toolkit for data scientists with Python utilities, ML... | | Experimental |
| 66 | KaziAmitHasan/data-inspector | Data Inspector is an open-source python library that brings 15++ types of... | | Experimental |
| 67 | iterative/example-gto | Get Started GTO Project | | Experimental |
| 68 | ELHoussineT/AutoDataCleaner | Simple and automatic data cleaning in one line of code! It performs one-hot... | | Experimental |
| 69 | sarwarbeing-ai/Scaler | Scaler: Study Materials for Data Science and Machine Learning | | Experimental |
| 70 | LEL-A/GerAlpacaDataCleaned | German Alpaca Dataset (Cleaned + Translated) | | Experimental |
| 71 | pawlyk/dsml-tools | set of Data Science and Machine Learning tools | | Experimental |
| 72 | chiphuyen/metaflow-transformers-tutorials | Metaflow tutorials for ODSC West 2021 | | Experimental |
| 73 | sturlese/numerai_signals_pipeline | Downloads data from Yahoo Finance, generates features, trains a model and... | | Experimental |
| 74 | Diogolsn10/statistical-analysis | Provide well-documented statistical analysis tools in Python, R, and Stata... | | Experimental |
| 75 | Vis4Sense/ml-prov-binder | The code for running our Jupyter Lab extension on https://mybinder.org/ | | Experimental |
| 76 | berkaygediz/SolidSheets | 📊 A modern spreadsheet editor with ML integration, supporting real-time... | | Experimental |
| 77 | akashmi/ai-data-engineering-ecosystem-guide | A comprehensive reference guide mapping the entire AI, Machine Learning,... | | Experimental |
| 78 | NimoKwarkye/stats_tool_repo | XploreML is a node-based application built with dearpygui. This application... | | Experimental |
| 79 | AliAmini93/Data-Distribution-Finder | Developed a Windows-based app for analyzing data distributions and... | | Experimental |
| 80 | yuliu625/Yu-Data-Science-Toolkit | A modular data science toolkit for scientific research, featuring... | | Experimental |
| 81 | virbahu/dmaic-toolkit | Lean Six Sigma DMAIC toolkit statistical tests | | Experimental |
| 82 | RezaMoammadi/Book-Data-Science-R | If you're eager to explore data science, data analysis, and machine... | | Experimental |
| 83 | SakuraPuare/AlibabaTrace | Design and implementation of an analysis and visualization system for the Alibaba cluster dataset cluster-trace-v2018 | | Experimental |
| 84 | kjd-dktech/ml-data-analysis-pipeline | Exploratory Data Analysis and Data Modeling – Academic Setting | | Experimental |
| 85 | sultanul-ovi/Alibaba-GPU-Cluster-Dataset-2025-Analysis | Detailed Analysis Traces for GPU-Disaggregated Deep Learning Recommendation Models | | Experimental |
| 86 | rimonim/ds4psych | Data Science for Psychology: Natural Language | | Experimental |
| 87 | fhswf/paper-mlwa-mlpro-2.0 | Paper ScienceDirect MLWA - Arend e.a. - "MLPro 2.0 - Online machine learning... | | Experimental |
| 88 | GZ30eee/DataVerse | DataVerse is an innovative platform that empowers users with advanced data... | | Experimental |
| 89 | NERC-CEH/DSFP-PyExplorer | A Python package for doing exploratory data analysis of collections on the... | | Experimental |
| 90 | FixML/FixML_Paper | A repository for developing a paper focused on the FixML system. | | Experimental |
| 91 | lareadeola/CleanTweet | CleanTweet is a python library created for cleaning textual data fetched from an API. | | Experimental |
| 92 | LTxYan/Data-Reliability-Noisy-Input-Handling-in-ML-Models | 🔍 Analyze how noisy and incomplete data impacts machine learning model... | | Experimental |
| 93 | lamastex/ScaDaMaLe | Scalable Data Science and Distributed Machine Learning Course Book written... | | Experimental |
| 94 | garimamittal13/SMAI-M25 | Data analysis, statistical modeling, clustering, forecasting, and deep... | | Experimental |
| 95 | jyhuang201900/Orange-Engine | Integrate the Orange Engine with ease using our free library and sample... | | Experimental |
| 96 | nguyencongtri/data12 | 🚀 Build scalable enterprise applications with a robust architecture that... | | Experimental |
| 97 | tauseef1234/Python_Starter_Toolkit | Python toolkit to get started with data science and machine learning projects | | Experimental |
| 98 | mentoratechnologies/PurifyFactory-Beta | PurifyFactory v9.1.6: Beta Program for Beta Testers | | Experimental |
| 99 | rampal-punia/data-science-toolkit | Your Go-To Resource for Essential Data Science Related Commands, Concepts,... | | Experimental |
| 100 | burning-issues/burn_iss-website | The repository supporting our webpage | | Experimental |