kevin-hanselman/dud

A lightweight CLI tool for versioning data alongside source code and building data pipelines.

/ 100

Emerging

This tool helps data professionals manage large files and directories, such as datasets or models, alongside their project code. It lets you commit, check out, fetch, and push these large data assets using simple commands. Data scientists, machine learning engineers, and even digital designers can use it to keep their data in sync with their code versions.
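To make the commit and check-out idea concrete, here is a minimal Python sketch of content-addressed file versioning, a technique that data-versioning tools commonly use. It is not Dud's actual implementation: this listing does not describe Dud's internals, and the `commit`, `checkout`, and `demo` functions, the `cache` directory, and the file names are all hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def commit(path: Path, cache: Path) -> str:
    """Copy a file into the cache under its SHA-256 digest and return the digest."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    cache.mkdir(parents=True, exist_ok=True)
    target = cache / digest
    if not target.exists():  # identical content is stored only once
        shutil.copy2(path, target)
    return digest


def checkout(digest: str, cache: Path, dest: Path) -> None:
    """Restore the content recorded under `digest` to `dest`."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(cache / digest, dest)


def demo() -> bool:
    """Commit a file, overwrite it, then check the committed version back out."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "dataset.csv"   # hypothetical large data file
        cache = root / "cache"        # hypothetical local cache directory
        original = "id,value\n1,3.14\n"
        data.write_text(original)
        digest = commit(data, cache)  # only this short digest needs to live next to the code
        data.write_text("overwritten")
        checkout(digest, cache, data)
        return data.read_text() == original


if __name__ == "__main__":
    print(demo())
```

In this sketch the large file lives in the cache, and only its digest needs to be tracked alongside the source code, which is what allows a given code version to point at a matching data version.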

219 stars. No commits in the last 6 months.

Use this if you need a fast, lightweight way to version large data files and build data pipelines, especially if you prioritize speed and simplicity over an all-in-one machine learning platform.

Not ideal if you need integrated experiment tracking, metric logging, or a 'batteries-included' suite of tools for an entire machine learning workflow.

data-versioning data-pipelines data-management MLOps reproducibility

Stale 6m · No Package · No Dependents

Maintenance 2 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 9 / 25


Stars: 219

License: BSD-3-Clause

Higher-rated alternatives

mage-ai/mage-ai

🧙 Build, run, and manage data pipelines for integrating and transforming data.

vaexio/vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of...

alibaba/feathub

FeatHub - A stream-batch unified feature store for real-time machine learning

mindsdb/dbt-mindsdb

dbt adapter for connecting to MindsDB

Bread-Technologies/Bread-Dataset-Viewer

VS Code extension to easily view and handle large datasets. Look at JSONL/Parquet/CSV files...
