datachain-ai/datachain

Analytics, Versioning and ETL for multimodal data: video, audio, PDFs, images

73 / 100 (Verified)

DataChain helps data scientists and ML engineers manage and analyze large collections of unstructured data like images, videos, audio, and text. You input raw files stored in cloud storage or local file systems, along with any existing metadata, to create a structured dataset. The output is a versioned, queryable dataset that can be used for analytics, model training, or filtered for export.

2,729 stars. Used by 1 other package. Actively maintained with 39 commits in the last 30 days. Available on PyPI.

Use this if you need to efficiently transform, enrich, and version large, evolving datasets of unstructured multimodal data, especially when integrating with AI models or LLMs.

Not ideal if your primary data consists of structured tables and you don't work with unstructured files like images, videos, or PDFs.

MLOps, data-warehousing, computer-vision, natural-language-processing, data-versioning

Maintenance: 20 / 25
Adoption: 11 / 25
Maturity: 25 / 25
Community: 17 / 25
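
The four category scores listed above add up to the overall 73 / 100. The short check below assumes the overall score is a plain, unweighted sum of the categories; the page does not state the formula, so treat that as an inference from the numbers rather than a documented rule.

```python
# Category scores as listed on this page, each out of 25.
SUBSCORES = {
    "Maintenance": 20,
    "Adoption": 11,
    "Maturity": 25,
    "Community": 17,
}
MAX_PER_CATEGORY = 25

# Assumption: the overall score is the unweighted sum of the categories.
total = sum(SUBSCORES.values())
max_total = MAX_PER_CATEGORY * len(SUBSCORES)

print(f"{total} / {max_total}")
```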

Stars: 2,729
Forks: 136
Language: Python
License: Apache-2.0
Last pushed: Mar 12, 2026
Commits (30d): 39
Dependencies: 36
Reverse dependents: 1

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/mlops/datachain-ai/datachain"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
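
The curl request above could be made from Python roughly as follows. This is a minimal sketch using only the standard library. It assumes the path follows a `category/owner/repo` pattern (inferred from the single URL shown) and that the endpoint returns JSON; neither is confirmed on this page. The helper names `build_quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the quality URL for one repository.

    Assumption: the path is <category>/<owner>/<repo>, as in the curl example.
    """
    parts = (quote(part, safe="") for part in (category, owner, repo))
    return f"{BASE_URL}/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the quality record and decode it.

    Assumption: the response body is JSON; its fields are not documented here.
    """
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Prints the same URL used in the curl example; no network call is made.
    print(build_quality_url("mlops", "datachain-ai", "datachain"))
```

Calling `fetch_quality("mlops", "datachain-ai", "datachain")` performs the actual request and counts against the daily request limit.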