datachain-ai/datachain
Analytics, Versioning and ETL for multimodal data: video, audio, PDFs, images
DataChain helps data scientists and ML engineers manage and analyze large collections of unstructured data such as images, videos, audio, and text. You supply raw files stored in cloud storage or on a local file system, along with any existing metadata, and DataChain builds a structured dataset from them. The result is a versioned, queryable dataset that can be used for analytics or model training, or filtered for export.
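The flow described above (raw files plus metadata in, a versioned and filterable dataset out) can be illustrated with a small standard-library sketch. This is not DataChain's API: `Record`, `KINDS`, `build_records`, and `VersionedDataset` are hypothetical names invented for illustration, and the `s3://bucket/...` paths are made up.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass(frozen=True)
class Record:
    """One structured row describing a raw file (hypothetical schema)."""
    path: str
    kind: str
    label: Optional[str]


# Hypothetical mapping from file extension to modality.
KINDS = {".jpg": "image", ".png": "image", ".mp4": "video", ".wav": "audio", ".pdf": "pdf"}


def build_records(paths, metadata):
    """Join raw file paths with optional per-file metadata into structured rows."""
    records = []
    for p in paths:
        suffix = PurePosixPath(p).suffix.lower()
        records.append(Record(p, KINDS.get(suffix, "other"), metadata.get(p)))
    return records


class VersionedDataset:
    """Keeps every saved snapshot so earlier versions stay readable."""

    def __init__(self):
        self._versions = []

    def save(self, records):
        """Store a snapshot and return its 1-based version number."""
        self._versions.append(tuple(records))
        return len(self._versions)

    def get(self, version=None):
        """Return the rows of a given version, or of the latest one by default."""
        index = len(self._versions) if version is None else version
        return list(self._versions[index - 1])


paths = ["s3://bucket/cat.jpg", "s3://bucket/talk.mp4", "s3://bucket/report.pdf"]
metadata = {"s3://bucket/cat.jpg": "cat"}

dataset = VersionedDataset()
v1 = dataset.save(build_records(paths, metadata))            # full dataset
images = [r for r in dataset.get(v1) if r.kind == "image"]  # query by modality
v2 = dataset.save(images)                                    # filtered subset as a new version
```

Saving the filtered subset as a new version leaves version 1 intact, which is the property that makes an evolving dataset reproducible.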
2,729 stars. Used by 1 other package. Actively maintained with 39 commits in the last 30 days. Available on PyPI.
Use this if you need to efficiently transform, enrich, and version large, evolving datasets of unstructured multimodal data, especially when integrating with AI models or LLMs.
Not ideal if your primary data consists of structured tables and you don't work with unstructured files like images, videos, or PDFs.
Stars: 2,729
Forks: 136
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 12, 2026
Commits (30d): 39
Dependencies: 36
Reverse dependents: 1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/mlops/datachain-ai/datachain"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
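The same request can be made from Python with the standard library. The sketch below assumes the three trailing path segments of the URL above are a category, a repository owner, and a repository name, and that the endpoint returns JSON; neither is confirmed by this page. `quality_url` and `fetch_quality` are hypothetical helper names.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL; the segment meanings are an assumption."""
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return f"{BASE_URL}/{'/'.join(parts)}"


def fetch_quality(category, owner, repo, timeout=10.0):
    """Fetch and decode the response, assuming the body is JSON."""
    with urlopen(quality_url(category, owner, repo), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


url = quality_url("mlops", "datachain-ai", "datachain")
```

Calling `fetch_quality("mlops", "datachain-ai", "datachain")` performs the actual network request and counts against the daily limit.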