datajuicer/data-juicer

Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

Established

This tool helps AI engineers and data scientists prepare massive, messy datasets for training foundation models. It takes raw text, images, audio, or video and provides modular, scalable operations to clean, filter, and transform the data. The output is high-quality, AI-ready data suitable for tasks like pre-training large language models or building agent systems.

6,051 stars. Actively maintained with 10 commits in the last 30 days.

Use this if you need to process vast amounts of diverse, raw data into a clean, structured format for AI model development, especially for large-scale foundation models or agent systems.

Not ideal if your data processing needs are small-scale, involve simple spreadsheet clean-up, or don't require advanced multimodal operations.

AI-data-preparation foundation-model-training large-scale-data-curation multimodal-data-processing MLOps-data-engineering

No Package

No Dependents

Maintenance 17 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 18 / 25

Stars: 6,051

Forks: 339

Language: Python

License: Apache-2.0

Recent Releases

v1.5.1 17 Mar 2026

v1.5.0 26 Feb 2026

v1.4.6 02 Feb 2026

v1.4.5 13 Jan 2026

v1.4.4 01 Dec 2025

Related tools

dermatologist/pyomop

Python package for managing OHDSI clinical data models. Includes support for LLM based plain...

duoan/mega-data-factory

🏭 Mega Scale Multimodal DataPipeline for SOTA Foundation Models
