datajuicer/data-juicer

Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

61 / 100 (Established)

This tool helps AI engineers and data scientists prepare massive, messy datasets for training foundation models. It takes raw text, images, audio, or video and provides modular, scalable operations to clean, filter, and transform the data. The output is high-quality, AI-ready data suitable for tasks like pre-training large language models or building agent systems.
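
To make "modular operations" concrete, here is a minimal sketch in plain Python of chaining a cleaning step and a filtering step over text samples. It does not use Data-Juicer's actual API: the names `normalize_whitespace`, `min_length_filter`, and `run_pipeline` are hypothetical and exist only for illustration. Real use would rely on the project's own operators and configuration.

```python
from typing import Callable, Iterable, Iterator, Optional

Sample = dict
# An operator returns a (possibly modified) sample, or None to drop it.
Operator = Callable[[Sample], Optional[Sample]]


def normalize_whitespace(sample: Sample) -> Optional[Sample]:
    """Cleaning step (hypothetical): collapse runs of whitespace in the text field."""
    return {**sample, "text": " ".join(sample["text"].split())}


def min_length_filter(min_chars: int) -> Operator:
    """Filtering step (hypothetical): drop samples shorter than min_chars."""
    def op(sample: Sample) -> Optional[Sample]:
        return sample if len(sample["text"]) >= min_chars else None
    return op


def run_pipeline(samples: Iterable[Sample], ops: list[Operator]) -> Iterator[Sample]:
    """Apply each operator in order; stop early once a sample is dropped."""
    for sample in samples:
        current: Optional[Sample] = sample
        for op in ops:
            current = op(current)
            if current is None:
                break
        if current is not None:
            yield current


raw = [{"text": "  hello   world  "}, {"text": " hi "}]
cleaned = list(run_pipeline(raw, [normalize_whitespace, min_length_filter(5)]))
print(cleaned)
```

Because each step is a small independent function, steps can be reordered or swapped without touching the rest of the pipeline, which is the general idea behind operator-based data processing.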

6,051 stars. Actively maintained with 10 commits in the last 30 days.

Use this if you need to process vast amounts of diverse, raw data into a clean, structured format for AI model development, especially for large-scale foundation models or agent systems.

Not ideal if your data processing needs are small-scale, involve simple spreadsheet clean-up, or don't require advanced multimodal operations.

Tags: AI-data-preparation, foundation-model-training, large-scale-data-curation, multimodal-data-processing, MLOps-data-engineering
No package, no dependents
Maintenance 17 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 18 / 25
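
The four category scores above add up to the overall score shown for this project. The snippet below checks that arithmetic. Whether the total is always a plain sum of the categories is an assumption, since the scoring method is not described here.

```python
# Category scores as listed above, each out of 25.
category_scores = {"Maintenance": 17, "Adoption": 10, "Maturity": 16, "Community": 18}

total = sum(category_scores.values())
max_total = 25 * len(category_scores)
print(f"{total} / {max_total}")  # prints "61 / 100"
```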

Stars: 6,051
Forks: 339
Language: Python
License: Apache-2.0
Last pushed: Mar 13, 2026
Commits (30d): 10

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/mlops/datajuicer/data-juicer"

Open to everyone: 100 requests/day with no key needed, or 1,000 requests/day with a free key.
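
The curl request above can also be made from Python using only the standard library. This is a minimal sketch under two assumptions not confirmed by the page: that the URL follows a `category/owner/repo` pattern (generalized from the single example), and that the response body is JSON. The helper names `build_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL, assuming the pattern seen in the curl example."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the endpoint and decode the body; assumes the response is JSON."""
    url = build_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the URL does not touch the network.
url = build_url("mlops", "datajuicer", "data-juicer")
print(url)
```

Calling `fetch_quality("mlops", "datajuicer", "data-juicer")` performs the actual request and counts against the daily request limit.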