datajuicer/data-juicer
Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
This tool helps AI engineers and data scientists prepare massive, messy datasets for training foundation models. It takes raw text, images, audio, or video and provides modular, scalable operations to clean, filter, and transform the data. The output is high-quality, AI-ready data suitable for tasks like pre-training large language models or building agent systems.
6,051 stars. Actively maintained with 10 commits in the last 30 days.
Use this if you need to process vast amounts of diverse, raw data into a clean, structured format for AI model development, especially for large-scale foundation models or agent systems.
Not ideal if your data processing needs are small-scale, involve simple spreadsheet clean-up, or don't require advanced multimodal operations.
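The clean, filter, and transform steps described above can be sketched in plain Python. This is only an illustration of the general pattern and does not use Data-Juicer's own API: the record shape (`{"text": ...}`) and the names `normalize_whitespace`, `min_words`, and `run_pipeline` are all hypothetical.

```python
import re
from typing import Callable, Dict, Iterable, Iterator, List

Sample = Dict[str, str]  # hypothetical record shape: {"text": "..."}


def normalize_whitespace(sample: Sample) -> Sample:
    """Transform step: collapse runs of whitespace and trim the ends."""
    return {**sample, "text": re.sub(r"\s+", " ", sample["text"]).strip()}


def min_words(n: int) -> Callable[[Sample], bool]:
    """Filter factory: keep only samples with at least n whitespace-separated words."""
    def keep(sample: Sample) -> bool:
        return len(sample["text"].split()) >= n
    return keep


def run_pipeline(
    samples: Iterable[Sample],
    mappers: List[Callable[[Sample], Sample]],
    filters: List[Callable[[Sample], bool]],
) -> Iterator[Sample]:
    """Apply every mapper, then every filter, then drop exact duplicate texts."""
    seen = set()
    for sample in samples:
        for mapper in mappers:
            sample = mapper(sample)
        if not all(f(sample) for f in filters):
            continue
        if sample["text"] in seen:
            continue
        seen.add(sample["text"])
        yield sample


raw = [
    {"text": "  Hello   world, this is   a test. "},
    {"text": "too short"},
    {"text": "Hello world, this is a test."},
]
cleaned = list(run_pipeline(raw, [normalize_whitespace], [min_words(4)]))
```

Because each step is an independent function, steps can be added, removed, or reordered without touching the others, which is the sense in which such operations are modular.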
Stars: 6,051
Forks: 339
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 13, 2026
Commits (30d): 10
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/mlops/datajuicer/data-juicer"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
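The same request can be made from Python with only the standard library. This is a minimal sketch under two assumptions that the page does not confirm: that the endpoint returns JSON, and that the three path segments after `quality/` are a category, a repository owner, and a repository name. The helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL; the meaning of each segment is an assumption."""
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return "/".join([BASE_URL, *parts])


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record and decode it, assuming the response body is JSON."""
    request = Request(
        quality_url(category, owner, repo),
        headers={"Accept": "application/json"},
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_quality("mlops", "datajuicer", "data-juicer")` issues the same GET request as the curl command. With no key, the stated limit of 100 requests per day applies, so it is worth caching the result rather than polling.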