duoan/mega-data-factory

🏭 Mega Scale Multimodal DataPipeline for SOTA Foundation Models

/ 100

Emerging

This is a toolkit for researchers and engineers building large AI models, such as large language models or image generators. It collects raw internet data at scale (text, images, video) and transforms it into training datasets: messy web data goes in, and a cleaned, filtered, and deduplicated dataset ready for model training comes out.
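
To make the clean, filter, and deduplicate flow concrete, here is a minimal text-only sketch in plain Python. It is not this project's API: the function names (`clean`, `keep`, `curate`) and the `min_words` threshold are hypothetical, and a pipeline at the scale described here would typically use distributed workers and near-duplicate detection rather than an in-memory set of exact hashes.

```python
import hashlib
import re
from typing import Iterable, Iterator


def clean(text: str) -> str:
    """Collapse runs of whitespace and strip leading/trailing spaces."""
    return re.sub(r"\s+", " ", text).strip()


def keep(text: str, min_words: int = 5) -> bool:
    """Hypothetical quality filter: drop records with too few words."""
    return len(text.split()) >= min_words


def curate(records: Iterable[str], min_words: int = 5) -> Iterator[str]:
    """Yield cleaned records that pass the filter, skipping exact duplicates."""
    seen: set[bytes] = set()
    for raw in records:
        text = clean(raw)
        if not keep(text, min_words):
            continue
        # Case-insensitive exact deduplication via a content hash.
        digest = hashlib.sha256(text.lower().encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text


if __name__ == "__main__":
    raw = [
        "The  quick brown fox jumps over",
        "the quick brown fox jumps over",
        "too short",
        "  Another   record with enough words here ",
    ]
    for record in curate(raw):
        print(record)
```

Because `curate` is a generator, records stream through one at a time; only the set of hashes is held in memory, which is the part a larger system would need to shard or replace.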

354 stars.

Use this if you need to create extremely large, high-quality, multimodal datasets for training advanced AI foundation models.

Not ideal if you are working with small datasets or structured data, or if you do not need to process hundreds of billions of data points.

AI model training, Large Language Model (LLM), multimodal data processing, dataset curation, foundation models

No Package, No Dependents

Maintenance 10 / 25

Adoption 10 / 25

Maturity 11 / 25

Community 17 / 25


Stars

354

Forks

Language

Python

License

MIT

Higher-rated alternatives

datajuicer/data-juicer

Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

dermatologist/pyomop

Python package for managing OHDSI clinical data models. Includes support for LLM based plain...
