duoan/mega-data-factory

🏭 Mega Scale Multimodal DataPipeline for SOTA Foundation Models

Score: 48 / 100 (Emerging)

A toolkit for researchers and engineers building foundation models such as large language models or image generators. It helps you collect massive amounts of raw internet data (text, images, video) and transform it into training datasets: you feed in large volumes of messy web data, and it outputs a cleaned, filtered, and deduplicated dataset ready for model training.
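As a rough illustration of the clean, filter, and deduplicate stages described above, here is a minimal single-machine sketch for text records. It does not use mega-data-factory's own API. The function names (`clean`, `keep`, `dedupe`, `run_pipeline`), the word-count threshold, and the exact-match hashing are all assumptions chosen for illustration. A pipeline at real scale would be distributed and would likely use fuzzier deduplication.

```python
import hashlib
import re
from typing import Iterable, Iterator, List


def clean(text: str) -> str:
    """Collapse runs of whitespace and strip both ends."""
    return re.sub(r"\s+", " ", text).strip()


def keep(text: str, min_words: int = 3) -> bool:
    """Hypothetical quality filter: drop records with too few words."""
    return len(text.split()) >= min_words


def dedupe(texts: Iterable[str]) -> Iterator[str]:
    """Exact deduplication on a hash of the lowercased text."""
    seen = set()
    for text in texts:
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield text


def run_pipeline(raw: Iterable[str]) -> List[str]:
    """Chain the three stages lazily, then materialize the result."""
    cleaned = (clean(t) for t in raw)
    filtered = (t for t in cleaned if keep(t))
    return list(dedupe(filtered))


raw_records = [
    "  A cat   sits on the mat. ",
    "a cat sits on the mat.",      # duplicate after lowercasing
    "too short",                   # dropped by the word-count filter
    "Large models need large datasets.",
]
result = run_pipeline(raw_records)
```

Here `result` keeps the first cat sentence and the final sentence. The second record is dropped as a case-insensitive duplicate and the third as too short.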

354 stars.

Use this if you need to create extremely large, high-quality, multimodal datasets for training advanced AI foundation models.

Not ideal if you are working with small or structured datasets, or if you do not need to process hundreds of billions of data points.

Tags: AI model training, Large Language Model (LLM), multimodal data processing, dataset curation, foundation models
No Package, No Dependents
Maintenance 10 / 25
Adoption 10 / 25
Maturity 11 / 25
Community 17 / 25


Stars: 354
Forks: 44
Language: Python
License: MIT
Last pushed: Mar 11, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/mlops/duoan/mega-data-factory"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
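The curl request above can also be issued from Python using only the standard library. This is a sketch under assumptions: the helper names `quality_url` and `fetch_quality` are hypothetical, the `category/owner/repo` path pattern is inferred from the single URL shown above, and the response is assumed to be JSON, since the page does not document the response schema.

```python
import json
import urllib.request

# Base path taken from the curl example; the segment layout after it
# (category/owner/repo) is an assumption inferred from that one URL.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch and decode the response, assuming the body is JSON."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example call (performs a network request, counts against the daily limit):
# data = fetch_quality("mlops", "duoan", "mega-data-factory")
```

With the keyless limit of 100 requests/day, it is worth caching the decoded result locally rather than calling `fetch_quality` repeatedly for the same repository.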