duoan/mega-data-factory
🏭 Mega Scale Multimodal DataPipeline for SOTA Foundation Models
A toolkit for researchers and engineers building foundation models such as large language models or image generators. It takes large volumes of raw, messy web data (text, images, video) and turns them into cleaned, filtered, and deduplicated datasets ready for model training.
354 stars.
Use this if you need to create extremely large, high-quality, multimodal datasets for training advanced AI foundation models.
Not ideal if you are working with small datasets, structured data, or do not need to process hundreds of billions of data points.
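To make the clean, filter, and deduplicate flow described above concrete, here is a minimal single-machine sketch in plain Python. It is not the mega-data-factory API: the function names, the word-count filter, and the exact-match hashing are all assumptions chosen for illustration, and a real pipeline at this scale would be distributed and would likely use fuzzier deduplication.

```python
import hashlib
import re


def clean(text: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def keep(text: str, min_words: int = 3) -> bool:
    """Hypothetical quality filter: drop records with too few words."""
    return len(text.split()) >= min_words


def dedupe(records):
    """Yield each record once, keyed by a SHA-256 digest of its lowercased text."""
    seen = set()
    for text in records:
        key = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield text


def build_dataset(raw_records, min_words: int = 3):
    """Run clean -> filter -> exact dedup over an iterable of strings."""
    cleaned = (clean(r) for r in raw_records)
    filtered = (r for r in cleaned if keep(r, min_words))
    return list(dedupe(filtered))


raw = [
    "  A cat   sat on the mat. ",
    "a cat sat on the mat.",        # duplicate after lowercasing
    "ok",                           # too short, filtered out
    "Dogs chase\nballs in parks.",
]
dataset = build_dataset(raw)
print(dataset)
```

Each stage is a generator, so records stream through without holding intermediate copies in memory; only the set of seen digests grows with the input.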
Stars: 354
Forks: 44
Language: Python
License: MIT
Category:
Last pushed: Mar 11, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/mlops/duoan/mega-data-factory"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
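The same request can be made from Python using only the standard library. This is a sketch under assumptions: the helper names `quality_url` and `fetch_quality` are hypothetical, the response is assumed to be JSON (the page does not show the response format), and no API key is sent because the page does not say how a key is passed.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path copied from the curl example; assumed to be shared by other repos.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/mlops"


def quality_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository (hypothetical helper)."""
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for a repository; assumes the body is JSON."""
    with urllib.request.urlopen(quality_url(owner, repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


url = quality_url("duoan", "mega-data-factory")
print(url)
```

Calling `fetch_quality("duoan", "mega-data-factory")` performs the actual request. With the keyless limit of 100 requests/day, it is worth caching results locally rather than refetching on every run.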