NVIDIA-NeMo/Curator
Scalable data preprocessing and curation toolkit for LLMs
This tool helps AI engineers and researchers prepare massive datasets for training large language models and other generative AI. It takes raw text, images, video, or audio data from various sources and outputs cleaned, filtered, and deduplicated datasets. The primary users are MLOps engineers and AI researchers focused on building and improving large-scale AI models.
1,443 stars. Actively maintained with 71 commits in the last 30 days.
Use this if you need to efficiently clean, filter, and deduplicate extremely large datasets across multiple modalities (text, images, video, audio) to improve the quality and performance of your AI models.
Not ideal if you are working with small datasets or if your primary need is not large-scale, GPU-accelerated data processing for AI model training.
Stars: 1,443
Forks: 230
Language: Python
License: Apache-2.0
Last pushed: Mar 12, 2026
Commits (30d): 71
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/NVIDIA-NeMo/Curator"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
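For scripted use, the same request can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the URL is copied from the curl command above, while the `Accept` header, the helper names `build_request` and `fetch_quality`, and the assumption that the endpoint returns a JSON body are all illustrative and not confirmed by this page.

```python
import json
import urllib.request

# URL copied verbatim from the curl command in the listing.
API_URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/NVIDIA-NeMo/Curator"


def build_request(url: str = API_URL) -> urllib.request.Request:
    """Build a plain GET request (hypothetical helper; no API key is sent)."""
    # The Accept header is an assumption; the listing does not specify one.
    return urllib.request.Request(url, headers={"Accept": "application/json"})


def fetch_quality(url: str = API_URL, timeout: float = 10.0):
    """Fetch and decode the response.

    Assumption: the endpoint responds with a JSON document. The field
    names in that document are not described on this page, so the
    parsed value is returned as-is.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality()` performs one network request, which counts against the 100 requests/day keyless limit, so it is worth caching the result if a script runs repeatedly.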
Related tools
MigoXLab/dingo
Dingo: A Comprehensive AI Data, Model and Application Quality Evaluation Tool
data-prep-kit/data-prep-kit
Open source project for data preparation for GenAI applications
TheDataStation/pneuma
LLM-Powered Data Discovery System for Tabular Data
cleanlab/cleanlab-studio
Client interface to Cleanlab Studio
jpmorganchase/CodeQuest
CodeQUEST is a generalizable framework which leverages LLMs to iteratively evaluate and enhance...