NVIDIA-NeMo/Curator

Scalable data preprocessing and curation toolkit for LLMs

Score: 71 / 100 (Verified)

This tool helps AI engineers and researchers prepare massive datasets for training large language models and other generative AI. It takes raw text, images, video, or audio data from various sources and outputs cleaned, filtered, and deduplicated datasets. The primary users are MLOps engineers and AI researchers focused on building and improving large-scale AI models.

1,443 stars. Actively maintained with 71 commits in the last 30 days.

Use this if you need to efficiently clean, filter, and deduplicate extremely large datasets across multiple modalities (text, images, video, audio) to improve the quality and performance of your AI models.

Not ideal if you are working with small datasets or if your primary need is not large-scale, GPU-accelerated data processing for AI model training.
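
To make the clean, filter, and deduplicate workflow described above concrete, here is a toy sketch in plain Python. It does not use Curator's API and is not how Curator is implemented. The function names (`clean`, `keep`, `curate`) and the `min_words` threshold are hypothetical, chosen only for illustration, and the exact-hash deduplication is a stand-in for whatever methods the toolkit actually provides.

```python
import hashlib
import re
import unicodedata


def clean(text: str) -> str:
    """Normalize Unicode (NFKC) and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def keep(text: str, min_words: int = 3) -> bool:
    """Hypothetical quality filter: drop documents with too few words."""
    return len(text.split()) >= min_words


def curate(docs, min_words: int = 3):
    """Clean, filter, then exact-deduplicate (case-insensitive) a list of texts."""
    seen = set()
    out = []
    for raw in docs:
        text = clean(raw)
        if not keep(text, min_words):
            continue
        # Hash the lowercased text so duplicates differing only in case collapse.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        out.append(text)
    return out
```

A single-process loop like this works for small inputs, which is the case the "Not ideal" note above describes. The listing positions Curator for datasets large enough to need GPU-accelerated processing.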

AI model training, large language models, generative AI, data preprocessing, machine learning operations
No package, no dependents
Maintenance 22 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 23 / 25


Stars: 1,443
Forks: 230
Language: Python
License: Apache-2.0
Last pushed: Mar 12, 2026
Commits (30d): 71

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/NVIDIA-NeMo/Curator"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
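
The curl call above can also be made from Python. The sketch below uses only the standard library and assumes the endpoint returns a JSON body; the listing does not document the response schema, so the code parses it generically. The helper names `quality_url` and `fetch_quality` are hypothetical, and no API-key handling is shown because the listing does not say how a key is passed.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example in the listing.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def quality_url(owner: str, repo: str) -> str:
    """Build the per-repository URL, percent-encoding each path segment."""
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the quality record. Assumes (unverified) that the body is JSON."""
    request = urllib.request.Request(
        quality_url(owner, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (performs a network request and counts against the daily limit):
# record = fetch_quality("NVIDIA-NeMo", "Curator")
# print(json.dumps(record, indent=2))
```

With the keyless limit of 100 requests/day, it is worth caching the result locally rather than calling `fetch_quality` in a loop.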