NVIDIA-NeMo/Curator

Scalable data preprocessing and curation toolkit for LLMs

71 / 100

Verified

This tool helps AI engineers and researchers prepare massive datasets for training large language models and other generative AI models. It takes raw text, image, video, or audio data from various sources and outputs cleaned, filtered, and deduplicated datasets. The primary users are MLOps engineers and AI researchers who build and improve large-scale AI models.
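To make the clean, filter, and deduplicate steps concrete, here is a minimal single-machine sketch in plain Python. It does not use Curator's API; the function names (`clean`, `keep`, `curate`), the five-word threshold, and the hash-based exact-match deduplication are assumptions chosen for illustration only.

```python
import hashlib
import re
import unicodedata


def clean(text: str) -> str:
    """Normalize Unicode and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def keep(text: str, min_words: int = 5) -> bool:
    """Hypothetical quality filter: drop very short documents."""
    return len(text.split()) >= min_words


def curate(docs):
    """Yield cleaned documents that pass the filter, skipping exact duplicates."""
    seen = set()
    for raw in docs:
        text = clean(raw)
        if not keep(text):
            continue
        # Case-insensitive exact-match deduplication via a content hash.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text


raw_docs = [
    "The quick  brown fox jumps over the lazy dog.",
    "the quick brown fox jumps over the lazy dog.",
    "Too short.",
    "Curation pipelines clean, filter, and deduplicate text.",
]
curated = list(curate(raw_docs))
print(curated)
```

A toolkit built for very large datasets would typically distribute these steps across many workers and add fuzzy or semantic deduplication; the sketch only shows the shape of the pipeline.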

1,443 stars. Actively maintained with 71 commits in the last 30 days.

Use this if you need to efficiently clean, filter, and deduplicate extremely large datasets across multiple modalities (text, images, video, audio) to improve the quality and performance of your AI models.

Not ideal if you are working with small datasets or if your primary need is not large-scale, GPU-accelerated data processing for AI model training.

AI model training, large language models, generative AI, data preprocessing, machine learning operations

No Package, No Dependents

Maintenance 22 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 23 / 25


Stars: 1,443

Forks: 230

Language: Python

License: Apache-2.0

Compare

Curator and data-prep-kit

Related tools

MigoXLab/dingo

Dingo: A Comprehensive AI Data, Model and Application Quality Evaluation Tool

data-prep-kit/data-prep-kit

Open source project for data preparation for GenAI applications

TheDataStation/pneuma

LLM-Powered Data Discovery System for Tabular Data

cleanlab/cleanlab-studio

Client interface to Cleanlab Studio

jpmorganchase/CodeQuest

CodeQUEST is a generalizable framework which leverages LLMs to iteratively evaluate and enhance...
