Curator and data-prep-kit

Curator
Score: 71 (Verified)
Maintenance: 22/25
Adoption: 10/25
Maturity: 16/25
Community: 23/25
Stars: 1,443
Forks: 230
Downloads: not listed
Commits (30d): 71
Language: Python
License: Apache-2.0
No Package, No Dependents

data-prep-kit
Score: 64 (Established)
Maintenance: 13/25
Adoption: 10/25
Maturity: 16/25
Community: 25/25
Stars: 906
Forks: 247
Downloads: not listed
Commits (30d): 2
Language: HTML
License: Apache-2.0
No Package, No Dependents
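
The four category scores listed above add up to each project's overall score (22 + 10 + 16 + 23 = 71 for Curator, 13 + 10 + 16 + 25 = 64 for data-prep-kit). The page does not state its scoring formula, so the simple sum in this short Python sketch is an assumption that happens to match the published totals; the `total_score` helper is hypothetical.

```python
# Category scores copied from the comparison above; each is out of 25.
scores = {
    "Curator": {"Maintenance": 22, "Adoption": 10, "Maturity": 16, "Community": 23},
    "data-prep-kit": {"Maintenance": 13, "Adoption": 10, "Maturity": 16, "Community": 25},
}


def total_score(categories):
    """Assumed formula: the overall score is the plain sum of the categories."""
    return sum(categories.values())


totals = {name: total_score(cats) for name, cats in scores.items()}
for name, total in totals.items():
    print(f"{name}: {total}")
# prints:
# Curator: 71
# data-prep-kit: 64
```

Seen this way, the seven-point gap comes entirely from Maintenance (22 vs. 13), partly offset by data-prep-kit's higher Community score (25 vs. 23); Adoption and Maturity are identical.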

About Curator

NVIDIA-NeMo/Curator

Scalable data preprocessing and curation toolkit for LLMs

This tool helps AI engineers and researchers prepare massive datasets for training large language models and other generative AI models. It takes raw text, images, video, or audio data from various sources and outputs cleaned, filtered, and deduplicated datasets. The primary users are MLOps engineers and AI researchers focused on building and improving large-scale AI models.
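
To make the "cleaned, filtered, and deduplicated" steps concrete, here is a minimal sketch in plain Python for text only. It does not use Curator's API; the function names (`clean`, `keep`, `curate`), the word-count threshold, and the exact-match hashing are all illustrative assumptions, and a production toolkit would typically add fuzzy deduplication and distributed execution.

```python
import hashlib
import re


def clean(text):
    """Collapse runs of whitespace and strip leading/trailing space."""
    return re.sub(r"\s+", " ", text).strip()


def keep(text, min_words=3):
    """Hypothetical quality filter: drop documents with too few words."""
    return len(text.split()) >= min_words


def curate(docs):
    """Clean, filter, and exact-deduplicate an iterable of strings."""
    seen = set()
    curated = []
    for doc in docs:
        text = clean(doc)
        if not keep(text):
            continue
        # Hash the lowercased text so case-only variants count as duplicates.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        curated.append(text)
    return curated


raw = [
    "  The quick   brown fox jumps. ",
    "the quick brown fox jumps.",
    "too short",
    "A second, distinct document about data curation.",
]
result = curate(raw)
print(result)
```

In this toy input, the second string is dropped as a case-insensitive duplicate of the first and "too short" fails the word-count filter, so two documents remain.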

AI model training, large language models, generative AI, data preprocessing, machine learning operations

About data-prep-kit

data-prep-kit/data-prep-kit

Open source project for data preparation for GenAI applications

This kit helps AI application developers prepare unstructured data for use in large language models (LLMs). It takes raw text, code, or image data from various sources like PDFs, HTML, or zip files and cleanses, transforms, and enriches it. The output is high-quality, structured data ready for pre-training, fine-tuning, or building Retrieval Augmented Generation (RAG) applications.
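
As a rough illustration of turning an unstructured source into a structured record, the sketch below extracts visible text from an HTML string using only the Python standard library. It is not data-prep-kit's API; the `TextExtractor` class, the `html_to_record` helper, and the record fields are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_record(source_name, html):
    """Return a small structured record (hypothetical schema) for one document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    return {"source": source_name, "text": text, "num_words": len(text.split())}


page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Hello world.</p></body></html>"
)
record = html_to_record("example.html", page)
print(record)
```

Records shaped like this can then be passed to later cleansing or enrichment steps, or indexed for retrieval in a RAG application.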

AI development, LLM data preparation, natural language processing, RAG applications, unstructured data

Scores updated daily from GitHub, PyPI, and npm data.