GAIR-NLP/ProX

[ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale

Quality score: 40 / 100 (Emerging)

This project helps machine learning engineers and researchers improve the quality of large language model (LLM) training data. It takes raw, large-scale text datasets and automatically cleans and refines them. The result is a higher-quality pre-training corpus that leads to better-performing LLMs across various tasks, including general domain and specialized areas like mathematics.

266 stars. No commits in the last 6 months.

Use this if you are a machine learning engineer or researcher looking to create more effective large language models by starting with expertly refined, high-quality training data.

Not ideal if you are looking for a tool to fine-tune an existing LLM or for data preparation outside of large-scale pre-training datasets.

Tags: LLM pre-training, data quality, natural language processing, machine learning research, AI model development
Badges: Stale (6m), No Package, No Dependents
Maintenance: 2 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 12 / 25


Stars: 266
Forks: 17
Language: Python
License: Apache-2.0
Last pushed: Jul 08, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/GAIR-NLP/ProX"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
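For programmatic access, the endpoint above can also be called from Python. This is a minimal sketch, not official client code: the helper name is mine, and the assumption that the endpoint returns JSON is not documented on this page.

```python
from urllib.parse import quote

# Base path taken from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"


def quality_url(owner: str, repo: str) -> str:
    """Build the quality-endpoint URL for a given GitHub owner/repo pair."""
    return f"{BASE}/{quote(owner)}/{quote(repo)}"


# Example: the URL for this project.
url = quality_url("GAIR-NLP", "ProX")

# To actually fetch it (response format assumed to be JSON; verify before parsing):
# import json, urllib.request
# with urllib.request.urlopen(url) as resp:
#     data = json.load(resp)
```

Keeping the URL construction separate from the request makes it easy to swap in any HTTP client and to add an API key later once you have one.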