GAIR-NLP/ProX
[ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
This project helps machine learning engineers and researchers improve the quality of large language model (LLM) training data. It takes raw, large-scale text datasets and automatically cleans and refines them. The result is a higher-quality pre-training corpus that yields better-performing LLMs, both in the general domain and in specialized areas such as mathematics.
266 stars. No commits in the last 6 months.
Use this if you are a machine learning engineer or researcher looking to create more effective large language models by starting with expertly refined, high-quality training data.
Not ideal if you are looking for a tool to fine-tune an existing LLM, or for data preparation outside of large-scale pre-training datasets.
Stars
266
Forks
17
Language
Python
License
Apache-2.0
Category
Last pushed
Jul 08, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/GAIR-NLP/ProX"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
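The same endpoint can be called from a script. A minimal Python sketch using only the standard library; it mirrors the curl command above, and the JSON decoding step assumes the API returns a JSON body (the response's field names are not documented here):

```python
import json
import urllib.request

# Base path taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL: /api/v1/quality/<category>/<owner>/<repo>."""
    return f"{API_BASE}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str) -> dict:
    """Fetch the endpoint and decode the response.

    Assumes the body is JSON; adjust if the API returns another format.
    """
    with urllib.request.urlopen(quality_url(category, owner, repo)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Reproduces the URL from the curl example above.
    print(quality_url("transformers", "GAIR-NLP", "ProX"))
```

Within the free tier (100 requests/day without a key), `fetch_quality("transformers", "GAIR-NLP", "ProX")` would retrieve this repository's data.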
Higher-rated alternatives
ModelCloud/GPTQModel
LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD...
intel/auto-round
🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality...
pytorch/ao
PyTorch native quantization and sparsity for training and inference
bodaay/HuggingFaceModelDownloader
Simple go utility to download HuggingFace Models and Datasets
NVIDIA/kvpress
LLM KV cache compression made easy