GAIR-NLP/ProX

[ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale

Quality score: 40 / 100 (Emerging)

This project helps machine learning engineers and researchers improve the quality of large language model (LLM) training data. It takes raw, large-scale text datasets and automatically cleans and refines them. The result is a higher-quality pre-training corpus that leads to better-performing LLMs across various tasks, including general domain and specialized areas like mathematics.

266 stars. No commits in the last 6 months.

Use this if you are a machine learning engineer or researcher looking to create more effective large language models by starting with expertly refined, high-quality training data.

Not ideal if you are looking for a tool to fine-tune an existing LLM or for data preparation outside of large-scale pre-training datasets.

Tags: LLM pre-training, data quality, natural language processing, machine learning research, AI model development
Badges: Stale (6m), No Package, No Dependents
Maintenance: 2 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 12 / 25


Stars: 266
Forks: 17
Language: Python
License: Apache-2.0
Last pushed: Jul 08, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/GAIR-NLP/ProX"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
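For programmatic access, the endpoint above can also be called from Python. This is a minimal sketch, not official client code: the helper name is mine, and the assumption that the endpoint returns JSON is not documented on this page.

```python
from urllib.parse import quote

# Base path taken from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"


def quality_url(owner: str, repo: str) -> str:
    """Build the quality-endpoint URL for a given GitHub owner/repo pair."""
    return f"{BASE}/{quote(owner)}/{quote(repo)}"


# Example: the URL for this project.
url = quality_url("GAIR-NLP", "ProX")

# To actually fetch it (response format assumed to be JSON; verify before parsing):
# import json, urllib.request
# with urllib.request.urlopen(url) as resp:
#     data = json.load(resp)
```

Keeping the URL construction separate from the request makes it easy to swap in any HTTP client and to add an API key later once you have one.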