JIA-Lab-research/Step-DPO
Implementation for "Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs"
This project helps improve how large language models (LLMs) solve complex, multi-step math problems. It takes an existing LLM and specialized math preference data, then fine-tunes the model to better break down and solve problems step-by-step. Researchers or AI developers working with LLMs in educational technology or scientific computing can use this to create more accurate and reliable reasoning models.
392 stars. No commits in the last 6 months.
Use this if you need an LLM to excel at multi-step reasoning tasks, especially in mathematics, by leveraging preference-based fine-tuning.
Not ideal if your primary goal is general conversational ability rather than detailed, step-wise problem-solving.
Stars: 392
Forks: 16
Language: Python
License: —
Category: —
Last pushed: Jan 19, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/JIA-Lab-research/Step-DPO"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
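The same endpoint can be called from any HTTP client. Below is a minimal Python sketch; the response schema is not documented on this page, so the code makes no assumptions about the JSON fields and simply returns the parsed payload. The helper names (`quality_url`, `fetch_quality`) are illustrative, not part of the API.

```python
# Sketch of calling the repo-quality endpoint shown above with the
# standard library only. Field names in the response are unknown here,
# so we return the raw parsed JSON without interpreting it.
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"


def quality_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for a given GitHub owner/repo pair."""
    return f"{API_BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality record; raises URLError/HTTPError on failure."""
    with urllib.request.urlopen(quality_url(owner, repo), timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Prints the URL for this repo; call fetch_quality() to hit the API.
    print(quality_url("JIA-Lab-research", "Step-DPO"))
```

With a free key, the documented higher rate limit would presumably be passed via a request header, but the header name is not stated here, so it is omitted.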
Higher-rated alternatives
- stair-lab/mlhp: Machine Learning from Human Preferences
- princeton-nlp/SimPO: [NeurIPS 2024] SimPO: Simple Preference Optimization with a Reference-Free Reward
- uclaml/SPPO: The official implementation of Self-Play Preference Optimization (SPPO)
- general-preference/general-preference-model: [ICML 2025] Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment...
- sail-sg/dice: Official implementation of Bootstrapping Language Models via DPO Implicit Rewards