JIA-Lab-research/Step-DPO
Implementation for "Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs"
This project helps improve how large language models (LLMs) solve complex, multi-step math problems. It takes an existing LLM and specialized math preference data, then fine-tunes the model to better break down and solve problems step-by-step. Researchers or AI developers working with LLMs in educational technology or scientific computing can use this to create more accurate and reliable reasoning models.
392 stars. No commits in the last 6 months.
Use this if you need an LLM to excel at multi-step reasoning tasks, especially in mathematics, by leveraging preference-based fine-tuning.
Not ideal if your primary goal is general conversational ability rather than detailed, step-wise problem-solving.
Stars: 392
Forks: 16
Language: Python
License: —
Category: —
Last pushed: Jan 19, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/JIA-Lab-research/Step-DPO"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
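The same endpoint can be called from any HTTP client. Below is a minimal Python sketch; the response schema is not documented on this page, so the code makes no assumptions about the JSON fields and simply returns the parsed payload. The helper names (`quality_url`, `fetch_quality`) are illustrative, not part of the API.

```python
# Sketch of calling the repo-quality endpoint shown above with the
# standard library only. Field names in the response are unknown here,
# so we return the raw parsed JSON without interpreting it.
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"


def quality_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for a given GitHub owner/repo pair."""
    return f"{API_BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality record; raises URLError/HTTPError on failure."""
    with urllib.request.urlopen(quality_url(owner, repo), timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Prints the URL for this repo; call fetch_quality() to hit the API.
    print(quality_url("JIA-Lab-research", "Step-DPO"))
```

With a free key, the documented higher rate limit would presumably be passed via a request header, but the header name is not stated here, so it is omitted.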
Higher-rated alternatives
- stair-lab/mlhp: Machine Learning from Human Preferences
- princeton-nlp/SimPO: [NeurIPS 2024] SimPO: Simple Preference Optimization with a Reference-Free Reward
- uclaml/SPPO: The official implementation of Self-Play Preference Optimization (SPPO)
- general-preference/general-preference-model: [ICML 2025] Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment...
- sail-sg/dice: Official implementation of Bootstrapping Language Models via DPO Implicit Rewards