NVlabs/GDPO

Official implementation of GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Quality score: 46 / 100 (Emerging)

When training AI models to perform complex tasks like using tools, solving math problems, or generating code, a model often needs to learn from several distinct feedback signals, or 'rewards', at once. GDPO stabilizes this multi-reward training process: it takes an existing model and a set of reward signals, then optimizes the model's behavior to better satisfy all of the reward criteria, yielding a more reliable, higher-performing model.
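The name suggests the core idea: instead of summing the rewards and normalizing the total, each reward channel is normalized separately within a group of rollouts before the signals are combined. A minimal sketch of that idea in Python follows; the function name, the z-score normalization, and the final sum are illustrative assumptions, not the repository's actual implementation.

import numpy as np

def decoupled_group_advantages(rewards: np.ndarray) -> np.ndarray:
    # rewards: shape (group_size, num_rewards), one row per sampled
    # completion for the same prompt, one column per reward signal.
    mean = rewards.mean(axis=0, keepdims=True)
    std = rewards.std(axis=0, keepdims=True) + 1e-6  # guard against zero variance
    normalized = (rewards - mean) / std              # per-reward z-scores within the group
    return normalized.sum(axis=1)                    # one combined advantage per rollout

Normalizing the summed reward instead would let the highest-variance signal dominate the update; per-channel normalization keeps every reward on a comparable scale, which is the stability benefit described above.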


Use this if you are training large language models or other AI agents that receive multiple, distinct reward signals and you need to ensure stable and effective learning across all of them.

Not ideal if your AI model is trained with only a single reward signal, as the benefits of decoupled normalization would not apply.

large-language-model-training reinforcement-learning-from-human-feedback multi-objective-optimization AI-agent-development
No package published · No dependents
Maintenance: 10 / 25
Adoption: 10 / 25
Maturity: 13 / 25
Community: 13 / 25


Stars: 413
Forks: 24
Language: Python
License: Apache-2.0
Last pushed: Feb 17, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/NVlabs/GDPO"

Open to everyone: 100 requests/day with no key needed; a free key raises the limit to 1,000/day.
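For scripted use, the same request works from Python. This sketch assumes the endpoint returns JSON; the response schema is not documented here, so inspect the payload before relying on specific fields.

import requests  # third-party: pip install requests

resp = requests.get(
    "https://pt-edge.onrender.com/api/v1/quality/llm-tools/NVlabs/GDPO",
    timeout=10,
)
resp.raise_for_status()  # surface HTTP errors (e.g. rate limiting)
data = resp.json()       # parsed response body
print(data)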