NVlabs/GDPO
Official implementation of GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
When training AI models on complex tasks such as tool use, math problem solving, or code generation, a model often has to learn from several distinct feedback signals ('rewards') at once. GDPO stabilizes this multi-reward training: given an existing model and a set of reward signals, it optimizes the model's behavior to better satisfy all of the reward criteria, yielding a more reliable, higher-performing model.
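To make the idea concrete, here is a minimal, illustrative sketch of decoupled normalization as the description suggests it: each reward signal is z-scored independently across a group of sampled completions before the signals are combined, so a single high-variance reward cannot dominate the advantage. The function name, data layout, and sum aggregation are assumptions for illustration, not the repository's actual implementation.

```python
from statistics import mean, pstdev

def decoupled_advantages(reward_groups):
    """Illustrative sketch of per-reward group normalization (not the official code).

    reward_groups: list of K lists, one per reward signal, each holding the
    G raw rewards for a group of sampled completions.
    Returns one scalar advantage per completion.
    """
    normed = []
    for signal in reward_groups:
        mu = mean(signal)
        sigma = pstdev(signal) or 1e-8  # guard against zero variance
        # z-score this reward signal within the group, independently of others
        normed.append([(r - mu) / sigma for r in signal])
    group_size = len(reward_groups[0])
    # aggregate only after normalization, so every signal carries equal scale
    return [sum(normed[k][g] for k in range(len(normed)))
            for g in range(group_size)]
```

For example, with one small-scale reward `[1, 2, 3]` and one large-scale reward `[100, 200, 300]`, both signals contribute equally to the advantages after normalization, whereas summing raw rewards first would let the large-scale signal dominate.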
Use this if you are training large language models or other AI agents that receive multiple distinct reward signals and you need stable, effective learning across all of them.
Not ideal if your AI model is trained with only a single reward signal, as the benefits of decoupled normalization would not apply.
Stars
413
Forks
24
Language
Python
License
Apache-2.0
Category
Last pushed
Feb 17, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/NVlabs/GDPO"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
hud-evals/hud-python
OSS RL environment + evals toolkit
hiyouga/EasyR1
EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based on veRL
OpenRL-Lab/openrl
Unified Reinforcement Learning Framework
sail-sg/oat
🌾 OAT: A research-friendly framework for LLM online alignment, including reinforcement learning,...
opendilab/awesome-RLHF
A curated list of reinforcement learning with human feedback resources (continually updated)