NVlabs/GDPO
Official implementation of GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
When training AI models on complex tasks such as tool use, math problem solving, or code generation, a model often has to learn from several distinct feedback signals ('rewards') at once. GDPO stabilizes this multi-reward training: given an existing model and a set of reward signals, it optimizes the model's behavior to better satisfy all of the reward criteria, yielding a more reliable, higher-performing model.
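To make the idea concrete, here is a minimal, illustrative sketch of decoupled normalization as the description suggests it: each reward signal is z-scored independently across a group of sampled completions before the signals are combined, so a single high-variance reward cannot dominate the advantage. The function name, data layout, and sum aggregation are assumptions for illustration, not the repository's actual implementation.

```python
from statistics import mean, pstdev

def decoupled_advantages(reward_groups):
    """Illustrative sketch of per-reward group normalization (not the official code).

    reward_groups: list of K lists, one per reward signal, each holding the
    G raw rewards for a group of sampled completions.
    Returns one scalar advantage per completion.
    """
    normed = []
    for signal in reward_groups:
        mu = mean(signal)
        sigma = pstdev(signal) or 1e-8  # guard against zero variance
        # z-score this reward signal within the group, independently of others
        normed.append([(r - mu) / sigma for r in signal])
    group_size = len(reward_groups[0])
    # aggregate only after normalization, so every signal carries equal scale
    return [sum(normed[k][g] for k in range(len(normed)))
            for g in range(group_size)]
```

For example, with one small-scale reward `[1, 2, 3]` and one large-scale reward `[100, 200, 300]`, both signals contribute equally to the advantages after normalization, whereas summing raw rewards first would let the large-scale signal dominate.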
Use this if you are training large language models or other AI agents that receive multiple distinct reward signals and you need stable, effective learning across all of them.
Not ideal if your AI model is trained with only a single reward signal, as the benefits of decoupled normalization would not apply.
Stars
413
Forks
24
Language
Python
License
Apache-2.0
Category
Last pushed
Feb 17, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/NVlabs/GDPO"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
hud-evals/hud-python
OSS RL environment + evals toolkit
hiyouga/EasyR1
EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based on veRL
OpenRL-Lab/openrl
Unified Reinforcement Learning Framework
sail-sg/oat
🌾 OAT: A research-friendly framework for LLM online alignment, including reinforcement learning,...
opendilab/awesome-RLHF
A curated list of reinforcement learning with human feedback resources (continually updated)