zafstojano/policy-gradients
A minimal, hackable implementation of policy gradient methods (GRPO, PPO, REINFORCE)
This project provides an easy-to-understand, debuggable implementation of the policy gradient algorithms used in large language model training. It lets machine learning researchers and practitioners experiment with reinforcement learning algorithms such as PPO, GRPO, and REINFORCE: supply a model configuration and training data, and it produces a trained policy model suitable for tasks like teaching an LLM new behaviors or optimizing its responses.
Use this if you need to train a large language model with reinforcement learning but find existing production-grade libraries too complex and difficult to debug, and prefer a simpler, single-GPU setup for educational purposes or initial experimentation.
Not ideal if you require a distributed, production-scale reinforcement learning setup with maximum training throughput, as this project prioritizes simplicity and debuggability over enterprise-level performance.
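To make the "policy gradient" idea concrete, here is a minimal sketch of the REINFORCE score-function estimator on a toy one-step bandit. This is an illustrative example only, not the repo's API: the reward table, learning rate, and softmax policy are all assumptions chosen for brevity.

```python
import numpy as np

# Toy setup (hypothetical, not from the repo): a softmax policy over 3
# actions on a one-step bandit; action 2 is the only rewarded action.
rng = np.random.default_rng(0)
rewards = np.array([0.0, 0.0, 1.0])
logits = np.zeros(3)
lr = 0.5  # learning rate, chosen arbitrarily for this sketch

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(200):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)          # sample an action from the policy
    r = rewards[a]                      # observe its reward
    grad_logp = -probs                  # grad of log pi(a) wrt logits ...
    grad_logp[a] += 1.0                 # ... is one_hot(a) - probs
    logits += lr * r * grad_logp        # REINFORCE: grad log pi(a) * reward

print(softmax(logits))  # probability mass concentrates on action 2
```

PPO and GRPO refine this same estimator (clipped ratios, group-relative baselines), but the gradient-of-log-probability-weighted-by-reward core shown here is the shared starting point.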
Stars
13
Forks
—
Language
Python
License
Apache-2.0
Category
Last pushed
Feb 20, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/zafstojano/policy-gradients"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
hud-evals/hud-python
OSS RL environment + evals toolkit
hiyouga/EasyR1
EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based on veRL
OpenRL-Lab/openrl
Unified Reinforcement Learning Framework
sail-sg/oat
🌾 OAT: A research-friendly framework for LLM online alignment, including reinforcement learning,...
opendilab/awesome-RLHF
A curated list of reinforcement learning with human feedback resources (continually updated)