zafstojano/policy-gradients
A minimal, hackable implementation of policy gradient methods (GRPO, PPO, REINFORCE)
This project provides an easy-to-understand, debuggable implementation of the policy gradient algorithms used in large language model training. It lets machine learning researchers and practitioners experiment with reinforcement learning algorithms such as PPO, GRPO, and REINFORCE: supply a model configuration and training data, and it produces a trained policy model suitable for tasks like teaching an LLM new behaviors or optimizing its responses.
Use this if you need to train a large language model with reinforcement learning but find existing production-grade libraries too complex and difficult to debug, and prefer a simpler, single-GPU setup for educational purposes or initial experimentation.
Not ideal if you require a distributed, production-scale reinforcement learning setup with maximum training throughput, as this project prioritizes simplicity and debuggability over enterprise-level performance.
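To make the "policy gradient" idea concrete, here is a minimal sketch of the REINFORCE score-function estimator on a toy one-step bandit. This is an illustrative example only, not the repo's API: the reward table, learning rate, and softmax policy are all assumptions chosen for brevity.

```python
import numpy as np

# Toy setup (hypothetical, not from the repo): a softmax policy over 3
# actions on a one-step bandit; action 2 is the only rewarded action.
rng = np.random.default_rng(0)
rewards = np.array([0.0, 0.0, 1.0])
logits = np.zeros(3)
lr = 0.5  # learning rate, chosen arbitrarily for this sketch

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(200):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)          # sample an action from the policy
    r = rewards[a]                      # observe its reward
    grad_logp = -probs                  # grad of log pi(a) wrt logits ...
    grad_logp[a] += 1.0                 # ... is one_hot(a) - probs
    logits += lr * r * grad_logp        # REINFORCE: grad log pi(a) * reward

print(softmax(logits))  # probability mass concentrates on action 2
```

PPO and GRPO refine this same estimator (clipped ratios, group-relative baselines), but the gradient-of-log-probability-weighted-by-reward core shown here is the shared starting point.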
Stars
13
Forks
—
Language
Python
License
Apache-2.0
Category
Last pushed
Feb 20, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/zafstojano/policy-gradients"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
hud-evals/hud-python
OSS RL environment + evals toolkit
hiyouga/EasyR1
EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based on veRL
OpenRL-Lab/openrl
Unified Reinforcement Learning Framework
sail-sg/oat
🌾 OAT: A research-friendly framework for LLM online alignment, including reinforcement learning,...
opendilab/awesome-RLHF
A curated list of reinforcement learning with human feedback resources (continually updated)