zafstojano/policy-gradients

A minimal hackable implementation of policy gradient methods (GRPO, PPO, REINFORCE)

Score: 28 / 100 · Experimental

This project provides an easy-to-understand, debuggable implementation of the policy gradient algorithms used in large language model training. It lets machine learning researchers and practitioners experiment with reinforcement learning algorithms such as PPO, GRPO, and REINFORCE: you supply model configurations and training data, and get back a trained policy model suitable for tasks like teaching an LLM new behaviors or optimizing its responses.

Use this if you need to train a large language model with reinforcement learning but find existing production-grade libraries too complex to debug, and prefer a simpler single-GPU setup for education or initial experimentation.

Not ideal if you require a distributed, production-scale reinforcement learning setup with maximum training speed and efficiency, as this project prioritizes simplicity and debuggability over enterprise-level performance.
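To make the comparison between these algorithms concrete, here is a minimal sketch of the score-function update that REINFORCE (the simplest of the three methods) is built on, shown on a toy two-armed bandit. This is illustrative only and is not this repository's API; all names (`logits`, `baseline`, etc.) are hypothetical.

```python
import numpy as np

# Hypothetical REINFORCE sketch (not this repo's code): a softmax policy
# over two bandit arms, updated with the score-function gradient
#   grad J = E[(R - b) * grad log pi(a)],
# where b is a running-mean baseline that reduces variance.

rng = np.random.default_rng(0)
logits = np.zeros(2)   # policy parameters
lr = 0.5               # learning rate
baseline = 0.0         # running-mean reward baseline

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(200):
    probs = softmax(logits)
    a = rng.choice(2, p=probs)        # sample an action from the policy
    r = 1.0 if a == 1 else 0.0        # arm 1 pays reward 1, arm 0 pays 0
    baseline += 0.1 * (r - baseline)  # update the baseline
    grad_logp = -probs
    grad_logp[a] += 1.0               # grad of log pi(a) for a softmax policy
    logits += lr * (r - baseline) * grad_logp  # REINFORCE ascent step

print(softmax(logits)[1])  # probability of the rewarded arm after training
```

PPO and GRPO refine this same gradient with clipped importance ratios and group-relative advantages respectively, but the sampled log-probability update above is the common core.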

large-language-models reinforcement-learning llm-fine-tuning policy-optimization machine-learning-research
No Package · No Dependents

Maintenance: 10 / 25
Adoption: 5 / 25
Maturity: 13 / 25
Community: 0 / 25


Stars: 13
Forks:
Language: Python
License: Apache-2.0
Last pushed: Feb 20, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/zafstojano/policy-gradients"

Open to everyone: 100 requests/day with no API key. Get a free key for 1,000 requests/day.