balrog-ai/BALROG

Benchmarking Agentic LLM and VLM Reasoning On Games

Score: 56/100 (Established)

BALROG helps AI researchers and machine learning engineers evaluate how well advanced AI models, such as large language models (LLMs) and vision-language models (VLMs), perform complex, multi-step tasks in interactive environments like games. You plug in your agentic model and BALROG produces a comprehensive performance report that tracks its reasoning over long-horizon tasks. It is designed for those who develop or research advanced AI agents and need to rigorously test their capabilities.


Use this if you are an AI researcher or machine learning engineer developing agentic LLMs or VLMs and need a standardized way to benchmark their performance on interactive, long-horizon tasks.

Not ideal if you are looking for a tool to train AI models or if your primary interest is in evaluating static, single-turn language tasks.

Tags: AI-agent-development, LLM-benchmarking, VLM-evaluation, reinforcement-learning-environments, AI-research
No published package. No dependents.
Maintenance: 10/25
Adoption: 10/25
Maturity: 16/25
Community: 20/25

How are scores calculated?
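The headline score appears to be the sum of the four category subscores, each out of 25: 10 + 10 + 16 + 20 = 56.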

Stars: 237
Forks: 42
Language: Python
License: MIT
Last pushed: Feb 10, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/rag/balrog-ai/BALROG"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
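
If you prefer to consume the endpoint from Python (the repository's own language) instead of curl, a minimal sketch is below. Only the URL comes from this page; the assumption that the endpoint returns a JSON body is mine, and the actual response schema is not documented here.

import json
import urllib.request

# Quality-report endpoint for balrog-ai/BALROG, as shown above.
URL = "https://pt-edge.onrender.com/api/v1/quality/rag/balrog-ai/BALROG"

# Fetch and pretty-print the report. Assumes a JSON response body;
# no API key is needed within the 100 requests/day limit noted above.
with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

print(json.dumps(data, indent=2))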