balrog-ai/BALROG
Benchmarking Agentic LLM and VLM Reasoning On Games
BALROG helps AI researchers and machine learning engineers evaluate how well large language models (LLMs) and vision-language models (VLMs) perform complex, multi-step tasks in interactive environments such as games. You supply an agentic model, and BALROG produces a benchmark of its performance, showing how well its reasoning holds up over long interactions. It is designed for people who develop or research AI agents and need to test their capabilities rigorously.
Use this if you are an AI researcher or machine learning engineer developing agentic LLMs or VLMs and need a standardized way to benchmark their performance on interactive, long-horizon tasks.
Not ideal if you are looking for a tool to train AI models or if your primary interest is in evaluating static, single-turn language tasks.
Stars: 237
Forks: 42
Language: Python
License: MIT
Category: (not listed)
Last pushed: Feb 10, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/balrog-ai/BALROG"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
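The same request can be made from Python, the project's own language. The following is a minimal sketch using only the standard library. It assumes the endpoint returns JSON (the listing does not document the response format) and that other repositories follow the same `owner/repo` URL pattern; `build_url` and `fetch_quality` are hypothetical helper names, not part of any published client.

```python
import json
import urllib.request

# Base path copied from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/rag"


def build_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository.

    Assumption: other repositories use the same owner/repo pattern
    as the balrog-ai/BALROG example.
    """
    return f"{BASE_URL}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0) -> dict:
    """Fetch the data for one repository without an API key.

    Assumption: the response body is JSON; adjust the parsing if the
    service returns something else.
    """
    request = urllib.request.Request(
        build_url(owner, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (performs a network request, counts toward the daily limit):
# data = fetch_quality("balrog-ai", "BALROG")
```

Keeping the URL construction separate from the network call makes it easy to check the request target without spending any of the 100 keyless requests per day.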