balrog-ai/BALROG

Benchmarking Agentic LLM and VLM Reasoning On Games

Score: 56/100 (Established)

BALROG helps AI researchers and machine learning engineers evaluate how well advanced AI models, such as large language models (LLMs) and vision-language models (VLMs), perform complex, multi-step tasks in interactive environments like games. You plug in your agentic model and BALROG produces a comprehensive performance report that tracks its reasoning over long-horizon tasks. It is designed for those who develop or research advanced AI agents and need to rigorously test their capabilities.


Use this if you are an AI researcher or machine learning engineer developing agentic LLMs or VLMs and need a standardized way to benchmark their performance on interactive, long-horizon tasks.

Not ideal if you are looking for a tool to train AI models or if your primary interest is in evaluating static, single-turn language tasks.

Tags: AI-agent-development, LLM-benchmarking, VLM-evaluation, reinforcement-learning-environments, AI-research
No published package. No dependents.
Maintenance: 10/25
Adoption: 10/25
Maturity: 16/25
Community: 20/25

How are scores calculated?
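The headline score appears to be the sum of the four category subscores, each out of 25: 10 + 10 + 16 + 20 = 56.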

Stars: 237
Forks: 42
Language: Python
License: MIT
Last pushed: Feb 10, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/rag/balrog-ai/BALROG"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
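
If you prefer to consume the endpoint from Python (the repository's own language) instead of curl, a minimal sketch is below. Only the URL comes from this page; the assumption that the endpoint returns a JSON body is mine, and the actual response schema is not documented here.

import json
import urllib.request

# Quality-report endpoint for balrog-ai/BALROG, as shown above.
URL = "https://pt-edge.onrender.com/api/v1/quality/rag/balrog-ai/BALROG"

# Fetch and pretty-print the report. Assumes a JSON response body;
# no API key is needed within the 100 requests/day limit noted above.
with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

print(json.dumps(data, indent=2))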