THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
This project helps developers and researchers evaluate how well large language models (LLMs) can act as autonomous "agents" in realistic scenarios. It takes an LLM as input and runs it through a standardized set of interactive tasks, such as operating an OS shell, querying a database, or shopping on the web. The output is a performance score showing how effectively the LLM completes these multi-step, interactive tasks.
Use this if you are a developer or researcher who needs to rigorously test and compare the capabilities of different LLMs as decision-making agents in diverse environments.
Not ideal if you are an end-user simply looking to use an LLM for specific applications without needing to evaluate its underlying agentic performance.
Stars: 3,234
Forks: 241
Language: Python
License: Apache-2.0
Last pushed: Feb 08, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/THUDM/AgentBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
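The same endpoint can be called from any HTTP client. A minimal Python sketch that builds the per-repo URL shown in the curl example above (the response's JSON field names are not documented here, so parsing is left to the caller):

```python
from urllib.parse import quote

# Base endpoint from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"

def tool_endpoint(owner: str, repo: str) -> str:
    """Build the per-repository API URL, URL-encoding each path segment."""
    return f"{BASE}/{quote(owner)}/{quote(repo)}"

url = tool_endpoint("THUDM", "AgentBench")
# Fetch with any HTTP client, e.g. the standard library:
#   import json, urllib.request
#   data = json.loads(urllib.request.urlopen(url).read())
```

Unauthenticated calls are rate-limited to 100 requests/day; a free key raises that to 1,000/day.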
Related tools
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
swefficiency/swefficiency
Benchmark harness and code for "SWE-fficiency: Can Language Models Optimize Real World...