THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
This project helps developers and researchers evaluate how well large language models (LLMs) can act as autonomous "agents" in realistic scenarios. It takes an LLM as input and runs it through a standardized set of interactive tasks, such as operating an OS shell, querying a database, or shopping on the web. The output is a performance score showing how effectively the LLM completes these multi-step, interactive tasks.
Use this if you are a developer or researcher who needs to rigorously test and compare the capabilities of different LLMs as decision-making agents in diverse environments.
Not ideal if you are an end-user simply looking to use an LLM for specific applications without needing to evaluate its underlying agentic performance.
Stars: 3,234
Forks: 241
Language: Python
License: Apache-2.0
Last pushed: Feb 08, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/THUDM/AgentBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
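The same endpoint can be called from any HTTP client. A minimal Python sketch that builds the per-repo URL shown in the curl example above (the response's JSON field names are not documented here, so parsing is left to the caller):

```python
from urllib.parse import quote

# Base endpoint from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"

def tool_endpoint(owner: str, repo: str) -> str:
    """Build the per-repository API URL, URL-encoding each path segment."""
    return f"{BASE}/{quote(owner)}/{quote(repo)}"

url = tool_endpoint("THUDM", "AgentBench")
# Fetch with any HTTP client, e.g. the standard library:
#   import json, urllib.request
#   data = json.loads(urllib.request.urlopen(url).read())
```

Unauthenticated calls are rate-limited to 100 requests/day; a free key raises that to 1,000/day.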
Related tools
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
swefficiency/swefficiency
Benchmark harness and code for "SWE-fficiency: Can Language Models Optimize Real World...