BigCodeBench and AgentBench
BigCodeBench focuses specifically on evaluating code generation capabilities through programming tasks, while AgentBench evaluates LLMs across diverse agent-based reasoning tasks—making them complementary benchmarks that assess different dimensions of LLM capability (coding vs. agentic behavior) rather than direct competitors.
About BigCodeBench
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
This tool helps AI researchers and developers assess how well large language models (LLMs) can generate code for practical, challenging software development tasks. It takes an LLM's code output for a given task, executes it against the task's test cases to check functional correctness, and reports scores on a public leaderboard. Researchers and developers working on improving the code generation capabilities of LLMs will find this useful.
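The core idea of execution-based evaluation can be shown in a minimal sketch. This is illustrative only, not BigCodeBench's actual harness (which sandboxes execution and aggregates calibrated pass rates over hundreds of tasks); `evaluate_completion`, `candidate`, and `tests` are hypothetical names for this example.

```python
# Illustrative sketch of execution-based code evaluation, the approach
# benchmarks like BigCodeBench use at much larger scale and with sandboxing.
# NOT the real harness: names here are invented for the example.

def evaluate_completion(completion: str, test_code: str) -> bool:
    """Return True if the model's code passes the task's test assertions."""
    namespace = {}
    try:
        exec(completion, namespace)   # load the candidate solution
        exec(test_code, namespace)    # run the hidden assertions against it
        return True
    except Exception:                 # any error or failed assert = task failed
        return False

# A toy task: the model must implement `add`, checked by a hidden assertion.
candidate = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5"
print(evaluate_completion(candidate, tests))  # True
```

Real harnesses run each completion in an isolated subprocess with timeouts, since model-generated code cannot be trusted in the evaluator's own process.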
About AgentBench
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
This project helps developers and researchers evaluate how well large language models (LLMs) can act as autonomous 'agents' in various real-world scenarios. It takes an LLM as input and runs it through a standardized set of tasks, like interacting with an operating system, using a database, or shopping online. The output is a performance score, showing how effectively the LLM completes these multi-step, interactive tasks.
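What distinguishes agent evaluation from single-shot code evaluation is the multi-turn loop: the model acts, the environment responds, and success is judged over the whole episode. Here is a toy sketch of that loop, not AgentBench's actual code; the guessing-game environment and the names `run_episode` and `guessing_agent` are invented for illustration.

```python
# Illustrative sketch of multi-step agent evaluation: the agent acts, the
# environment returns feedback, and the episode is scored on task completion
# within a turn budget. AgentBench does this with real environments (OS
# shells, databases, web shops); this toy uses a number-guessing game.

def guessing_agent():
    """A scripted 'agent' that binary-searches for a hidden number in 0..100."""
    lo, hi = 0, 100
    while lo <= hi:
        guess = (lo + hi) // 2
        feedback = yield guess        # act, then receive environment feedback
        if feedback == "higher":
            lo = guess + 1
        else:
            hi = guess - 1

def run_episode(make_agent, target: int, max_turns: int = 7) -> bool:
    """Return True if the agent completes the task within the turn budget."""
    agent = make_agent()
    guess = next(agent)               # agent's first action
    for _ in range(max_turns):
        if guess == target:
            return True               # task solved
        # Environment observation for the next turn.
        guess = agent.send("higher" if target > guess else "lower")
    return False                      # turn budget exhausted

print(run_episode(guessing_agent, 37))  # True
```

The benchmark's score is then an aggregate over many such episodes; an LLM-backed agent would replace `guessing_agent`, mapping each feedback string to its next action.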