BigCodeBench and AgentBench
BigCodeBench focuses specifically on evaluating code generation capabilities through programming tasks, while AgentBench evaluates LLMs across diverse agent-based reasoning tasks—making them complementary benchmarks that assess different dimensions of LLM capability (coding vs. agentic behavior) rather than direct competitors.
About BigCodeBench
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
This tool helps AI researchers and developers assess how well large language models (LLMs) can generate code for practical, challenging software development tasks. It takes an LLM's code output for a given task, executes it against the task's test cases to check functional correctness, and reports scores on a public leaderboard. Researchers and developers working on improving the code generation capabilities of LLMs will find this useful.
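The core idea of execution-based evaluation can be shown in a minimal sketch. This is illustrative only, not BigCodeBench's actual harness (which sandboxes execution and aggregates calibrated pass rates over hundreds of tasks); `evaluate_completion`, `candidate`, and `tests` are hypothetical names for this example.

```python
# Illustrative sketch of execution-based code evaluation, the approach
# benchmarks like BigCodeBench use at much larger scale and with sandboxing.
# NOT the real harness: names here are invented for the example.

def evaluate_completion(completion: str, test_code: str) -> bool:
    """Return True if the model's code passes the task's test assertions."""
    namespace = {}
    try:
        exec(completion, namespace)   # load the candidate solution
        exec(test_code, namespace)    # run the hidden assertions against it
        return True
    except Exception:                 # any error or failed assert = task failed
        return False

# A toy task: the model must implement `add`, checked by a hidden assertion.
candidate = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5"
print(evaluate_completion(candidate, tests))  # True
```

Real harnesses run each completion in an isolated subprocess with timeouts, since model-generated code cannot be trusted in the evaluator's own process.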
About AgentBench
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
This project helps developers and researchers evaluate how well large language models (LLMs) can act as autonomous 'agents' in various real-world scenarios. It takes an LLM as input and runs it through a standardized set of tasks, like interacting with an operating system, using a database, or shopping online. The output is a performance score, showing how effectively the LLM completes these multi-step, interactive tasks.
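What distinguishes agent evaluation from single-shot code evaluation is the multi-turn loop: the model acts, the environment responds, and success is judged over the whole episode. Here is a toy sketch of that loop, not AgentBench's actual code; the guessing-game environment and the names `run_episode` and `guessing_agent` are invented for illustration.

```python
# Illustrative sketch of multi-step agent evaluation: the agent acts, the
# environment returns feedback, and the episode is scored on task completion
# within a turn budget. AgentBench does this with real environments (OS
# shells, databases, web shops); this toy uses a number-guessing game.

def guessing_agent():
    """A scripted 'agent' that binary-searches for a hidden number in 0..100."""
    lo, hi = 0, 100
    while lo <= hi:
        guess = (lo + hi) // 2
        feedback = yield guess        # act, then receive environment feedback
        if feedback == "higher":
            lo = guess + 1
        else:
            hi = guess - 1

def run_episode(make_agent, target: int, max_turns: int = 7) -> bool:
    """Return True if the agent completes the task within the turn budget."""
    agent = make_agent()
    guess = next(agent)               # agent's first action
    for _ in range(max_turns):
        if guess == target:
            return True               # task solved
        # Environment observation for the next turn.
        guess = agent.send("higher" if target > guess else "lower")
    return False                      # turn budget exhausted

print(run_episode(guessing_agent, 37))  # True
```

The benchmark's score is then an aggregate over many such episodes; an LLM-backed agent would replace `guessing_agent`, mapping each feedback string to its next action.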