bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
BigCodeBench helps AI researchers and developers assess how well large language models (LLMs) can generate code for practical, challenging software development tasks. Given an LLM's code output for a task, it evaluates correctness and efficiency and ranks models on a leaderboard.
484 stars. Available on PyPI.
Use this if you are developing or evaluating large language models and need a standardized, rigorous way to benchmark their ability to write functional code from complex instructions.
Not ideal if you are looking for a tool to help you write code directly or integrate code generation into an application, as this is purely for benchmarking LLMs.
Stars: 484
Forks: 64
Language: Python
License: Apache-2.0
Category:
Last pushed: Jan 03, 2026
Commits (30d): 0
Dependencies: 22
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/bigcode-project/bigcodebench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
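The same endpoint can be queried from Python instead of curl. A minimal sketch, assuming the endpoint returns a JSON body (the response field names are not documented here, so none are assumed):

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def tool_url(owner: str, repo: str) -> str:
    """Build the per-repository endpoint URL shown in the curl example."""
    return f"{API_BASE}/{owner}/{repo}"


def fetch_tool(owner: str, repo: str) -> dict:
    """GET the endpoint and parse the JSON response body."""
    with urllib.request.urlopen(tool_url(owner, repo)) as resp:
        return json.load(resp)


# Example (performs a network request; subject to the 100 requests/day limit):
# data = fetch_tool("bigcode-project", "bigcodebench")
```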
Related tools
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
swefficiency/swefficiency
Benchmark harness and code for "SWE-fficiency: Can Language Models Optimize Real World...