bigcode-project/bigcodebench

[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI

61 / 100 (Established)

This tool helps AI researchers and developers assess how well large language models (LLMs) can generate code for practical, challenging software development tasks. It takes an LLM's code output for a given task, checks its functional correctness against rigorous test suites, and reports scores that rank models on a public leaderboard. Researchers and developers working on improving the code generation capabilities of LLMs would find this useful.

484 stars. Available on PyPI.
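
Since the package is on PyPI, a typical run looks roughly like this (a sketch, assuming the CLI flags documented in the project README; the model ID is only a placeholder, so check the README for the current interface):

pip install bigcodebench --upgrade

bigcodebench.evaluate --model meta-llama/Meta-Llama-3.1-8B-Instruct --split complete --subset hard --backend vllm

Per the project docs, bigcodebench.evaluate handles both generation with the chosen backend and test-based scoring in one step.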

Use this if you are developing or evaluating large language models and need a standardized, rigorous way to benchmark their ability to write functional code from complex instructions.

Not ideal if you are looking for a tool to help you write code directly or integrate code generation into an application, as this is purely for benchmarking LLMs.

AI-research, LLM-development, code-generation, model-benchmarking, software-engineering-AI
Maintenance 6 / 25
Adoption 10 / 25
Maturity 25 / 25
Community 20 / 25
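
The four component scores sum to the headline score: 6 + 10 + 25 + 20 = 61 out of 100.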

Stars: 484
Forks: 64
Language: Python
License: Apache-2.0
Last pushed: Jan 03, 2026
Commits (30d): 0
Dependencies: 22

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/bigcode-project/bigcodebench"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
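
For a quick look at the response, pipe it through jq, which simply pretty-prints the JSON (no assumptions made here about the response schema):

curl -s "https://pt-edge.onrender.com/api/v1/quality/llm-tools/bigcode-project/bigcodebench" | jq .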