bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
BigCodeBench helps AI researchers and developers assess how well large language models (LLMs) can generate code for practical, challenging software development tasks. Given an LLM's code output for a task, it evaluates correctness and efficiency and ranks models on a leaderboard.
484 stars. Available on PyPI.
Use this if you are developing or evaluating large language models and need a standardized, rigorous way to benchmark their ability to write functional code from complex instructions.
Not ideal if you are looking for a tool to help you write code directly or integrate code generation into an application, as this is purely for benchmarking LLMs.
Stars: 484
Forks: 64
Language: Python
License: Apache-2.0
Category:
Last pushed: Jan 03, 2026
Commits (30d): 0
Dependencies: 22
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/bigcode-project/bigcodebench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
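The same endpoint can be queried from Python instead of curl. A minimal sketch, assuming the endpoint returns a JSON body (the response field names are not documented here, so none are assumed):

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def tool_url(owner: str, repo: str) -> str:
    """Build the per-repository endpoint URL shown in the curl example."""
    return f"{API_BASE}/{owner}/{repo}"


def fetch_tool(owner: str, repo: str) -> dict:
    """GET the endpoint and parse the JSON response body."""
    with urllib.request.urlopen(tool_url(owner, repo)) as resp:
        return json.load(resp)


# Example (performs a network request; subject to the 100 requests/day limit):
# data = fetch_tool("bigcode-project", "bigcodebench")
```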
Related tools
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
swefficiency/swefficiency
Benchmark harness and code for "SWE-fficiency: Can Language Models Optimize Real World...