bigcode-project/bigcodearena

BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

Overall score: 32 / 100 (Emerging)

This platform helps you evaluate how well different large language models (LLMs) generate code. You input code generation tasks, and the system shows the LLM-generated code alongside its execution results in various environments. As an AI researcher or developer, you can then judge the quality of the generated code, helping to improve the models or assess their performance.

No commits in the last 6 months.

Use this if you need to systematically compare the coding abilities of different LLMs, either through human evaluation with execution insights or fully automatic benchmarking.

Not ideal if you are a general user looking for an LLM coding assistant or want to use LLMs for everyday coding tasks without evaluating their underlying performance.

Tags: LLM evaluation, code generation, AI model benchmarking, developer tools, software engineering
Status: Stale (6m) · No Package · No Dependents
Maintenance: 2 / 25
Adoption: 8 / 25
Maturity: 15 / 25
Community: 7 / 25


Stars: 58
Forks: 3
Language: Python
License: Apache-2.0
Last pushed: Oct 13, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/ai-coding/bigcode-project/bigcodearena"

Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000 requests/day.
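For programmatic access, here is a minimal Python sketch of calling the endpoint above. The JSON field names used below (score, maintenance) are assumptions, since the response schema is not documented on this page:

import json
import urllib.request

# Same endpoint as the curl example; no API key needed for up to 100 requests/day.
URL = "https://pt-edge.onrender.com/api/v1/quality/ai-coding/bigcode-project/bigcodearena"

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

# These field names are guesses at the response shape, not a documented schema.
print(data.get("score"))        # overall score, e.g. 32
print(data.get("maintenance"))  # sub-score, e.g. 2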