bigcode-project/bigcodearena
BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution
This platform helps you evaluate how well different large language models (LLMs) generate code. You submit code generation tasks, and the system displays the LLM-generated code alongside its execution results in various environments. As an AI researcher or developer, you can then judge the quality of the generated code, producing preference data that helps improve LLMs or assess their coding performance.
No commits in the last 6 months.
Use this if you need to systematically compare the coding abilities of different LLMs, either through human evaluation with execution insights or fully automatic benchmarking.
Not ideal if you are a general user looking for an LLM coding assistant or want to use LLMs for everyday coding tasks without evaluating their underlying performance.
Stars: 58
Forks: 3
Language: Python
License: Apache-2.0
Category:
Last pushed: Oct 13, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ai-coding/bigcode-project/bigcodearena"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
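For scripted access, here is a minimal Python sketch using only the standard library. It assumes the endpoint above returns a JSON body; the response schema is not documented on this page, so it simply pretty-prints whatever comes back.

import json
import urllib.request

# Endpoint from the curl example above; no key needed up to 100 requests/day.
URL = "https://pt-edge.onrender.com/api/v1/quality/ai-coding/bigcode-project/bigcodearena"

with urllib.request.urlopen(URL) as resp:
    # Assumes a JSON response; the exact fields are not documented here.
    data = json.load(resp)

print(json.dumps(data, indent=2))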
Higher-rated alternatives
THU-WingTecher/LSPRAG
Real-time multi-language unit test generation tool via LSP
metareflection/dafny-replay
Verified kernels, written in Dafny and compiled to JavaScript, for correct-by-construction state...
santinic/unvibe
Generate correct code from unit-tests
adilanwar2399/ESBMC-ibmc
The ESBMC ibmc (Invariant Based Model Checking) Tool.
mpuodziukas-labs/cobol-demo
COBOL modernization: LLMs introduce bugs, humans validate. Production-grade analysis tooling.