bigcode-project/bigcodearena
BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution
This platform helps you evaluate how well different large language models (LLMs) generate code. You submit code generation tasks, and the system displays the LLM-generated code alongside its execution results in various environments. As an AI researcher or developer, you can then judge the quality of the generated code, producing preference data that helps improve LLMs or assess their coding performance.
No commits in the last 6 months.
Use this if you need to systematically compare the coding abilities of different LLMs, either through human evaluation with execution insights or fully automatic benchmarking.
Not ideal if you are a general user looking for an LLM coding assistant or want to use LLMs for everyday coding tasks without evaluating their underlying performance.
Stars: 58
Forks: 3
Language: Python
License: Apache-2.0
Category:
Last pushed: Oct 13, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ai-coding/bigcode-project/bigcodearena"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
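For scripted access, here is a minimal Python sketch using only the standard library. It assumes the endpoint above returns a JSON body; the response schema is not documented on this page, so it simply pretty-prints whatever comes back.

import json
import urllib.request

# Endpoint from the curl example above; no key needed up to 100 requests/day.
URL = "https://pt-edge.onrender.com/api/v1/quality/ai-coding/bigcode-project/bigcodearena"

with urllib.request.urlopen(URL) as resp:
    # Assumes a JSON response; the exact fields are not documented here.
    data = json.load(resp)

print(json.dumps(data, indent=2))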
Higher-rated alternatives
THU-WingTecher/LSPRAG
Real-time multi-language unit test generation tool via LSP
metareflection/dafny-replay
Verified kernels, written in Dafny and compiled to JavaScript, for correct-by-construction state...
santinic/unvibe
Generate correct code from unit-tests
adilanwar2399/ESBMC-ibmc
The ESBMC ibmc (Invariant Based Model Checking) Tool.
mpuodziukas-labs/cobol-demo
COBOL modernization: LLMs introduce bugs, humans validate. Production-grade analysis tooling.