evo-eval/evoeval

EvoEval: Evolving Coding Benchmarks via LLM

Score: 50 / 100 (Established)

This project helps evaluate how well large language models (LLMs) write code. It takes a set of coding problems (benchmarks) and the solutions an LLM generated for them, then reports a pass@1 score: the percentage of problems for which the model's single generated solution passes all test cases. It is designed for AI researchers and engineers who are developing or comparing code-generating LLMs.
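With one generated solution per problem, pass@1 reduces to a simple fraction. A minimal sketch in Python (the boolean results list is a hypothetical input, standing in for whatever harness runs each solution against its test suite):

def pass_at_1(passed: list[bool]) -> float:
    # Fraction of problems whose single generated solution passed all tests.
    if not passed:
        return 0.0
    return sum(passed) / len(passed)

# Example: 3 of 4 problems solved on the first attempt -> 0.75
print(pass_at_1([True, True, False, True]))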

No commits in the last 6 months. Available on PyPI.

Use this if you need to rigorously test the code generation capabilities of various LLMs across different challenge types, from subtle changes to complex, multi-step problems.

Not ideal if you are looking to evaluate an LLM for natural language tasks, creative writing, or any application outside of programming problem-solving.
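Since the package is on PyPI (pip install evoeval), loading one of the evolved problem sets looks roughly like the sketch below. The get_evo_eval helper and the benchmark name follow the project's README, but treat the exact names and the returned structure as assumptions to verify against the repository:

from evoeval.data import get_evo_eval

# "EvoEval_difficult" is one of the evolved problem sets named in the
# project's documentation; other variants exist.
problems = get_evo_eval("EvoEval_difficult")

# Assumed structure (mirroring the EvalPlus convention): a mapping of
# task_id -> problem dict with a natural-language "prompt" field.
for task_id, problem in problems.items():
    print(task_id, problem["prompt"][:60])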

LLM evaluation, code generation, benchmarking, AI model performance, programming challenge, machine learning research
Stale (6 months)
Maintenance: 0 / 25
Adoption: 9 / 25
Maturity: 25 / 25
Community: 16 / 25


Stars: 81
Forks: 13
Language: Python
License: Apache-2.0
Last pushed: Apr 06, 2024
Commits (30d): 0
Dependencies: 8

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/evo-eval/evoeval"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
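If you prefer to call the endpoint from code, a minimal Python sketch using requests (the shape of the JSON payload is not documented here, so the field names are unknown; print the response to inspect it):

import requests

# Fetch the quality-score record for evo-eval/evoeval.
# No API key is needed for up to 100 requests/day.
url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/evo-eval/evoeval"
resp = requests.get(url, timeout=10)
resp.raise_for_status()

data = resp.json()
print(data)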