evo-eval/evoeval
EvoEval: Evolving Coding Benchmarks via LLM
This project evaluates how well large language models (LLMs) can write code. It takes a set of coding problems (benchmarks) and the solutions an LLM generated for them, then reports a pass@1 score: the percentage of problems the model solved correctly on the first attempt (a sketch of that computation follows the notes below). It is aimed at AI researchers and engineers who are developing or comparing code-generating LLMs.
No commits in the last 6 months. Available on PyPI.
Use this if you need to rigorously test the code generation capabilities of various LLMs across different challenge types, from subtle changes to complex, multi-step problems.
Not ideal if you are looking to evaluate an LLM for natural language tasks, creative writing, or any application outside of programming problem-solving.
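Pass@1 itself is simple to compute from per-problem results. The sketch below uses the standard unbiased pass@k estimator popularized by HumanEval-style evaluations; the task IDs and counts are illustrative, and the helper names are not EvoEval's actual API.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    where n samples were drawn for a problem and c of them passed."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem results: (samples drawn, samples that passed).
results = {
    "EvoEval/0": (1, 1),
    "EvoEval/1": (1, 0),
    "EvoEval/2": (1, 1),
}

# Benchmark-level pass@1 is the mean of the per-problem pass@1 values.
score = sum(pass_at_k(n, c, k=1) for n, c in results.values()) / len(results)
print(f"pass@1 = {score:.1%}")  # -> pass@1 = 66.7%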
Stars: 81
Forks: 13
Language: Python
License: Apache-2.0
Category:
Last pushed: Apr 06, 2024
Commits (30d): 0
Dependencies: 8
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/evo-eval/evoeval"
Open to everyone: 100 requests/day with no key. A free API key raises the limit to 1,000 requests/day.
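For scripted access, the same endpoint can be queried from Python. This is a minimal sketch that assumes only that the endpoint returns JSON; the exact response fields are not documented on this page.

import json
from urllib.request import urlopen

URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/evo-eval/evoeval"

# Fetch the quality record for evo-eval/evoeval. The response is assumed
# to be JSON; its schema is not specified here, so just pretty-print it.
with urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

print(json.dumps(data, indent=2))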
Related tools
EvolvingLMMs-Lab/lmms-eval
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
vibrantlabsai/ragas
Supercharge Your LLM Application Evaluations 🚀
open-compass/VLMEvalKit
Open-source evaluation toolkit of large multi-modality models (LMMs), support 220+ LMMs, 80+ benchmarks
EuroEval/EuroEval
The robust European language model benchmark.
Giskard-AI/giskard-oss
🐢 Open-Source Evaluation & Testing library for LLM Agents