evo-eval/evoeval
EvoEval: Evolving Coding Benchmarks via LLM
This project evaluates how well large language models (LLMs) can write code. It takes a set of coding problems (benchmarks) and the solutions an LLM generated for them, then reports a pass@1 score: the percentage of problems the model solved correctly on the first attempt (a sketch of that computation follows the notes below). It is aimed at AI researchers and engineers who are developing or comparing code-generating LLMs.
No commits in the last 6 months. Available on PyPI.
Use this if you need to rigorously test the code generation capabilities of various LLMs across different challenge types, from subtle changes to complex, multi-step problems.
Not ideal if you are looking to evaluate an LLM for natural language tasks, creative writing, or any application outside of programming problem-solving.
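Pass@1 itself is simple to compute from per-problem results. The sketch below uses the standard unbiased pass@k estimator popularized by HumanEval-style evaluations; the task IDs and counts are illustrative, and the helper names are not EvoEval's actual API.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    where n samples were drawn for a problem and c of them passed."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem results: (samples drawn, samples that passed).
results = {
    "EvoEval/0": (1, 1),
    "EvoEval/1": (1, 0),
    "EvoEval/2": (1, 1),
}

# Benchmark-level pass@1 is the mean of the per-problem pass@1 values.
score = sum(pass_at_k(n, c, k=1) for n, c in results.values()) / len(results)
print(f"pass@1 = {score:.1%}")  # -> pass@1 = 66.7%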
Stars: 81
Forks: 13
Language: Python
License: Apache-2.0
Category:
Last pushed: Apr 06, 2024
Commits (30d): 0
Dependencies: 8
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/evo-eval/evoeval"
Open to everyone: 100 requests/day with no key. A free API key raises the limit to 1,000 requests/day.
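For scripted access, the same endpoint can be queried from Python. This is a minimal sketch that assumes only that the endpoint returns JSON; the exact response fields are not documented on this page.

import json
from urllib.request import urlopen

URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/evo-eval/evoeval"

# Fetch the quality record for evo-eval/evoeval. The response is assumed
# to be JSON; its schema is not specified here, so just pretty-print it.
with urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

print(json.dumps(data, indent=2))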
Related tools
EvolvingLMMs-Lab/lmms-eval
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
vibrantlabsai/ragas
Supercharge Your LLM Application Evaluations 🚀
open-compass/VLMEvalKit
Open-source evaluation toolkit of large multi-modality models (LMMs), support 220+ LMMs, 80+ benchmarks
EuroEval/EuroEval
The robust European language model benchmark.
Giskard-AI/giskard-oss
🐢 Open-Source Evaluation & Testing library for LLM Agents