FudanSELab/ClassEval

ClassEval: a benchmark for class-level code generation.

Overall score: 39 / 100 (Emerging)

This benchmark helps researchers and developers evaluate how well large language models (LLMs) generate complete, working Python classes. Given a class skeleton (natural-language descriptions plus method signatures) and accompanying tests, it reports metrics such as Pass@K that quantify the model's code generation accuracy. It is aimed at anyone improving or comparing LLMs for code generation.
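
For reference, Pass@K is commonly computed with the unbiased estimator popularized by the HumanEval paper; ClassEval's repository may differ in details, so treat this as a minimal sketch rather than the benchmark's own code:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k sampled solutions passes,
    # given that c out of n generated solutions pass the unit tests.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 generations per class, 5 of which pass the tests
print(pass_at_k(20, 5, 1))  # 0.25
print(pass_at_k(20, 5, 5))  # ~0.81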

145 stars. No commits in the last 6 months.

Use this if you need a standardized, comprehensive way to measure an LLM's ability to generate complete Python classes with diverse dependencies and complexity levels.

Not ideal if you're evaluating LLMs for single-line code completion or simple function generation rather than full class structures.

Tags: LLM evaluation, code generation, benchmarking, AI model comparison, software engineering research, Python development
Flags: Stale (6 months), No Package, No Dependents
Score breakdown (the four subscores sum to the overall 39 / 100):
Maintenance: 0 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 13 / 25


Stars: 145
Forks: 15
Language: Python
License: MIT
Last pushed: Oct 24, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/FudanSELab/ClassEval"

Open to everyone: 100 requests/day with no API key. A free key raises the limit to 1,000 requests/day.
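
A Python equivalent of the curl call above, assuming the endpoint returns JSON (the response fields are not documented here, so none are hard-coded):

import json
import urllib.request

URL = "https://pt-edge.onrender.com/api/v1/quality/nlp/FudanSELab/ClassEval"

# Fetch the quality data and pretty-print whatever the API returns.
with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)
print(json.dumps(data, indent=2))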