FudanSELab/ClassEval

ClassEval: a benchmark for class-level code generation.


Emerging

This benchmark helps researchers and developers evaluate how well large language models (LLMs) generate complete, working Python classes. Given a class skeleton (descriptions and method signatures) and a set of tests, it reports metrics such as Pass@K that show how often the generated code is correct. It is aimed at anyone improving or comparing LLMs for code generation.
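To make the Pass@K metric concrete, here is a minimal sketch of the widely used unbiased pass@k estimator: generate n samples per task, count the c samples that pass every test, and estimate the chance that at least one of k randomly drawn samples is correct. This is an assumption about how such a metric is typically computed, not a copy of ClassEval's own evaluation code; the function names `pass_at_k` and `mean_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task.

    n: number of generated samples for the task
    c: number of those samples that passed all tests
    k: number of samples a user is assumed to draw
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples failed, every draw of k contains a passing one.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, where each entry is (n, c) for one task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts for three tasks, 10 samples each (illustrative only).
    per_task = [(10, 3), (10, 0), (10, 10)]
    print(mean_pass_at_k(per_task, k=1))
    print(mean_pass_at_k(per_task, k=5))
```

With k = 1 the estimate reduces to the plain fraction of passing samples (c / n), which is a quick sanity check when wiring this into an evaluation script.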

145 stars. No commits in the last 6 months.

Use this if you need a standardized, comprehensive way to measure an LLM's ability to generate production-ready Python classes with diverse dependencies and complexities.

Not ideal if you're evaluating LLMs for single-line code completion or simple function generation rather than full class structures.

LLM evaluation, code generation benchmarking, AI model comparison, software engineering research, Python development

Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 13 / 25


Stars: 145

Language: Python

License: MIT

Related tools

microsoft/NeMoEval

A Benchmark Tool for Natural Language-based Network Management

apartresearch/specificityplus

👩‍💻 Code for the ACL paper "Detecting Edit Failures in LLMs: An Improved Specificity Benchmark"

claws-lab/XLingEval

Code and Resources for the paper, "Better to Ask in English: Cross-Lingual Evaluation of Large...

HICAI-ZJU/SciKnowEval

SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models

nicolay-r/RuSentRel-Leaderboard

This is an official Leaderboard for the RuSentRel-1.1 dataset originally described in paper...
