domaineval/DomainEval

DOMAINEVAL is an auto-constructed benchmark for multi-domain code generation. It consists of 2k+ subjects (each with a description, reference code, and tests) covering six domains: Computation, Basic, Network, Cryptography, Visualization, and System.

Overall score: 27 / 100 (Experimental)

This project evaluates how well AI models generate code across diverse programming tasks, from basic computation to network operations, cryptography, and visualization. It takes a code-generation model's output and scores its accuracy against a diverse set of real-world coding problems. Software engineers and researchers developing or using large language models for code generation can use it to benchmark performance.
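For intuition, here is a purely illustrative sketch of this kind of test-based evaluation in Python. The subject description, candidate code, and test cases below are invented for the example and do not reflect DomainEval's actual data format or evaluation harness.

import unittest

# Hypothetical subject (illustrative only): a natural-language description,
# model-generated candidate code, and unit tests that decide pass/fail.
description = "Return the sum of a list of integers."
candidate_code = """
def solution(values):
    return sum(values)
"""

# Execute the candidate in an isolated namespace and pull out the entry point.
namespace = {}
exec(candidate_code, namespace)
solution = namespace["solution"]

class SubjectTests(unittest.TestCase):
    def test_basic(self):
        self.assertEqual(solution([1, 2, 3]), 6)

    def test_empty(self):
        self.assertEqual(solution([]), 0)

# Run the subject's tests against the candidate and report the outcome.
result = unittest.TextTestRunner().run(
    unittest.defaultTestLoader.loadTestsFromTestCase(SubjectTests)
)
print("candidate passed" if result.wasSuccessful() else "candidate failed")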

No commits in the last 6 months.

Use this if you need to rigorously test and compare the code generation capabilities of different AI models across a wide range of programming domains.

Not ideal if you are looking for a tool to generate code directly for your own applications, as this is solely for benchmarking existing models.

code-generation LLM-benchmarking software-engineering AI-model-evaluation programming-task-assessment
No License · Stale (6m) · No Package · No Dependents
Maintenance 0 / 25
Adoption 5 / 25
Maturity 8 / 25
Community 14 / 25


Stars: 14

Forks: 3

Language: Python

License: None

Last pushed: Dec 12, 2024

Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/ai-coding/domaineval/DomainEval"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
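The same data can also be fetched programmatically. A minimal Python sketch, assuming the endpoint returns JSON (the exact response fields are not documented here):

import requests

url = "https://pt-edge.onrender.com/api/v1/quality/ai-coding/domaineval/DomainEval"
response = requests.get(url, timeout=10)
response.raise_for_status()  # surface HTTP errors instead of parsing bad output
print(response.json())       # inspect the returned quality data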