seal-research/OmniCode
OmniCode: A Diverse Software Engineering Benchmark for Evaluating Large Language Models
This project provides a standardized way to measure how well AI models, specifically Large Language Models (LLMs), perform at various software development tasks. Given a codebase and a problem description, it evaluates the LLM's ability to fix bugs, generate tests, apply style guidelines, or respond to code review feedback. Software engineering researchers and developers building AI-powered coding assistants can use it to benchmark and compare their models.
Use this if you are a researcher or developer who needs to rigorously evaluate the software engineering capabilities of a Large Language Model across different coding challenges.
Not ideal if you are looking for an AI assistant to help you write code or automate development tasks directly; this is purely a benchmarking tool.
Stars: 13
Forks: —
Language: Python
License: —
Category: —
Last pushed: Mar 16, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ai-coding/seal-research/OmniCode"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
k4black/codebleu
Pip compatible CodeBLEU metric implementation available for linux/macos/win
LiveCodeBench/LiveCodeBench
Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of...
EdinburghNLP/code-docstring-corpus
Preprocessed Python functions and docstrings for automated code documentation (code2doc) and...
hendrycks/apps
APPS: Automated Programming Progress Standard (NeurIPS 2021)
solis-team/Hydra
[FSE 2026] Do Not Treat Code as Natural Language: Implications for Repository-Level Code...