logic-star-ai/swt-bench

[NeurIPS 2024] Evaluation harness for SWT-Bench, a benchmark for evaluating LLMs on repository-level test generation

54 / 100
Established

This tool helps developers and researchers evaluate how well large language models can automatically generate tests for software issues reported on GitHub. You provide a codebase and an issue description, and the LLM attempts to produce a 'reproducing test': one that fails before the bug fix is applied and passes afterward. The output is a performance report detailing the LLM's success rate and other metrics, useful for comparing different LLMs or development approaches.
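As a sketch of that pass/fail criterion (not the harness's actual API; the helper names and the pytest/git invocations below are illustrative assumptions), a generated test counts as reproducing only if it fails on the buggy checkout and passes once the gold fix is applied:

import subprocess

def run_test(repo_dir, test_path):
    # Run the candidate test with pytest; True means it passed.
    result = subprocess.run(
        ["python", "-m", "pytest", test_path, "-x", "-q"],
        cwd=repo_dir,
    )
    return result.returncode == 0

def is_reproducing_test(repo_dir, test_path, fix_patch):
    # The test must fail on the unfixed (buggy) checkout ...
    fails_before = not run_test(repo_dir, test_path)
    # ... and pass after the gold bug-fix patch is applied.
    subprocess.run(["git", "apply", fix_patch], cwd=repo_dir, check=True)
    passes_after = run_test(repo_dir, test_path)
    # Revert the patch so the checkout is clean for the next evaluation.
    subprocess.run(["git", "apply", "-R", fix_patch], cwd=repo_dir, check=True)
    return fails_before and passes_after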

Use this if you are a software developer, researcher, or AI engineer who needs to benchmark and compare the effectiveness of large language models at generating automated tests for real-world software bugs.

Not ideal if you are looking for a general-purpose bug-reporting tool or an environment to write manual software tests.

software-testing LLM-evaluation bug-reproduction developer-tools code-generation
No Package · No Dependents
Maintenance 10 / 25
Adoption 9 / 25
Maturity 16 / 25
Community 19 / 25
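
The overall score appears to be the sum of the four category scores: 10 + 9 + 16 + 19 = 54 out of a possible 100.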


Stars: 72
Forks: 17
Language: Python
License: MIT
Last pushed: Jan 15, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/logic-star-ai/swt-bench"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
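
To call the endpoint from Python instead of curl, a minimal sketch using only the standard library (the response is assumed to be JSON; field names are not guaranteed, so inspect the payload before relying on them):

import json
import urllib.request

URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/logic-star-ai/swt-bench"

with urllib.request.urlopen(URL) as response:
    payload = json.load(response)

# Print the full payload to see the actual schema.
print(json.dumps(payload, indent=2))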