microsoft/SWE-bench-Live
[NeurIPS 2025 D&B] 🚀 SWE-bench Goes Live!
This project helps AI researchers and developers evaluate how well their systems resolve real-world software engineering issues. It takes a model's proposed code changes (patches) for reported bugs and tasks across various languages and platforms, and scores them against a continuously updated dataset of real-world problems. It is intended for developers and researchers building and improving AI-powered software development tools and agents.
Use this if you are developing AI models designed to fix software bugs or implement new features and need a robust, current, and objective way to measure their performance.
Not ideal if you are a software developer looking for a tool that fixes bugs or automates your daily coding tasks for you; this is an evaluation framework for AI systems, not a coding assistant.
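To make the evaluation flow concrete, here is a minimal sketch of how predictions are typically supplied to a SWE-bench-style harness: one record per task instance, each carrying the model's proposed patch as a unified diff. The dataset path, split name, and field names below are assumptions; check the repository README for the authoritative format and evaluation command.

# Minimal sketch of the prediction format used by SWE-bench-style harnesses.
# Dataset path, split, and field names are assumptions, not the project's documented API.
import json

from datasets import load_dataset  # pip install datasets


def generate_patch(task: dict) -> str:
    """Placeholder for your agent: return a unified diff that resolves the issue."""
    return ""


# Load the live task instances (dataset path and split assumed).
tasks = load_dataset("SWE-bench-Live/SWE-bench-Live", split="lite")

# Each prediction pairs a task instance with the model's proposed patch.
predictions = [
    {
        "instance_id": task["instance_id"],
        "model_name_or_path": "my-agent-v1",   # hypothetical model label
        "model_patch": generate_patch(task),   # your system produces the diff
    }
    for task in tasks
]

with open("predictions.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")

The harness then sets up each task's execution environment, applies the patch, and runs the repository's tests to decide whether the issue was resolved; see the project README for the exact evaluation command.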
Stars: 170
Forks: 23
Language: Python
License: MIT
Category:
Last pushed: Mar 09, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/microsoft/SWE-bench-Live"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
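The same endpoint can also be queried from Python; the snippet below is a minimal sketch using the URL from the curl example above, and the shape of the JSON response is an assumption rather than a documented schema.

# Minimal sketch of querying the listing API from Python.
import requests  # pip install requests

URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/microsoft/SWE-bench-Live"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()

# Print whatever metadata fields the endpoint returns.
for key, value in resp.json().items():
    print(f"{key}: {value}")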
Related tools
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)