microsoft/SWE-bench-Live
[NeurIPS 2025 D&B] 🚀 SWE-bench Goes Live!
This project helps AI researchers and developers evaluate how well their systems resolve real-world software engineering issues. It takes a model's proposed code changes (patches) for reported bugs and tasks across various languages and platforms, and scores them against a continuously updated dataset of real-world problems. It is intended for developers and researchers building and improving AI-powered software development tools and agents.
Use this if you are developing AI models designed to fix software bugs or implement new features and need a robust, current, and objective way to measure their performance.
Not ideal if you are a software developer looking for a tool that fixes bugs or automates your daily coding tasks for you; this is an evaluation framework for AI systems, not a coding assistant.
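To make the evaluation flow concrete, here is a minimal sketch of how predictions are typically supplied to a SWE-bench-style harness: one record per task instance, each carrying the model's proposed patch as a unified diff. The dataset path, split name, and field names below are assumptions; check the repository README for the authoritative format and evaluation command.

# Minimal sketch of the prediction format used by SWE-bench-style harnesses.
# Dataset path, split, and field names are assumptions, not the project's documented API.
import json

from datasets import load_dataset  # pip install datasets


def generate_patch(task: dict) -> str:
    """Placeholder for your agent: return a unified diff that resolves the issue."""
    return ""


# Load the live task instances (dataset path and split assumed).
tasks = load_dataset("SWE-bench-Live/SWE-bench-Live", split="lite")

# Each prediction pairs a task instance with the model's proposed patch.
predictions = [
    {
        "instance_id": task["instance_id"],
        "model_name_or_path": "my-agent-v1",   # hypothetical model label
        "model_patch": generate_patch(task),   # your system produces the diff
    }
    for task in tasks
]

with open("predictions.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")

The harness then sets up each task's execution environment, applies the patch, and runs the repository's tests to decide whether the issue was resolved; see the project README for the exact evaluation command.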
Stars: 170
Forks: 23
Language: Python
License: MIT
Category:
Last pushed: Mar 09, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/microsoft/SWE-bench-Live"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
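The same endpoint can also be queried from Python; the snippet below is a minimal sketch using the URL from the curl example above, and the shape of the JSON response is an assumption rather than a documented schema.

# Minimal sketch of querying the listing API from Python.
import requests  # pip install requests

URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/microsoft/SWE-bench-Live"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()

# Print whatever metadata fields the endpoint returns.
for key, value in resp.json().items():
    print(f"{key}: {value}")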
Related tools
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)