Software-Engineering-Arena/SWE-Chatbot-Arena
Compare chatbots pairwise via multi‑round evaluations for SE tasks.
This tool helps software engineers evaluate large language models (LLMs) specifically for real-world software engineering tasks. You provide an SE task, optionally with a repository URL, and receive responses from two anonymous LLMs. After comparing their multi-round interactions, you vote for the one that performs better on activities like debugging, code review, or refactoring.
Use this if you need to compare different LLMs to see which one performs best on iterative software engineering workflows and understands repository context.
Not ideal if you are looking to evaluate general-purpose chatbot capabilities unrelated to coding or software development tasks.
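The listing does not say how individual votes are turned into a ranking. As an illustration only, here is a minimal Python sketch of an Elo-style update, a common way arena-style comparisons aggregate pairwise votes; the rating method, K-factor, and model names below are assumptions, not details taken from this repository:

def elo_update(rating_a, rating_b, winner, k=32.0):
    """Update two Elo ratings after one head-to-head vote: 'a', 'b', or 'tie'."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Hypothetical anonymous models; both start from the same baseline rating.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], winner="a"
)
print(ratings)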
Stars: 13
Forks: —
Language: Python
License: —
Category: —
Last pushed: Feb 24, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/Software-Engineering-Arena/SWE-Chatbot-Arena"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
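The same request in Python, sketched with the requests library. The response schema and the header name for an API key are assumptions; this listing only documents the URL and the rate limits:

import requests

URL = ("https://pt-edge.onrender.com/api/v1/quality/llm-tools/"
       "Software-Engineering-Arena/SWE-Chatbot-Arena")

# No key is needed for the free 100 requests/day tier. The header used to
# pass a key is hypothetical; check the provider's documentation.
headers = {}  # e.g. {"Authorization": "Bearer YOUR_KEY"}
resp = requests.get(URL, headers=headers, timeout=10)
resp.raise_for_status()
print(resp.json())  # response fields are not documented in this listing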
Higher-rated alternatives
jeinlee1991/chinese-llm-benchmark
ReLE evaluation: a capability benchmark for Chinese AI large models (continuously updated); currently covers 359 models, including chatgpt, gpt-5.2, o4-mini, Google gemini-3-pro, Claude-4.6, Baidu ERNIE (文心)...
bvobart/mllint
`mllint` is a command-line utility to evaluate the technical quality of Python Machine Learning...
ApextheBoss/canary
🐤 Know when your LLM provider silently degrades. Automated quality testing for AI models. Like...
oolong-tea-2026/arena-ai-leaderboards
📊 Daily auto-updated snapshots of all Arena AI (LMSYS Chatbot Arena) leaderboards — LLM, Vision,...
abject-milkingmachine273/llm-cost-dashboard
Monitor LLM token costs in real time with a terminal dashboard offering per-request tracking,...