Software-Engineering-Arena/SWE-Chatbot-Arena
Compare chatbots pairwise via multi‑round evaluations for SE tasks.
This tool helps software engineers evaluate large language models (LLMs) specifically for real-world software engineering tasks. You provide an SE task, optionally with a repository URL, and receive responses from two anonymous LLMs. After comparing their multi-round interactions, you vote for the one that performs better on activities like debugging, code review, or refactoring.
Use this if you need to compare different LLMs to see which one performs best on iterative software engineering workflows and understands repository context.
Not ideal if you are looking to evaluate general-purpose chatbot capabilities unrelated to coding or software development tasks.
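The listing does not say how individual votes are turned into a ranking. As an illustration only, here is a minimal Python sketch of an Elo-style update, a common way arena-style comparisons aggregate pairwise votes; the rating method, K-factor, and model names below are assumptions, not details taken from this repository:

def elo_update(rating_a, rating_b, winner, k=32.0):
    """Update two Elo ratings after one head-to-head vote: 'a', 'b', or 'tie'."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Hypothetical anonymous models; both start from the same baseline rating.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], winner="a"
)
print(ratings)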
Stars: 13
Forks: —
Language: Python
License: —
Category: —
Last pushed: Feb 24, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/Software-Engineering-Arena/SWE-Chatbot-Arena"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
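The same request in Python, sketched with the requests library. The response schema and the header name for an API key are assumptions; this listing only documents the URL and the rate limits:

import requests

URL = ("https://pt-edge.onrender.com/api/v1/quality/llm-tools/"
       "Software-Engineering-Arena/SWE-Chatbot-Arena")

# No key is needed for the free 100 requests/day tier. The header used to
# pass a key is hypothetical; check the provider's documentation.
headers = {}  # e.g. {"Authorization": "Bearer YOUR_KEY"}
resp = requests.get(URL, headers=headers, timeout=10)
resp.raise_for_status()
print(resp.json())  # response fields are not documented in this listing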
Higher-rated alternatives
jeinlee1991/chinese-llm-benchmark
ReLE evaluation: a capability benchmark for Chinese AI large models (continuously updated); currently covers 359 models, including chatgpt, gpt-5.2, o4-mini, Google gemini-3-pro, Claude-4.6, Baidu ERNIE (文心)...
bvobart/mllint
`mllint` is a command-line utility to evaluate the technical quality of Python Machine Learning...
ApextheBoss/canary
🐤 Know when your LLM provider silently degrades. Automated quality testing for AI models. Like...
oolong-tea-2026/arena-ai-leaderboards
📊 Daily auto-updated snapshots of all Arena AI (LMSYS Chatbot Arena) leaderboards — LLM, Vision,...
abject-milkingmachine273/llm-cost-dashboard
Monitor LLM token costs in real time with a terminal dashboard offering per-request tracking,...