lechmazur/debate

Adversarial multi-turn benchmark for LLM debate quality, using side-swapped matchups and multi-model judging to rank models by debate performance.

Quality score: 18 / 100 (Experimental)

This benchmark helps developers and researchers evaluate how well large language models (LLMs) perform in multi-turn, adversarial debates across diverse topics. It takes a set of LLMs and debate propositions as input and outputs a leaderboard ranking models by their ability to argue, rebut, and stay coherent under pressure. It is aimed at anyone building, deploying, or researching LLMs who needs to understand their argumentative capabilities.

Use this if you need to understand how well an LLM can defend a position, adapt to counterarguments, and maintain a coherent argument over multiple turns in a debate, rather than just providing a single good answer.

Not ideal if you are looking for a simple, one-shot evaluation of an LLM's general knowledge or immediate question-answering ability without the pressure of an adversarial exchange.
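The headline mechanism is worth unpacking: in a side-swapped design, each pair of models debates the same proposition twice, once per side, so neither model benefits from being assigned the easier position. Below is a minimal sketch of that pairing scheme in Python; the function name, model names, and data layout are illustrative assumptions, not the repo's actual code.

from itertools import combinations

def side_swapped_matchups(models, propositions):
    # For every proposition, pair each two models twice, swapping
    # which side (PRO vs. CON) each one argues. Illustrative only;
    # lechmazur/debate's real pairing logic may differ.
    matchups = []
    for prop in propositions:
        for a, b in combinations(models, 2):
            matchups.append({"proposition": prop, "pro": a, "con": b})
            matchups.append({"proposition": prop, "pro": b, "con": a})
    return matchups

# Example: 3 models x 1 proposition -> 3 pairs x 2 sides = 6 debates,
# each later scored by a panel of judge models.
for d in side_swapped_matchups(["model-a", "model-b", "model-c"],
                               ["AI benchmarks should be open source."]):
    print(f'{d["pro"]} (pro) vs {d["con"]} (con): {d["proposition"]}')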

Tags: LLM evaluation, conversational AI, model comparison, argumentation assessment, AI research
No license · No package · No dependents

Score breakdown (the four components sum to the 18 / 100 overall):
Maintenance: 13 / 25
Adoption: 4 / 25
Maturity: 1 / 25
Community: 0 / 25

Stars: 8
Forks: n/a
Language: n/a
License: none
Category: ai-debate-arenas
Last pushed: Mar 23, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/lechmazur/debate"

Open to everyone: 100 requests/day with no key. A free key raises the limit to 1,000/day.
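For scripted access, the same endpoint can be called from Python. A minimal sketch, assuming the endpoint returns JSON; the response schema is not documented on this page, so the example just pretty-prints whatever comes back.

import json
import urllib.request

# Public quality-score endpoint quoted above; no key is needed at
# the 100-requests/day tier.
URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/lechmazur/debate"

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

# Schema is undocumented here, so just pretty-print the payload.
print(json.dumps(data, indent=2))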