pillowsofwind/DebateQA

[EACL 2026] The official GitHub repo for the paper "DebateQA: Evaluating Question Answering on Debatable Knowledge"

Score: 23 / 100 (Experimental)

This tool measures the quality of a Large Language Model's (LLM's) answers to complex, debatable questions. You supply the LLM's generated answers to a set of debatable questions, and it outputs scores for how comprehensive and balanced those answers are. It is aimed at researchers and developers working on LLM evaluation and responsible AI.

Use this if you need to objectively quantify how well an LLM handles questions with multiple valid perspectives or acknowledges the contentious nature of a topic.

Not ideal if you are looking for a tool to generate debates or synthesize different viewpoints, as this is purely for evaluation.

LLM evaluation · NLP research · Generative AI testing · Responsible AI · Question answering systems
No License · No Package · No Dependents
Maintenance 10 / 25
Adoption 5 / 25
Maturity 8 / 25
Community 0 / 25

How are scores calculated?
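The page does not spell out the formula, but the four category scores above sum exactly to the overall figure. A minimal Python sketch of that arithmetic, assuming a plain unweighted sum of the 0-25 categories:

# Assumption: the overall score is the plain sum of the four 0-25 category scores.
category_scores = {"Maintenance": 10, "Adoption": 5, "Maturity": 8, "Community": 0}
overall = sum(category_scores.values())  # 10 + 5 + 8 + 0 = 23
print(f"Overall: {overall} / {25 * len(category_scores)}")  # Overall: 23 / 100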

Stars: 11
Forks:
Language: Python
License: none
Last pushed: Jan 16, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/pillowsofwind/DebateQA"

Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
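For programmatic use, here is a minimal Python sketch of the same request. It assumes the endpoint returns JSON; the response structure is not documented on this page, so inspect the payload before relying on specific fields.

import json
import urllib.request

# Quality endpoint for this repository (same URL as the curl example above).
URL = "https://pt-edge.onrender.com/api/v1/quality/nlp/pillowsofwind/DebateQA"

def fetch_quality_report(url: str) -> dict:
    # Assumption: the endpoint returns a JSON document; adjust if it does not.
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)

if __name__ == "__main__":
    report = fetch_quality_report(URL)
    print(json.dumps(report, indent=2))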