pillowsofwind/DebateQA

[EACL 2026] The official GitHub repo for the paper "DebateQA: Evaluating Question Answering on Debatable Knowledge"

Score: 23 / 100 (Experimental)

This tool measures the quality of a Large Language Model's (LLM's) answers to complex, debatable questions. You supply the LLM's generated answers to a set of debatable questions, and it outputs scores for how comprehensive and balanced those answers are. It is aimed at researchers and developers working on LLM evaluation and responsible AI.

Use this if you need to objectively quantify how well an LLM handles questions with multiple valid perspectives or acknowledges the contentious nature of a topic.

Not ideal if you are looking for a tool to generate debates or synthesize different viewpoints, as this is purely for evaluation.

LLM evaluation · NLP research · Generative AI testing · Responsible AI · Question answering systems
No License · No Package · No Dependents
Maintenance 10 / 25
Adoption 5 / 25
Maturity 8 / 25
Community 0 / 25

How are scores calculated?
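The page does not spell out the formula, but the four category scores above sum exactly to the overall figure. A minimal Python sketch of that arithmetic, assuming a plain unweighted sum of the 0-25 categories:

# Assumption: the overall score is the plain sum of the four 0-25 category scores.
category_scores = {"Maintenance": 10, "Adoption": 5, "Maturity": 8, "Community": 0}
overall = sum(category_scores.values())  # 10 + 5 + 8 + 0 = 23
print(f"Overall: {overall} / {25 * len(category_scores)}")  # Overall: 23 / 100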

Stars: 11
Forks:
Language: Python
License: none
Last pushed: Jan 16, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/pillowsofwind/DebateQA"

Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
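For programmatic use, here is a minimal Python sketch of the same request. It assumes the endpoint returns JSON; the response structure is not documented on this page, so inspect the payload before relying on specific fields.

import json
import urllib.request

# Quality endpoint for this repository (same URL as the curl example above).
URL = "https://pt-edge.onrender.com/api/v1/quality/nlp/pillowsofwind/DebateQA"

def fetch_quality_report(url: str) -> dict:
    # Assumption: the endpoint returns a JSON document; adjust if it does not.
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)

if __name__ == "__main__":
    report = fetch_quality_report(URL)
    print(json.dumps(report, indent=2))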