TianboJi/Dialogue-Eval
Code and data for paper "Achieving Reliable Human Assessment of Open-Domain Dialogue Systems"
This tool helps researchers and developers reliably evaluate open-domain dialogue systems using human feedback. It takes collected dialogue data and the associated human ratings (e.g., interestingness, fluency, how robotic the system sounds) in JSON format, and outputs statistical reports such as per-system Z-scores, rater agreement metrics, and significance-test visualizations for comparing conversational AI models. It is aimed at anyone developing or researching conversational AI systems.
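As an illustration of the kind of analysis involved, the sketch below computes per-system standardized (Z) scores from per-rater numerical ratings. The JSON layout, field names, and the per-rater standardization step are assumptions made for illustration, not the repository's actual schema or code.

    import json
    from collections import defaultdict
    from statistics import mean, stdev

    # Hypothetical input layout (not the repo's actual schema): a list of
    # records, each with a rater id, the system being rated, and a numeric score.
    ratings = json.loads("""
    [
      {"rater": "r1", "system": "botA", "score": 78},
      {"rater": "r1", "system": "botB", "score": 55},
      {"rater": "r2", "system": "botA", "score": 90},
      {"rater": "r2", "system": "botB", "score": 60}
    ]
    """)

    # Standardize each rater's scores (mean 0, sd 1) so that raters who use
    # different parts of the rating scale become comparable.
    by_rater = defaultdict(list)
    for r in ratings:
        by_rater[r["rater"]].append(r["score"])

    def z(score, scores):
        sd = stdev(scores) if len(scores) > 1 else 1.0
        return (score - mean(scores)) / (sd or 1.0)

    # Average the standardized scores per system to get a system-level Z-score.
    by_system = defaultdict(list)
    for r in ratings:
        by_system[r["system"]].append(z(r["score"], by_rater[r["rater"]]))

    for system, zs in sorted(by_system.items()):
        print(f"{system}: z = {mean(zs):+.2f}")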
No commits in the last 6 months.
Use this if you need to rigorously analyze human evaluation data to compare multiple open-domain dialogue systems and ensure the reliability of your assessment.
Not ideal if you are looking for a tool to collect human feedback or if your evaluation criteria are not numerical ratings.
Stars
8
Forks
2
Language
Python
License
MIT
Category
NLP
Last pushed
Nov 18, 2022
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/TianboJi/Dialogue-Eval"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
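For programmatic access, a minimal Python sketch using requests is shown below; the response fields are not documented here, so the code only prints the returned JSON rather than assuming a schema.

    import requests

    # Public endpoint from the listing above; no API key required for the
    # free tier (100 requests/day at the time of writing).
    URL = "https://pt-edge.onrender.com/api/v1/quality/nlp/TianboJi/Dialogue-Eval"

    resp = requests.get(URL, timeout=10)
    resp.raise_for_status()
    data = resp.json()

    # Inspect the payload to see which fields the endpoint actually returns.
    print(data)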
Higher-rated alternatives
gunthercox/chatterbot-corpus
A multilingual dialog corpus
EdinburghNLP/awesome-hallucination-detection
List of papers on hallucination detection in LLMs.
jfainberg/self_dialogue_corpus
The Self-dialogue Corpus - a collection of self-dialogues across music, movies and sports
jkkummerfeld/irc-disentanglement
Dataset and model for disentangling chat on IRC
Tomiinek/MultiWOZ_Evaluation
Unified MultiWOZ evaluation scripts for the context-to-response task.