Tomiinek/MultiWOZ_Evaluation
Unified MultiWOZ evaluation scripts for the context-to-response task.
This tool helps researchers and developers evaluate how well their conversational AI models generate responses on a multi-turn dialogue benchmark such as MultiWOZ. You provide your model's generated responses and predicted dialogue states, and it calculates metrics including BLEU, Inform and Success rates, and lexical richness. It is aimed at anyone working on dialogue systems, particularly on response generation.
No commits in the last 6 months.
Use this if you need a standardized and easy-to-use way to evaluate the quality of responses generated by your conversational AI model on the MultiWOZ benchmark.
Not ideal if you are evaluating a dialogue system on a dataset other than MultiWOZ, or if you are primarily interested in metrics for tasks such as intent recognition or entity extraction.
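As a rough illustration of the kind of input and output described above, the sketch below builds a small set of predictions (generated responses plus predicted dialogue states) and computes a simple lexical-richness summary over the responses. The dictionary layout, the `lexical_richness` function, and the statistics it returns are assumptions made for this example; they are not the tool's own API or its exact metric definitions, and BLEU and Inform and Success rates are not covered because they need reference data.

```python
from collections import Counter

# Hypothetical layout (an assumption for illustration): dialogue id -> list of
# turns, each holding a generated response and a predicted dialogue state.
predictions = {
    "dialogue_0001": [
        {
            "response": "there are [choice] restaurants in the [area] .",
            "state": {"restaurant": {"area": "centre"}},
        },
        {
            "response": "the [name] is in the [area] .",
            "state": {"restaurant": {"area": "centre", "food": "italian"}},
        },
    ],
}


def lexical_richness(preds):
    """Return simple richness statistics over all generated responses.

    Tokens are obtained by lowercasing and splitting on whitespace; bigrams
    are counted within a single response, never across responses.
    """
    unigrams, bigrams = Counter(), Counter()
    for turns in preds.values():
        for turn in turns:
            tokens = turn["response"].lower().split()
            unigrams.update(tokens)
            bigrams.update(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())
    return {
        "num_tokens": total,
        "unique_unigrams": len(unigrams),
        "unique_bigrams": len(bigrams),
        # Share of distinct tokens among all tokens; 0.0 for empty input.
        "type_token_ratio": len(unigrams) / total if total else 0.0,
    }


stats = lexical_richness(predictions)
```

The predicted states are carried along only to show the shape of the input; this sketch does not score them.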
Stars: 59
Forks: 13
Language: Python
License: MIT
Last pushed: Oct 11, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/Tomiinek/MultiWOZ_Evaluation"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
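The same request can be made from Python with the standard library. This is a minimal sketch: the `fetch_quality` helper is a hypothetical name, and because the response format is not documented here, it returns the raw body as text instead of assuming a particular structure.

```python
import urllib.request

# Endpoint copied from the curl command above.
API_URL = "https://pt-edge.onrender.com/api/v1/quality/nlp/Tomiinek/MultiWOZ_Evaluation"


def fetch_quality(url=API_URL, timeout=10.0):
    """Fetch the listing data and return the raw response body as text.

    No API key is sent, so requests count against the keyless daily limit.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

Calling `fetch_quality()` performs one network request, so it is worth caching the result locally to stay within the 100 requests/day allowance.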
Higher-rated alternatives
gunthercox/chatterbot-corpus
A multilingual dialog corpus
EdinburghNLP/awesome-hallucination-detection
List of papers on hallucination detection in LLMs.
jfainberg/self_dialogue_corpus
The Self-dialogue Corpus - a collection of self-dialogues across music, movies and sports
jkkummerfeld/irc-disentanglement
Dataset and model for disentangling chat on IRC
tae898/multimodal-datasets
Multimodal datasets.