FreedomIntelligence/MTalk-Bench
MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols
This benchmark helps researchers and developers assess how well speech-to-speech AI models handle realistic, multi-turn conversations. You provide your model's audio responses to a set of conversational prompts, and the benchmark scores them across three dimensions: semantic understanding, paralinguistic cues such as tone, and interaction with ambient sound. Evaluation uses both arena-style pairwise comparisons and rubric-based scoring. It's aimed at AI researchers and engineers building and refining conversational AI and large language models.
Use this if you are developing or evaluating speech-to-speech AI models and need a comprehensive way to benchmark their performance in dynamic, multi-turn dialogue scenarios.
Not ideal if you are looking for a tool to simply transcribe audio or translate speech, as its primary purpose is advanced model evaluation.
Stars: 18
Forks: 1
Language: JavaScript
License: Apache-2.0
Last pushed: Nov 19, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/FreedomIntelligence/MTalk-Bench"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
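For programmatic use, here is a minimal JavaScript sketch that issues the same GET request as the curl command above and prints the JSON response. It assumes Node 18+ (for the global fetch API) and that the endpoint returns JSON; the response schema and the mechanism for passing an API key are not documented on this page, so the code simply prints whatever comes back.

// Minimal sketch: fetch MTalk-Bench quality data from the public endpoint.
// Assumes Node 18+ (global fetch) and a JSON response; the response
// schema is not documented here, so we just pretty-print it.
const ENDPOINT =
  "https://pt-edge.onrender.com/api/v1/quality/llm-tools/FreedomIntelligence/MTalk-Bench";

async function main() {
  const res = await fetch(ENDPOINT);
  if (!res.ok) {
    throw new Error(`Request failed: ${res.status} ${res.statusText}`);
  }
  const data = await res.json();
  console.log(JSON.stringify(data, null, 2));
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});

Save this as, e.g., fetch-mtalk-bench.mjs (a hypothetical filename) and run it with node fetch-mtalk-bench.mjs.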
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems