nttmdlab-nlp/ToMATO
ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind (AAAI 2025)
ToMATO is a benchmark for assessing how well large language models (LLMs) understand and predict the thoughts, beliefs, and intentions of others, a capability known as Theory of Mind. It is built from scenarios in which role-playing LLMs converse with each other under differing knowledge conditions, yielding a dataset for evaluating an LLM's capacity for complex social reasoning. It is aimed at LLM researchers and developers working on advanced AI capabilities.
No commits in the last 6 months.
Use this if you are an AI researcher or developer evaluating the 'Theory of Mind' capabilities of your large language models in realistic, conversational settings.
Not ideal if you are looking for a dataset to fine-tune your LLM, as this benchmark is strictly for evaluation to prevent contamination.
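Evaluating a model on a benchmark like this amounts to posing each Theory-of-Mind question about a conversation and comparing the model's pick against the labeled answer. The Python sketch below illustrates that loop under assumed conventions: the record fields (conversation, question, options, answer_index) and the ask_llm callback are hypothetical stand-ins, not ToMATO's documented schema or API.

def evaluate(records, ask_llm):
    """Score an LLM on multiple-choice Theory-of-Mind questions.

    records: iterable of dicts with hypothetical keys
             'conversation', 'question', 'options', 'answer_index'.
    ask_llm: stand-in for your model call; takes a prompt string and
             returns the index of the option the model chose.
    """
    correct = 0
    total = 0
    for r in records:
        # Build a prompt from the conversation, the question, and the
        # numbered answer options.
        prompt = (
            "Conversation:\n" + r["conversation"] + "\n\n"
            "Question: " + r["question"] + "\n"
            + "\n".join(f"{i}. {opt}" for i, opt in enumerate(r["options"]))
        )
        if ask_llm(prompt) == r["answer_index"]:
            correct += 1
        total += 1
    return correct / total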
Stars: 19
Forks: 3
Language: Python
License: —
Category: —
Last pushed: Apr 16, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/nttmdlab-nlp/ToMATO"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
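The same endpoint can be fetched from Python; a minimal sketch, assuming only that it returns JSON (the response fields are not documented here, so the code simply prints whatever comes back):

import requests

# Public endpoint shown above; no key needed at 100 requests/day.
URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/nttmdlab-nlp/ToMATO"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()  # surface HTTP errors early
print(resp.json())       # inspect the returned fields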
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)