nttmdlab-nlp/ToMATO
ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind (AAAI 2025)
ToMATO is a benchmark for assessing how well large language models (LLMs) understand and predict the thoughts, beliefs, and intentions of others, a capability known as Theory of Mind. It is built from scenarios in which role-playing LLMs converse with each other under differing knowledge conditions, yielding a dataset for evaluating an LLM's capacity for complex social reasoning. It is aimed at LLM researchers and developers working on advanced AI capabilities.
No commits in the last 6 months.
Use this if you are an AI researcher or developer evaluating the 'Theory of Mind' capabilities of your large language models in realistic, conversational settings.
Not ideal if you are looking for a dataset to fine-tune your LLM, as this benchmark is strictly for evaluation to prevent contamination.
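Evaluating a model on a benchmark like this amounts to posing each Theory-of-Mind question about a conversation and comparing the model's pick against the labeled answer. The Python sketch below illustrates that loop under assumed conventions: the record fields (conversation, question, options, answer_index) and the ask_llm callback are hypothetical stand-ins, not ToMATO's documented schema or API.

def evaluate(records, ask_llm):
    """Score an LLM on multiple-choice Theory-of-Mind questions.

    records: iterable of dicts with hypothetical keys
             'conversation', 'question', 'options', 'answer_index'.
    ask_llm: stand-in for your model call; takes a prompt string and
             returns the index of the option the model chose.
    """
    correct = 0
    total = 0
    for r in records:
        # Build a prompt from the conversation, the question, and the
        # numbered answer options.
        prompt = (
            "Conversation:\n" + r["conversation"] + "\n\n"
            "Question: " + r["question"] + "\n"
            + "\n".join(f"{i}. {opt}" for i, opt in enumerate(r["options"]))
        )
        if ask_llm(prompt) == r["answer_index"]:
            correct += 1
        total += 1
    return correct / total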
Stars: 19
Forks: 3
Language: Python
License: —
Category: —
Last pushed: Apr 16, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/nttmdlab-nlp/ToMATO"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
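The same endpoint can be fetched from Python; a minimal sketch, assuming only that it returns JSON (the response fields are not documented here, so the code simply prints whatever comes back):

import requests

# Public endpoint shown above; no key needed at 100 requests/day.
URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/nttmdlab-nlp/ToMATO"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()  # surface HTTP errors early
print(resp.json())       # inspect the returned fields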
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)