zhchen18/ToMBench
ToMBench: Benchmarking Theory of Mind in Large Language Models, ACL 2024.
This project provides a comprehensive benchmark for evaluating how well large language models (LLMs) exhibit human-like social intelligence, commonly referred to as Theory of Mind. It helps researchers and AI developers assess an LLM's ability to interpret complex social scenarios, motivations, and non-literal communication. You provide an LLM's responses to a set of social-reasoning prompts, and the benchmark quantifies its Theory of Mind capabilities across different tasks and abilities.
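As a rough illustration of that evaluation flow, here is a minimal Python sketch of scoring a model on a multiple-choice Theory of Mind item. This is not ToMBench's actual API or data schema; the item format and the query_llm stub are hypothetical placeholders for your own model client and the benchmark's data.

# Hypothetical sketch of a multiple-choice ToM evaluation loop.
# The item format is illustrative, not ToMBench's actual schema.

def query_llm(prompt: str) -> str:
    # Placeholder: replace with a real model call (API client or local model).
    return "A"

items = [
    {
        "story": "Anna puts her keys in the drawer and leaves. "
                 "Ben moves them to the shelf while she is away.",
        "question": "Where will Anna look for her keys first?",
        "options": {"A": "the drawer", "B": "the shelf"},
        "answer": "A",  # classic false-belief item: tests mental-state inference
    },
]

correct = 0
for item in items:
    prompt = (
        f"{item['story']}\n{item['question']}\n"
        + "\n".join(f"{k}. {v}" for k, v in item["options"].items())
        + "\nAnswer with a single letter."
    )
    reply = query_llm(prompt).strip().upper()
    correct += reply.startswith(item["answer"])

print(f"Accuracy: {correct / len(items):.2%}")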
No commits in the last 6 months.
Use this if you are developing or evaluating large language models and need a systematic way to measure their social intelligence, particularly their ability to infer mental states, understand emotions, and interpret non-literal communication in diverse real-world social scenarios.
Not ideal if you are looking for a dataset to train an LLM for specific social tasks, as this benchmark is designed purely for evaluation to prevent data contamination.
Stars: 66
Forks: 6
Language: Python
License: MIT
Last pushed: Jun 24, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/zhchen18/ToMBench"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
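If you prefer Python over curl, a minimal sketch using the requests library follows. The response is assumed to be JSON; its exact fields are not documented here, so inspect what comes back.

import requests

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/zhchen18/ToMBench"
resp = requests.get(url, timeout=10)
resp.raise_for_status()  # fail loudly on rate limits or server errors
print(resp.json())       # assumed JSON payload; check the returned fields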
Higher-rated alternatives
stanfordnlp/axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
aidatatools/ollama-benchmark
LLM Benchmark for Throughput via Ollama (Local LLMs)
LarHope/ollama-benchmark
Ollama-based benchmark with detailed I/O tokens-per-second metrics, written in Python, with a DeepSeek R1 example.
qcri/LLMeBench
Benchmarking Large Language Models
THUDM/LongBench
LongBench v2 and LongBench (ACL '25 & '24)