jordan-gibbs/secret-hitler-bench
An LLM benchmark based on the popular social deception game Secret Hitler. Test the intelligence, long-context planning, logical reasoning, and deception capabilities of popular AI models.
This project simulates full 8-player games of Secret Hitler with AI agents driven by large language models, letting you test how well different models perform at deception, strategic thinking, and social deduction. It takes your chosen language models as input and outputs detailed game logs, win rates, and player statistics, all viewable in a live web interface. It is aimed at researchers, AI evaluators, and anyone interested in the social intelligence of AI.
Use this if you want to rigorously evaluate the lying, strategic planning, and social interaction capabilities of various large language models in a complex game setting.
Not ideal if you are looking for a free simulation, as running LLM-powered games can be very expensive.
Stars
8
Forks
—
Language
Python
License
—
Category
—
Last pushed
Mar 23, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/jordan-gibbs/secret-hitler-bench"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
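The same endpoint can also be queried from Python. Below is a minimal sketch using only the standard library; the JSON shape of the response is not documented here, so the code makes no assumptions about its fields and simply pretty-prints whatever comes back.

```python
import json
import urllib.request

# Base endpoint taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"

def quality_url(repo: str) -> str:
    """Build the quality-data URL for an 'owner/name' repo slug."""
    return f"{API_BASE}/{repo}"

if __name__ == "__main__":
    url = quality_url("jordan-gibbs/secret-hitler-bench")
    # No API key is needed for up to 100 requests/day.
    with urllib.request.urlopen(url) as resp:
        print(json.dumps(json.load(resp), indent=2))
```

With a free key, the limit rises to 1,000 requests/day; how the key is passed (header or query parameter) is not specified in this listing.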
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems