research-outcome/LLM-Game-Benchmark
Evaluating Large Language Models with Grid-Based Game Competitions: An Extensible LLM Benchmark and Leaderboard
This project evaluates how well different Large Language Models (LLMs) play strategic, grid-based games such as Tic-Tac-Toe, Connect Four, and Gomoku. You connect an LLM through its API key, it plays against other LLMs or against itself, and the output is a detailed record of game results plus a comparative leaderboard (a simplified sketch of that game loop appears after the notes below). It is aimed at AI researchers, machine learning engineers, and data scientists who develop or evaluate LLMs.
No commits in the last 6 months.
Use this if you want to benchmark the strategic capabilities of various LLMs in a controlled, competitive game environment.
Not ideal if you're looking to evaluate LLMs on tasks beyond simple strategic grid-based games, such as text generation or complex reasoning.
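The workflow described above boils down to a simple loop: show each model the current board, parse the move it chooses, apply it, and log the result. The JavaScript sketch below illustrates that idea only; it is not the repository's actual code, and askModel() is a hypothetical stub standing in for the LLM API call the real benchmark would make with the key you supply.

```javascript
// Illustrative sketch only (not the repository's code): two "players" alternate
// turns on a 3x3 Tic-Tac-Toe grid. In LLM-Game-Benchmark each move would come
// from an LLM API call; here askModel() just picks a random legal cell.

function emptyBoard() {
  return Array.from({ length: 3 }, () => Array(3).fill("."));
}

function legalMoves(board) {
  const moves = [];
  board.forEach((row, r) =>
    row.forEach((cell, c) => {
      if (cell === ".") moves.push([r, c]);
    })
  );
  return moves;
}

// Hypothetical stand-in for an LLM call: the real benchmark would send the
// board state as a prompt and parse the model's chosen cell from its reply.
async function askModel(board, symbol) {
  const moves = legalMoves(board);
  return moves[Math.floor(Math.random() * moves.length)];
}

function winner(board) {
  const lines = [
    ...board,                                          // rows
    ...[0, 1, 2].map((c) => board.map((row) => row[c])), // columns
    [board[0][0], board[1][1], board[2][2]],           // diagonals
    [board[0][2], board[1][1], board[2][0]],
  ];
  for (const line of lines) {
    if (line[0] !== "." && line.every((v) => v === line[0])) return line[0];
  }
  return null;
}

async function playGame() {
  const board = emptyBoard();
  const record = []; // per-move log, analogous to the detailed game record
  let symbol = "X";
  while (legalMoves(board).length > 0 && !winner(board)) {
    const [r, c] = await askModel(board, symbol);
    board[r][c] = symbol;
    record.push({ symbol, move: [r, c] });
    symbol = symbol === "X" ? "O" : "X";
  }
  return { winner: winner(board) ?? "draw", record };
}

playGame().then((result) => console.log(result));
```

In the actual benchmark, aggregating many such game records across model pairings is what produces the comparative leaderboard.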
Stars: 24
Forks: 3
Language: JavaScript
License: —
Category:
Last pushed: Dec 14, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/research-outcome/LLM-Game-Benchmark"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
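If you prefer calling the endpoint from code rather than curl, a minimal sketch follows. It assumes only what this page states: the endpoint is public for up to 100 requests/day without a key. How a key is supplied for the 1,000/day tier isn't documented here, so the sketch uses the keyless tier, and since the response schema isn't shown, the parsed JSON is printed without interpretation.

```javascript
// Minimal sketch of the same request made from a script instead of curl.
// Assumes Node 18+ (built-in fetch); the response schema isn't documented
// on this page, so the parsed JSON is simply printed as-is.
const url =
  "https://pt-edge.onrender.com/api/v1/quality/llm-tools/research-outcome/LLM-Game-Benchmark";

fetch(url)
  .then((res) => {
    if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
    return res.json();
  })
  .then((data) => console.log(data))
  .catch((err) => console.error(err));
```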
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)