research-outcome/LLM-Game-Benchmark

Evaluating Large Language Models with Grid-Based Game Competitions: An Extensible LLM Benchmark and Leaderboard

Score: 32/100 (Emerging)

This project evaluates how well different Large Language Models (LLMs) perform in strategic, grid-based games such as Tic-Tac-Toe, Connect Four, and Gomoku. You supply an LLM via its API key, and the model plays against other LLMs or against itself; the output is a detailed record of game outcomes and a comparative leaderboard. It is aimed at AI researchers, machine learning engineers, and data scientists who develop or work with LLMs.

No commits in the last 6 months.

Use this if you want to benchmark the strategic capabilities of various LLMs in a controlled, competitive game environment.

Not ideal if you're looking to evaluate LLMs on tasks beyond simple strategic grid-based games, such as text generation or complex reasoning.

Tags: LLM evaluation · AI benchmarking · machine learning research · game AI · model comparison
Status flags: Stale (6m) · No Package · No Dependents
Maintenance: 0/25
Adoption: 6/25
Maturity: 16/25
Community: 10/25

(The four subscores sum to the overall score of 32/100.)
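A minimal TypeScript sketch of that arithmetic, with illustrative field names (the service's actual scoring formula is not documented on this page):

// Illustrative only: the subscore values come from the card above.
const subscores = { maintenance: 0, adoption: 6, maturity: 16, community: 10 };

// Sum the four 0-25 subscores to reproduce the overall 0-100 score.
const overall = Object.values(subscores).reduce((sum, s) => sum + s, 0);

console.log(overall); // 32, matching the 32/100 shown above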


Stars: 24
Forks: 3
Language: JavaScript
License: not listed
Last pushed: Dec 14, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/research-outcome/LLM-Game-Benchmark"

The endpoint is open to everyone at 100 requests/day with no key needed; a free key raises the limit to 1,000 requests/day.
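If you prefer to call the endpoint programmatically, here is a minimal TypeScript sketch (assuming Node 18+, which provides a global fetch). The response schema is not documented on this page, so the script simply pretty-prints whatever JSON the API returns:

// Endpoint taken verbatim from the curl example above.
const url =
  "https://pt-edge.onrender.com/api/v1/quality/llm-tools/research-outcome/LLM-Game-Benchmark";

async function fetchQuality(): Promise<void> {
  // No API key is needed for up to 100 requests/day.
  const res = await fetch(url);
  if (!res.ok) {
    throw new Error(`Request failed: ${res.status} ${res.statusText}`);
  }
  // Response shape is undocumented here, so just pretty-print the JSON.
  const data = await res.json();
  console.log(JSON.stringify(data, null, 2));
}

fetchQuality().catch((err) => {
  console.error(err);
  process.exit(1);
});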