research-outcome/LLM-Game-Benchmark
Evaluating Large Language Models with Grid-Based Game Competitions: An Extensible LLM Benchmark and Leaderboard
This project evaluates how well different Large Language Models (LLMs) play strategic, grid-based games such as Tic-Tac-Toe, Connect Four, and Gomoku. You connect an LLM through its API key, it plays against other LLMs or against itself, and the output is a detailed record of game results plus a comparative leaderboard (a simplified sketch of that game loop appears after the notes below). It is aimed at AI researchers, machine learning engineers, and data scientists who develop or evaluate LLMs.
No commits in the last 6 months.
Use this if you want to benchmark the strategic capabilities of various LLMs in a controlled, competitive game environment.
Not ideal if you're looking to evaluate LLMs on tasks beyond simple strategic grid-based games, such as text generation or complex reasoning.
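The workflow described above boils down to a simple loop: show each model the current board, parse the move it chooses, apply it, and log the result. The JavaScript sketch below illustrates that idea only; it is not the repository's actual code, and askModel() is a hypothetical stub standing in for the LLM API call the real benchmark would make with the key you supply.

```javascript
// Illustrative sketch only (not the repository's code): two "players" alternate
// turns on a 3x3 Tic-Tac-Toe grid. In LLM-Game-Benchmark each move would come
// from an LLM API call; here askModel() just picks a random legal cell.

function emptyBoard() {
  return Array.from({ length: 3 }, () => Array(3).fill("."));
}

function legalMoves(board) {
  const moves = [];
  board.forEach((row, r) =>
    row.forEach((cell, c) => {
      if (cell === ".") moves.push([r, c]);
    })
  );
  return moves;
}

// Hypothetical stand-in for an LLM call: the real benchmark would send the
// board state as a prompt and parse the model's chosen cell from its reply.
async function askModel(board, symbol) {
  const moves = legalMoves(board);
  return moves[Math.floor(Math.random() * moves.length)];
}

function winner(board) {
  const lines = [
    ...board,                                          // rows
    ...[0, 1, 2].map((c) => board.map((row) => row[c])), // columns
    [board[0][0], board[1][1], board[2][2]],           // diagonals
    [board[0][2], board[1][1], board[2][0]],
  ];
  for (const line of lines) {
    if (line[0] !== "." && line.every((v) => v === line[0])) return line[0];
  }
  return null;
}

async function playGame() {
  const board = emptyBoard();
  const record = []; // per-move log, analogous to the detailed game record
  let symbol = "X";
  while (legalMoves(board).length > 0 && !winner(board)) {
    const [r, c] = await askModel(board, symbol);
    board[r][c] = symbol;
    record.push({ symbol, move: [r, c] });
    symbol = symbol === "X" ? "O" : "X";
  }
  return { winner: winner(board) ?? "draw", record };
}

playGame().then((result) => console.log(result));
```

In the actual benchmark, aggregating many such game records across model pairings is what produces the comparative leaderboard.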
Stars: 24
Forks: 3
Language: JavaScript
License: —
Category:
Last pushed: Dec 14, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/research-outcome/LLM-Game-Benchmark"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
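If you prefer calling the endpoint from code rather than curl, a minimal sketch follows. It assumes only what this page states: the endpoint is public for up to 100 requests/day without a key. How a key is supplied for the 1,000/day tier isn't documented here, so the sketch uses the keyless tier, and since the response schema isn't shown, the parsed JSON is printed without interpretation.

```javascript
// Minimal sketch of the same request made from a script instead of curl.
// Assumes Node 18+ (built-in fetch); the response schema isn't documented
// on this page, so the parsed JSON is simply printed as-is.
const url =
  "https://pt-edge.onrender.com/api/v1/quality/llm-tools/research-outcome/LLM-Game-Benchmark";

fetch(url)
  .then((res) => {
    if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
    return res.json();
  })
  .then((data) => console.log(data))
  .catch((err) => console.error(err));
```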
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)