eth-sri/matharena
Evaluation of LLMs on latest math competitions
This is a platform for evaluating how well different large language models (LLMs) perform on challenging math competitions and olympiads. You provide a competition (like AIME or Project Euler) and one or more LLMs, and it produces detailed evaluation results, including whether each model's answers are correct and the reasoning it generated along the way. Anyone researching or developing AI models for complex problem-solving can use it to benchmark those models.
Use this if you need to rigorously test and compare the mathematical reasoning capabilities of various LLMs on standardized competition problems.
Not ideal if you're looking for a general-purpose math solver for everyday calculations or a tool for teaching basic math concepts.
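To make the kind of check this automates concrete, here is a minimal sketch of grading a model's final answer against a known ground truth. The "Answer:" format, function names, and sample values are illustrative assumptions, not matharena's actual pipeline or API.

import re
from typing import Optional

def extract_final_answer(response: str) -> Optional[str]:
    # Assumed convention for illustration: the model ends its response with "Answer: <value>".
    match = re.search(r"Answer:\s*(\S+)", response)
    return match.group(1) if match else None

def grade(response: str, ground_truth: str) -> bool:
    # Exact match after light normalization; real competition grading
    # (boxed answers, equivalent algebraic forms, etc.) needs more careful handling.
    answer = extract_final_answer(response)
    return answer is not None and answer.strip().lower() == ground_truth.strip().lower()

print(grade("After simplifying, we get 42. Answer: 42", "42"))  # True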
Stars
229
Forks
29
Language
Python
License
MIT
Category
Last pushed
Mar 10, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/eth-sri/matharena"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
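A minimal Python equivalent of the curl call above, assuming the endpoint returns JSON; the response fields are not documented here, so this sketch just fetches and prints the payload.

import requests

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/eth-sri/matharena"
resp = requests.get(url, timeout=10)  # no API key needed within the free 100 requests/day
resp.raise_for_status()
print(resp.json())  # inspect the returned fields before depending on specific keys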
Related repositories
tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality,...
HPAI-BSC/TuRTLe
TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025)
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
haesleinhuepf/human-eval-bia
Benchmarking Large Language Models for Bio-Image Analysis Code Generation
ShuntaroOkuma/adapt-gauge-core
Measure LLM adaptation efficiency — how fast models learn from few examples