eth-sri/matharena
Evaluation of LLMs on latest math competitions
This is a platform for evaluating how well different large language models (LLMs) perform on challenging math competitions and olympiads. You provide a competition (like AIME or Project Euler) and one or more LLMs, and it produces detailed evaluation results, including whether each model's answers are correct and the reasoning it generated along the way. Anyone researching or developing AI models for complex problem-solving can use it to benchmark those models.
Use this if you need to rigorously test and compare the mathematical reasoning capabilities of various LLMs on standardized competition problems.
Not ideal if you're looking for a general-purpose math solver for everyday calculations or a tool for teaching basic math concepts.
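To make the kind of check this automates concrete, here is a minimal sketch of grading a model's final answer against a known ground truth. The "Answer:" format, function names, and sample values are illustrative assumptions, not matharena's actual pipeline or API.

import re
from typing import Optional

def extract_final_answer(response: str) -> Optional[str]:
    # Assumed convention for illustration: the model ends its response with "Answer: <value>".
    match = re.search(r"Answer:\s*(\S+)", response)
    return match.group(1) if match else None

def grade(response: str, ground_truth: str) -> bool:
    # Exact match after light normalization; real competition grading
    # (boxed answers, equivalent algebraic forms, etc.) needs more careful handling.
    answer = extract_final_answer(response)
    return answer is not None and answer.strip().lower() == ground_truth.strip().lower()

print(grade("After simplifying, we get 42. Answer: 42", "42"))  # True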
Stars
229
Forks
29
Language
Python
License
MIT
Category
Last pushed
Mar 10, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/eth-sri/matharena"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
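A minimal Python equivalent of the curl call above, assuming the endpoint returns JSON; the response fields are not documented here, so this sketch just fetches and prints the payload.

import requests

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/eth-sri/matharena"
resp = requests.get(url, timeout=10)  # no API key needed within the free 100 requests/day
resp.raise_for_status()
print(resp.json())  # inspect the returned fields before depending on specific keys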
Related repositories
tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality,...
HPAI-BSC/TuRTLe
TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025)
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
haesleinhuepf/human-eval-bia
Benchmarking Large Language Models for Bio-Image Analysis Code Generation
ShuntaroOkuma/adapt-gauge-core
Measure LLM adaptation efficiency — how fast models learn from few examples