eth-lre/mathtutorbench
Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors, EMNLP 2025 Oral
This project provides a standardized way to test how well AI language models can act as math tutors. Given a math-tutoring model, it produces a detailed report on the model's performance across seven key teaching skills, such as problem-solving assistance and mistake correction. Educators, instructional designers, and AI developers building educational tools can use it to understand and improve their AI tutors.
Use this if you are developing or evaluating an AI model designed to tutor students in mathematics and need a comprehensive, automated way to assess its pedagogical effectiveness.
Not ideal if you are looking for a general-purpose AI evaluation tool or a benchmark for educational AI outside of mathematics.
Stars: 32
Forks: 10
Language: Python
License: —
Category: —
Last pushed: Nov 18, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/eth-lre/mathtutorbench"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
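For scripted access, here is a minimal Python sketch hitting the same public endpoint as the curl command above. It assumes the endpoint returns JSON; the response fields are not documented here, so the script simply prints whatever comes back.

import requests

# Same public endpoint as the curl example (100 requests/day without a key).
URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/eth-lre/mathtutorbench"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()  # fail loudly on HTTP errors or rate limiting
print(resp.json())       # assumption: the API responds with a JSON body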
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)