TIGER-AI-Lab/MMLU-Pro

The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024]

56 / 100 · Established

This project provides a challenging benchmark for evaluating large language models (LLMs) on complex reasoning tasks across 14 academic domains. It takes an LLM's responses to multiple-choice questions drawn from academic exams and textbooks and outputs an accuracy score, revealing how well the model understands and reasons with expert-level knowledge. It is aimed at AI researchers and developers who need to rigorously assess and compare the capabilities of different language models.
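At a high level, an evaluation run amounts to prompting a model on each question and comparing its chosen option letter to the gold answer. The sketch below is a minimal illustration, assuming the dataset is published on the Hugging Face Hub as TIGER-Lab/MMLU-Pro with question, options, and answer fields, and using a hypothetical ask_model stub in place of a real LLM call; see the repository's run scripts for the actual evaluation pipeline.

from datasets import load_dataset

# Dataset ID and field names are assumptions; check the repository's
# evaluation scripts for the exact schema used in the paper.
dataset = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

def ask_model(question: str, options: list[str]) -> str:
    """Hypothetical stand-in: query an LLM and return its chosen option letter ("A", "B", ...)."""
    return "A"  # replace this stub with a real model call

correct = 0
for example in dataset:
    prediction = ask_model(example["question"], example["options"])
    if prediction == example["answer"]:  # gold label stored as an option letter
        correct += 1

print(f"Accuracy: {correct / len(dataset):.2%}")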

Use this if you need to rigorously test and compare the advanced reasoning and expert-knowledge capabilities of various large language models on a challenging, robust dataset.

Not ideal if you are looking for a simple, quick evaluation of basic language comprehension, or if your models are not designed for multiple-choice academic reasoning.

large-language-models model-evaluation ai-benchmarking natural-language-understanding reasoning-tasks
No Package · No Dependents
Maintenance 10 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 20 / 25
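These four component scores sum to the overall score: 10 + 10 + 16 + 20 = 56 / 100.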

How are scores calculated?

Stars: 347
Forks: 54
Language: Python
License: Apache-2.0
Last pushed: Feb 20, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/TIGER-AI-Lab/MMLU-Pro"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
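The same endpoint can be queried from Python, for example with requests. This is a minimal sketch; the shape of the JSON payload is not documented here, so inspect the response before relying on specific fields.

import requests

# Same endpoint as the curl example above; no API key is needed
# for up to 100 requests per day.
url = "https://pt-edge.onrender.com/api/v1/quality/transformers/TIGER-AI-Lab/MMLU-Pro"

response = requests.get(url, timeout=10)
response.raise_for_status()

# The payload schema is not documented here; print it to see the available fields.
print(response.json())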