TIGER-AI-Lab/MMLU-Pro
The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024]
This project offers a challenging benchmark for evaluating large language models (LLMs) on complex reasoning tasks across 14 academic domains. It takes an LLM's responses to multiple-choice questions drawn from academic exams and textbooks and outputs an accuracy score, revealing how well the model understands and reasons with expert-level knowledge. It is aimed at AI researchers and developers who need to rigorously assess and compare the capabilities of different language models.
Use this if you need to thoroughly test and compare the advanced reasoning and knowledge understanding of various large language models using a challenging and robust dataset.
Not ideal if you are looking for a simple, quick evaluation of basic language comprehension, or if your models are not designed for multiple-choice academic reasoning.
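The repository's own scripts handle prompting, chain-of-thought generation, and answer extraction; as a rough illustration of the final accuracy computation only, here is a minimal sketch that loads the MMLU-Pro dataset from the Hugging Face Hub and scores predicted option letters against the gold answers. The `predict` function is a hypothetical stand-in for your own model call, and the field names (`question`, `options`, `answer`, `category`) are assumptions that should be checked against the actual dataset schema.

```python
# Minimal sketch of an MMLU-Pro-style accuracy computation.
# Assumes dataset fields `question`, `options`, `answer` (a letter), and `category`;
# `predict` is a hypothetical placeholder for your own model call and answer extraction.
from collections import defaultdict
from datasets import load_dataset


def predict(question: str, options: list[str]) -> str:
    """Placeholder: return the model's predicted option letter, e.g. 'A'..'J'."""
    return "A"


def evaluate(split: str = "test") -> None:
    data = load_dataset("TIGER-AI-Lab/MMLU-Pro", split=split)
    correct, total = defaultdict(int), defaultdict(int)
    for row in data:
        pred = predict(row["question"], row["options"])
        total[row["category"]] += 1
        if pred == row["answer"]:
            correct[row["category"]] += 1
    for cat in sorted(total):
        print(f"{cat:20s} {correct[cat] / total[cat]:.3f} ({total[cat]} questions)")
    print(f"overall accuracy: {sum(correct.values()) / sum(total.values()):.3f}")


if __name__ == "__main__":
    evaluate()
```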
Stars: 347
Forks: 54
Language: Python
License: Apache-2.0
Category:
Last pushed: Feb 20, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/TIGER-AI-Lab/MMLU-Pro"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
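If you would rather fetch the same data from Python than curl, a minimal sketch using `requests` is below; the response schema is not documented here, so it simply prints whatever JSON the endpoint returns.

```python
# Minimal sketch: fetch the same endpoint from Python instead of curl.
# The response schema is not documented here, so the payload is printed as-is.
import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/TIGER-AI-Lab/MMLU-Pro"

response = requests.get(URL, timeout=30)
response.raise_for_status()
print(response.json())
```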
Related models
ExtensityAI/symbolicai
A neurosymbolic perspective on LLMs
deep-symbolic-mathematics/LLM-SR
[ICLR 2025 Oral] This is the official repo for the paper "LLM-SR" on Scientific Equation...
microsoft/interwhen
A framework for verifiable reasoning with language models.
zhudotexe/fanoutqa
Companion code for FanOutQA: Multi-Hop, Multi-Document Question Answering for Large Language...
xlang-ai/Binder
[ICLR 2023] Code for the paper "Binding Language Models in Symbolic Languages"