TIGER-AI-Lab/MMLU-Pro

The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024]

56 / 100 · Established

This project provides a challenging benchmark for evaluating large language models (LLMs) on complex reasoning tasks across 14 academic domains. It takes an LLM's responses to multiple-choice questions drawn from academic exams and textbooks and outputs an accuracy score, revealing how well the model understands and reasons with expert-level knowledge. It is aimed at AI researchers and developers who need to rigorously assess and compare the capabilities of different language models.
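At a high level, an evaluation run amounts to prompting a model on each question and comparing its chosen option letter to the gold answer. The sketch below is a minimal illustration, assuming the dataset is published on the Hugging Face Hub as TIGER-Lab/MMLU-Pro with question, options, and answer fields, and using a hypothetical ask_model stub in place of a real LLM call; see the repository's run scripts for the actual evaluation pipeline.

from datasets import load_dataset

# Dataset ID and field names are assumptions; check the repository's
# evaluation scripts for the exact schema used in the paper.
dataset = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

def ask_model(question: str, options: list[str]) -> str:
    """Hypothetical stand-in: query an LLM and return its chosen option letter ("A", "B", ...)."""
    return "A"  # replace this stub with a real model call

correct = 0
for example in dataset:
    prediction = ask_model(example["question"], example["options"])
    if prediction == example["answer"]:  # gold label stored as an option letter
        correct += 1

print(f"Accuracy: {correct / len(dataset):.2%}")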

Use this if you need to rigorously test and compare the advanced reasoning and expert-knowledge capabilities of various large language models on a challenging, robust dataset.

Not ideal if you are looking for a simple, quick evaluation of basic language comprehension, or if your models are not designed for multiple-choice academic reasoning.

large-language-models model-evaluation ai-benchmarking natural-language-understanding reasoning-tasks
No Package · No Dependents
Maintenance 10 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 20 / 25
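These four component scores sum to the overall score: 10 + 10 + 16 + 20 = 56 / 100.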

How are scores calculated?

Stars: 347
Forks: 54
Language: Python
License: Apache-2.0
Last pushed: Feb 20, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/TIGER-AI-Lab/MMLU-Pro"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
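The same endpoint can be queried from Python, for example with requests. This is a minimal sketch; the shape of the JSON payload is not documented here, so inspect the response before relying on specific fields.

import requests

# Same endpoint as the curl example above; no API key is needed
# for up to 100 requests per day.
url = "https://pt-edge.onrender.com/api/v1/quality/transformers/TIGER-AI-Lab/MMLU-Pro"

response = requests.get(url, timeout=10)
response.raise_for_status()

# The payload schema is not documented here; print it to see the available fields.
print(response.json())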