UMass-Meta-LLM-Eval/llm_eval
A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup that reveals new results about its strengths and weaknesses.
This project helps evaluate how well large language models (LLMs) perform when tasked with judging the quality of answers generated by other LLMs, which is a common practice for scaling up evaluations. It takes in configurations for various judge and exam-taker LLMs and benchmarks, then produces insights into the judge models' alignment with human judgments and their potential biases. This tool is for researchers and practitioners in AI who need to assess the reliability and fairness of LLMs acting as evaluators.
No commits in the last 6 months.
Use this if you are using LLMs to evaluate other LLMs and need a rigorous way to understand the strengths, weaknesses, and biases of these 'LLM-as-a-judge' systems.
Not ideal if you are looking for a simple tool to compare the raw performance of different LLMs on a specific task without focusing on the 'LLM-as-a-judge' evaluation paradigm.
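For a sense of what "alignment with human judgments" means in practice, here is a minimal, hypothetical sketch (not code from this repository) that scores a judge LLM's binary verdicts against human labels using raw agreement and Cohen's kappa:

```python
# Conceptual sketch only: the labels below are made-up illustrative data,
# not output from UMass-Meta-LLM-Eval/llm_eval.
from sklearn.metrics import cohen_kappa_score

# 1 = "exam-taker's answer judged correct", 0 = "judged incorrect"
human_labels = [1, 0, 1, 1, 0, 1, 0, 0]
judge_labels = [1, 0, 1, 0, 0, 1, 1, 0]

# Fraction of items where the judge LLM and the human annotator agree
agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)

# Chance-corrected agreement between judge and human
kappa = cohen_kappa_score(human_labels, judge_labels)

print(f"Raw agreement: {agreement:.2%}")
print(f"Cohen's kappa: {kappa:.3f}")
```

Chance-corrected metrics such as kappa matter here because a judge that marks nearly everything "correct" can still show high raw agreement on easy benchmarks.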
Stars: 9
Forks: 1
Language: Python
License: —
Category: —
Last pushed: Oct 01, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/UMass-Meta-LLM-Eval/llm_eval"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
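If you would rather fetch the listing data from a script, here is a minimal Python sketch of the same unauthenticated call (it assumes the endpoint returns JSON; the response fields are not documented here, so the raw payload is printed rather than guessed at):

```python
import requests

# Same endpoint as the curl example above; unauthenticated calls are
# limited to 100 requests/day per the note above.
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/UMass-Meta-LLM-Eval/llm_eval"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()          # fail loudly on HTTP errors or rate limiting
data = resp.json()               # assumes a JSON response body
print(data)                      # inspect the payload to see the available fields
```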
Higher-rated alternatives
eth-sri/matharena
Evaluation of LLMs on latest math competitions
tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality,...
HPAI-BSC/TuRTLe
TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025)
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
haesleinhuepf/human-eval-bia
Benchmarking Large Language Models for Bio-Image Analysis Code Generation