UMass-Meta-LLM-Eval/llm_eval
A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup that reveals new results about its strengths and weaknesses.
This project helps evaluate how well large language models (LLMs) perform when tasked with judging the quality of answers generated by other LLMs, which is a common practice for scaling up evaluations. It takes in configurations for various judge and exam-taker LLMs and benchmarks, then produces insights into the judge models' alignment with human judgments and their potential biases. This tool is for researchers and practitioners in AI who need to assess the reliability and fairness of LLMs acting as evaluators.
No commits in the last 6 months.
Use this if you are using LLMs to evaluate other LLMs and need a rigorous way to understand the strengths, weaknesses, and biases of these 'LLM-as-a-judge' systems.
Not ideal if you are looking for a simple tool to compare the raw performance of different LLMs on a specific task without focusing on the 'LLM-as-a-judge' evaluation paradigm.
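For a sense of what "alignment with human judgments" means in practice, here is a minimal, hypothetical sketch (not code from this repository) that scores a judge LLM's binary verdicts against human labels using raw agreement and Cohen's kappa:

```python
# Conceptual sketch only: the labels below are made-up illustrative data,
# not output from UMass-Meta-LLM-Eval/llm_eval.
from sklearn.metrics import cohen_kappa_score

# 1 = "exam-taker's answer judged correct", 0 = "judged incorrect"
human_labels = [1, 0, 1, 1, 0, 1, 0, 0]
judge_labels = [1, 0, 1, 0, 0, 1, 1, 0]

# Fraction of items where the judge LLM and the human annotator agree
agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)

# Chance-corrected agreement between judge and human
kappa = cohen_kappa_score(human_labels, judge_labels)

print(f"Raw agreement: {agreement:.2%}")
print(f"Cohen's kappa: {kappa:.3f}")
```

Chance-corrected metrics such as kappa matter here because a judge that marks nearly everything "correct" can still show high raw agreement on easy benchmarks.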
Stars: 9
Forks: 1
Language: Python
License: —
Category: —
Last pushed: Oct 01, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/UMass-Meta-LLM-Eval/llm_eval"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
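If you would rather fetch the listing data from a script, here is a minimal Python sketch of the same unauthenticated call (it assumes the endpoint returns JSON; the response fields are not documented here, so the raw payload is printed rather than guessed at):

```python
import requests

# Same endpoint as the curl example above; unauthenticated calls are
# limited to 100 requests/day per the note above.
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/UMass-Meta-LLM-Eval/llm_eval"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()          # fail loudly on HTTP errors or rate limiting
data = resp.json()               # assumes a JSON response body
print(data)                      # inspect the payload to see the available fields
```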
Higher-rated alternatives
eth-sri/matharena
Evaluation of LLMs on latest math competitions
tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality,...
HPAI-BSC/TuRTLe
TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025)
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
haesleinhuepf/human-eval-bia
Benchmarking Large Language Models for Bio-Image Analysis Code Generation