UMass-Meta-LLM-Eval/llm_eval

A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup that reveals new results about its strengths and weaknesses.

Score: 21 / 100 (Experimental)

This project helps evaluate how well large language models (LLMs) perform when tasked with judging the quality of answers generated by other LLMs, which is a common practice for scaling up evaluations. It takes in configurations for various judge and exam-taker LLMs and benchmarks, then produces insights into the judge models' alignment with human judgments and their potential biases. This tool is for researchers and practitioners in AI who need to assess the reliability and fairness of LLMs acting as evaluators.
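
To give a sense of the inputs involved, here is a minimal, purely hypothetical sketch of what a judge / exam-taker configuration for this kind of study could look like in Python (the field names, model names, and run_evaluation entry point are illustrative assumptions, not the repository's actual interface):

# Hypothetical configuration for an LLM-as-a-judge experiment.
# All names below are illustrative; consult the repository for its real config format.
config = {
    "judges": ["gpt-4o", "llama-3-70b-instruct"],             # models that grade answers
    "exam_takers": ["mistral-7b-instruct", "gpt-3.5-turbo"],  # models whose answers are graded
    "benchmarks": ["triviaqa"],                                # datasets the questions come from
    "reports": ["alignment_with_humans", "judge_bias"],        # insights produced about the judges
}
# run_evaluation(config)  # hypothetical entry point producing per-judge reports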

No commits in the last 6 months.

Use this if you are using LLMs to evaluate other LLMs and need a rigorous way to understand the strengths, weaknesses, and biases of these 'LLM-as-a-judge' systems.

Not ideal if you are looking for a simple tool to compare the raw performance of different LLMs on a specific task without focusing on the 'LLM-as-a-judge' evaluation paradigm.

Topics: LLM-evaluation · AI-benchmarking · model-bias · natural-language-processing · AI-research
Badges: No License · Stale 6m · No Package · No Dependents
Maintenance 0 / 25
Adoption 5 / 25
Maturity 8 / 25
Community 8 / 25

How are scores calculated?
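
(For this repository, the overall score appears to be the sum of the four category scores: 0 + 5 + 8 + 8 = 21 / 100.)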

Stars: 9
Forks: 1
Language: Python
License: None
Last pushed: Oct 01, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/UMass-Meta-LLM-Eval/llm_eval"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
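
For scripted access, a small Python sketch of the same request as the curl command above (the JSON structure of the response is not documented here, so the code only fetches and prints the payload for inspection):

import requests

# Same endpoint as the curl example; no API key is needed for up to 100 requests/day.
url = "https://pt-edge.onrender.com/api/v1/quality/transformers/UMass-Meta-LLM-Eval/llm_eval"

resp = requests.get(url, timeout=30)
resp.raise_for_status()   # fail loudly on HTTP errors or rate limiting

data = resp.json()        # quality metrics for the repository, as JSON
print(data)               # inspect the payload to see the exact field names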