princeton-nlp/LLMBar
[ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following
This project evaluates how well large language models (LLMs) can judge whether another LLM's output correctly follows an instruction. Each example consists of an instruction, two different LLM responses to it, and a gold-standard label marking which response is objectively better. The project then reports how accurately various LLMs, using different prompting methods, match these gold-standard judgments. Anyone developing, deploying, or selecting LLMs for tasks that demand precise instruction following would find this useful.
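The core measurement described above can be sketched as a simple agreement score: for each example, check whether the judge LLM's preferred response matches the gold label. This is a minimal illustration with entirely hypothetical data, not the project's actual data format or evaluation code:

```python
# Hypothetical examples: each pairs an instruction with a gold label (which of
# the two responses truly follows the instruction better) and the judge LLM's
# pick. Labels 1 and 2 refer to the first and second candidate response.
examples = [
    {"instruction": "List three prime numbers.",   "gold": 1, "judge": 1},
    {"instruction": "Reply in French only.",       "gold": 2, "judge": 2},
    {"instruction": "Use exactly ten words.",      "gold": 1, "judge": 2},
]

# Accuracy = fraction of examples where the judge agrees with the gold label.
correct = sum(e["judge"] == e["gold"] for e in examples)
accuracy = correct / len(examples)
print(f"Judge agreement with gold labels: {accuracy:.2%}")  # prints 66.67%
```

Comparing this accuracy across judge models and prompting strategies is what lets the benchmark rank LLMs as evaluators.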
137 stars. No commits in the last 6 months.
Use this if you need to rigorously assess the capability of different LLMs to act as evaluators for instruction-following tasks, comparing their judgment accuracy against human-verified standards.
Not ideal if you're looking for a tool to generate instructions or fine-tune an LLM; its purpose is strictly to evaluate how well LLMs evaluate other LLMs.
Stars
137
Forks
11
Language
Python
License
MIT
Category
Last pushed
Jul 08, 2024
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/princeton-nlp/LLMBar"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
eth-sri/matharena
Evaluation of LLMs on latest math competitions
tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality,...
HPAI-BSC/TuRTLe
TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025)
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
haesleinhuepf/human-eval-bia
Benchmarking Large Language Models for Bio-Image Analysis Code Generation