princeton-nlp/LLMBar
[ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following
This project evaluates how well large language models (LLMs) can judge whether another LLM's output correctly follows an instruction. Each example consists of an instruction, two different LLM responses to it, and a gold-standard label marking which response is objectively better. The project then reports how accurately various LLMs, using different prompting methods, match these gold-standard judgments. Anyone developing, deploying, or selecting LLMs for tasks that demand precise instruction following would find this useful.
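The core measurement described above can be sketched as a simple agreement score: for each example, check whether the judge LLM's preferred response matches the gold label. This is a minimal illustration with entirely hypothetical data, not the project's actual data format or evaluation code:

```python
# Hypothetical examples: each pairs an instruction with a gold label (which of
# the two responses truly follows the instruction better) and the judge LLM's
# pick. Labels 1 and 2 refer to the first and second candidate response.
examples = [
    {"instruction": "List three prime numbers.",   "gold": 1, "judge": 1},
    {"instruction": "Reply in French only.",       "gold": 2, "judge": 2},
    {"instruction": "Use exactly ten words.",      "gold": 1, "judge": 2},
]

# Accuracy = fraction of examples where the judge agrees with the gold label.
correct = sum(e["judge"] == e["gold"] for e in examples)
accuracy = correct / len(examples)
print(f"Judge agreement with gold labels: {accuracy:.2%}")  # prints 66.67%
```

Comparing this accuracy across judge models and prompting strategies is what lets the benchmark rank LLMs as evaluators.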
137 stars. No commits in the last 6 months.
Use this if you need to rigorously assess the capability of different LLMs to act as evaluators for instruction-following tasks, comparing their judgment accuracy against human-verified standards.
Not ideal if you're looking for a tool to generate instructions or fine-tune an LLM; its purpose is strictly to evaluate how well LLMs evaluate other LLMs.
Stars
137
Forks
11
Language
Python
License
MIT
Category
Last pushed
Jul 08, 2024
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/princeton-nlp/LLMBar"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
eth-sri/matharena
Evaluation of LLMs on latest math competitions
tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality,...
HPAI-BSC/TuRTLe
TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025)
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
haesleinhuepf/human-eval-bia
Benchmarking Large Language Models for Bio-Image Analysis Code Generation