princeton-nlp/LLMBar

[ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following

Score: 37 / 100 (Emerging)

This project evaluates how well large language models (LLMs) can judge whether another LLM's output correctly follows an instruction. Each example pairs an instruction with two different LLM responses and a gold-standard label marking which response is objectively better; the project then reports how accurately various LLMs, under different prompting strategies, match these gold-standard judgments. Anyone developing, deploying, or selecting LLMs for tasks that require precise instruction following would find it useful.
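To make that workflow concrete, here is a minimal Python sketch of the agreement metric described above. The field names (instruction, output_1, output_2, label) and the llm_judge callable are illustrative assumptions for this sketch, not the project's actual API:

# Sketch of the agreement metric: an LLM judge picks between two responses,
# and we measure how often its pick matches the gold label. `llm_judge` is a
# hypothetical stand-in for any LLM-as-evaluator call.

def evaluator_accuracy(examples, llm_judge):
    """Fraction of examples where the judge's preference matches the gold label."""
    correct = 0
    for ex in examples:
        # The judge returns 1 or 2: the index of the output it prefers.
        pick = llm_judge(ex["instruction"], ex["output_1"], ex["output_2"])
        correct += int(pick == ex["label"])
    return correct / len(examples)

# Example usage with a trivial stand-in judge that always prefers output 1.
if __name__ == "__main__":
    data = [
        {"instruction": "List three colors.",
         "output_1": "Red, green, blue.",
         "output_2": "I like colors.",
         "label": 1},
    ]
    print(evaluator_accuracy(data, lambda inst, o1, o2: 1))  # -> 1.0

In the actual benchmark the judge would be a prompted LLM, and the interesting comparison is how this accuracy varies across models and prompting methods.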

137 stars. No commits in the last 6 months.

Use this if you need to rigorously assess the capability of different LLMs to act as evaluators for instruction-following tasks, comparing their judgment accuracy against human-verified standards.

Not ideal if you're looking for a tool to generate instructions or fine-tune an LLM, as its purpose is strictly for evaluating the evaluation capabilities of LLMs.

Tags: LLM evaluation · natural language processing · AI model quality assurance · instruction following · model comparison
Stale (6m) · No Package · No Dependents
Maintenance: 0 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 11 / 25


Stars: 137
Forks: 11
Language: Python
License: MIT
Last pushed: Jul 08, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/princeton-nlp/LLMBar"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
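For programmatic use, the same request can be made from Python. This is a minimal sketch using only the standard library; it assumes the endpoint returns JSON, as the curl example implies, and makes no assumption about the response schema:

# Fetch the quality data for this repo and pretty-print whatever JSON
# the API returns; the response schema is not documented here.
import json
import urllib.request

URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/princeton-nlp/LLMBar"

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

print(json.dumps(data, indent=2))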