tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
This project evaluates how well an instruction-following language model (such as a chatbot) performs relative to other models: you supply your model's responses to a fixed set of instructions, and it reports a win rate against a reference model as a measure of quality (a sketch of the input format appears below the notes). The tool is aimed at AI researchers and developers who are building or fine-tuning large language models and need to assess performance quickly.
1,957 stars. No commits in the last 6 months.
Use this if you are developing or fine-tuning instruction-following language models and need a fast, affordable automatic evaluation that correlates strongly with human judgments to guide your iteration cycles.
Not ideal if you need a definitive evaluation for high-stakes decisions such as a model release, since automatic evaluators carry known biases (for example, a preference for longer outputs) and may not cover all potential risks.
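For reference, model outputs are typically supplied to alpaca_eval as a JSON list pairing each instruction with your model's response. The sketch below illustrates that format and the command-line call; the JSON field names and the alpaca_eval flag are assumptions based on the project's README and should be checked against the installed version.

import json

# Minimal sketch of the input format alpaca_eval expects: one entry per
# instruction in the evaluation set, pairing it with your model's response.
model_outputs = [
    {
        "instruction": "Explain what a win rate is in one sentence.",
        "output": "A win rate is the fraction of head-to-head comparisons in which your model's response is preferred over the reference model's.",
        "generator": "my-model-v1",  # identifier for the model being evaluated
    },
    # ... add one entry per instruction
]

with open("outputs.json", "w") as f:
    json.dump(model_outputs, f, indent=2)

# The evaluation itself runs from the command line (flag name assumed from the
# project's README; check `alpaca_eval --help` after `pip install alpaca-eval`):
#   alpaca_eval --model_outputs outputs.json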
Stars: 1,957
Forks: 305
Language: Jupyter Notebook
License: Apache-2.0
Category:
Last pushed: Aug 09, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/tatsu-lab/alpaca_eval"
Open to everyone: 100 requests/day with no key required; a free key raises the limit to 1,000 requests/day.
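The same request in Python, assuming the endpoint returns JSON; the response schema and the authentication header name are not documented here, so both are assumptions to verify against the API docs.

import json
import urllib.request

URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/tatsu-lab/alpaca_eval"

def fetch_repo_quality(url: str = URL, api_key: str | None = None) -> dict:
    """Fetch the repo-quality record; pass a key for the higher rate limit."""
    request = urllib.request.Request(url)
    if api_key is not None:
        # Header name is a guess; confirm the real one in the API docs.
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)

if __name__ == "__main__":
    print(json.dumps(fetch_repo_quality(), indent=2))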
Related repositories
eth-sri/matharena
Evaluation of LLMs on latest math competitions
HPAI-BSC/TuRTLe
TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025)
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
haesleinhuepf/human-eval-bia
Benchmarking Large Language Models for Bio-Image Analysis Code Generation
ShuntaroOkuma/adapt-gauge-core
Measure LLM adaptation efficiency — how fast models learn from few examples