tatsu-lab/alpaca_eval

An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.

51 / 100 (Established)

This project helps evaluate how well your instruction-following language model (like a chatbot) performs compared to others. You input your model's responses to a set of instructions, and it provides a win-rate score against a reference model, indicating its quality. This tool is for AI researchers and developers who are building or fine-tuning large language models and need to quickly assess their performance.
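The typical loop looks like this: generate your model's answers on the AlpacaEval instruction set, save them as JSON, then run the evaluator to get a win rate. The sketch below is a minimal example following the format described in the project README (records with "instruction", "output", and "generator" fields, and an alpaca_eval command that takes --model_outputs); the generate_response helper and the model name are placeholders, and exact arguments may differ across versions.

import json
import datasets  # Hugging Face `datasets` library

def generate_response(instruction: str) -> str:
    # Placeholder: call your own model here.
    raise NotImplementedError

# Load the AlpacaEval instruction set (newer `datasets` versions may need
# trust_remote_code=True).
eval_set = datasets.load_dataset("tatsu-lab/alpaca_eval", "alpaca_eval")["eval"]

# Collect outputs in the JSON format the evaluator expects.
model_outputs = [
    {
        "instruction": example["instruction"],
        "output": generate_response(example["instruction"]),
        "generator": "my_model",  # placeholder model name
    }
    for example in eval_set
]

with open("model_outputs.json", "w") as f:
    json.dump(model_outputs, f)

# Then compute the win rate against the reference model, for example:
#   alpaca_eval --model_outputs model_outputs.json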

1,957 stars. No commits in the last 6 months.

Use this if you are developing or fine-tuning instruction-following language models and need a fast, affordable, and highly correlated automatic evaluation method to guide your iteration cycles.

Not ideal if you need a definitive evaluation for high-stakes decisions like model release, as automatic evaluators can have biases and may not cover all potential risks.

LLM evaluation · model development · chatbot performance · AI research · natural language processing
Stale (6m) · No Package · No Dependents
Maintenance: 2 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 23 / 25


Stars: 1,957
Forks: 305
Language: Jupyter Notebook
License: Apache-2.0
Last pushed: Aug 09, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/tatsu-lab/alpaca_eval"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
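The same report can be fetched from a script. Below is a minimal Python sketch using the requests library against the endpoint shown above; the structure of the returned JSON is not documented here, so the example simply prints the payload rather than assuming field names.

import requests

# Quality report endpoint shown above; no API key needed within the
# free 100 requests/day limit.
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/tatsu-lab/alpaca_eval"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()

report = resp.json()
# Inspect the payload to see which fields (stars, forks, scores, ...) it exposes.
print(report)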