GAIR-NLP/scaleeval
Scalable Meta-Evaluation of LLMs as Evaluators
This tool helps AI product managers and researchers determine which large language models (LLMs) are best at evaluating the quality of other LLMs' outputs. You supply LLM responses, evaluation criteria (such as creativity or helpfulness), and task scenarios; the tool returns a ranked assessment of how well different LLMs perform as evaluators, based on an agent-debate simulation (a rough sketch of the scoring loop follows below).
No commits in the last 6 months.
Use this if you need to reliably select the best LLM to serve as an automated judge for the outputs of other LLMs.
Not ideal if you're looking for an LLM to generate content or perform direct tasks rather than evaluate other models.
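To make the idea concrete, here is a minimal, self-contained sketch of a meta-evaluation loop: candidate evaluator models are scored by how often their verdicts agree with reference (human) verdicts, then ranked. Everything here is a hypothetical illustration; the names, stub judges, and data are invented and are not ScaleEval's actual API, which additionally runs an agent debate between evaluators rather than using fixed stubs.

from typing import Callable

# Each item: (response_a, response_b, human_verdict), where the verdict
# names the preferred response, "a" or "b". Data is invented for illustration.
examples = [
    ("Answer A1", "Answer B1", "a"),
    ("Answer A2", "Answer B2", "b"),
    ("Answer A3", "Answer B3", "a"),
]

# Stub "evaluator LLMs": in a real pipeline these would call model APIs
# (and, in ScaleEval's design, debate before settling on a verdict).
def judge_always_a(a: str, b: str) -> str:
    return "a"

def judge_by_length(a: str, b: str) -> str:
    return "a" if len(a) >= len(b) else "b"

candidates: dict[str, Callable[[str, str], str]] = {
    "always-a": judge_always_a,
    "by-length": judge_by_length,
}

# Meta-evaluation score: agreement rate with the human reference verdicts.
scores = {
    name: sum(judge(a, b) == verdict for a, b, verdict in examples) / len(examples)
    for name, judge in candidates.items()
}

# Rank candidate evaluators from most to least human-aligned.
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.0%} agreement")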
Stars: 43
Forks: 3
Language: Python
License: —
Category: —
Last pushed: Feb 15, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/GAIR-NLP/scaleeval"
Open to everyone: 100 requests/day with no API key. Get a free key to raise the limit to 1,000 requests/day.
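For programmatic access, here is a short Python equivalent of the curl call, assuming the endpoint returns JSON; the response field names and the auth header shown are assumptions, so inspect the raw response and API docs to confirm.

import requests

# Quality-metrics endpoint for GAIR-NLP/scaleeval (same URL as the curl example).
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/GAIR-NLP/scaleeval"

# Anonymous access allows 100 requests/day; a free key raises that to 1,000.
# The header name below is a guess, not confirmed by the page.
headers = {}  # e.g. {"Authorization": "Bearer YOUR_KEY"}

resp = requests.get(URL, headers=headers, timeout=10)
resp.raise_for_status()
print(resp.json())  # assumed to be a JSON object of repo quality metrics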
Higher-rated alternatives
eth-sri/matharena
Evaluation of LLMs on latest math competitions
tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality,...
HPAI-BSC/TuRTLe
TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025)
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
haesleinhuepf/human-eval-bia
Benchmarking Large Language Models for Bio-Image Analysis Code Generation