GAIR-NLP/scaleeval
Scalable Meta-Evaluation of LLMs as Evaluators
This tool helps AI product managers and researchers determine which large language models (LLMs) are best at evaluating the quality of other LLMs' outputs. You supply LLM responses, evaluation criteria (such as creativity or helpfulness), and task scenarios; the tool returns a ranked assessment of how well different LLMs perform as evaluators, based on an agent-debate simulation (a rough sketch of the scoring loop follows below).
No commits in the last 6 months.
Use this if you need to reliably select the best LLM to serve as an automated judge for the outputs of other LLMs.
Not ideal if you're looking for an LLM to generate content or perform direct tasks rather than evaluate other models.
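To make the idea concrete, here is a minimal, self-contained sketch of a meta-evaluation loop: candidate evaluator models are scored by how often their verdicts agree with reference (human) verdicts, then ranked. Everything here is a hypothetical illustration; the names, stub judges, and data are invented and are not ScaleEval's actual API, which additionally runs an agent debate between evaluators rather than using fixed stubs.

from typing import Callable

# Each item: (response_a, response_b, human_verdict), where the verdict
# names the preferred response, "a" or "b". Data is invented for illustration.
examples = [
    ("Answer A1", "Answer B1", "a"),
    ("Answer A2", "Answer B2", "b"),
    ("Answer A3", "Answer B3", "a"),
]

# Stub "evaluator LLMs": in a real pipeline these would call model APIs
# (and, in ScaleEval's design, debate before settling on a verdict).
def judge_always_a(a: str, b: str) -> str:
    return "a"

def judge_by_length(a: str, b: str) -> str:
    return "a" if len(a) >= len(b) else "b"

candidates: dict[str, Callable[[str, str], str]] = {
    "always-a": judge_always_a,
    "by-length": judge_by_length,
}

# Meta-evaluation score: agreement rate with the human reference verdicts.
scores = {
    name: sum(judge(a, b) == verdict for a, b, verdict in examples) / len(examples)
    for name, judge in candidates.items()
}

# Rank candidate evaluators from most to least human-aligned.
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.0%} agreement")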
Stars: 43
Forks: 3
Language: Python
License: —
Category: —
Last pushed: Feb 15, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/GAIR-NLP/scaleeval"
Open to everyone: 100 requests/day with no API key. Get a free key to raise the limit to 1,000 requests/day.
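For programmatic access, here is a short Python equivalent of the curl call, assuming the endpoint returns JSON; the response field names and the auth header shown are assumptions, so inspect the raw response and API docs to confirm.

import requests

# Quality-metrics endpoint for GAIR-NLP/scaleeval (same URL as the curl example).
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/GAIR-NLP/scaleeval"

# Anonymous access allows 100 requests/day; a free key raises that to 1,000.
# The header name below is a guess, not confirmed by the page.
headers = {}  # e.g. {"Authorization": "Bearer YOUR_KEY"}

resp = requests.get(URL, headers=headers, timeout=10)
resp.raise_for_status()
print(resp.json())  # assumed to be a JSON object of repo quality metrics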
Higher-rated alternatives
eth-sri/matharena
Evaluation of LLMs on latest math competitions
tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality,...
HPAI-BSC/TuRTLe
TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025)
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
haesleinhuepf/human-eval-bia
Benchmarking Large Language Models for Bio-Image Analysis Code Generation