open-compass/Ada-LEval
The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
This tool helps AI researchers and developers systematically evaluate how well large language models (LLMs) handle very long inputs. You provide your own LLM or select a known model, and the tool reports accuracy across a range of context lengths on tasks such as ordering shuffled text segments or picking the best answer from a long document (a sketch of this evaluation loop follows below). It is aimed at practitioners building or fine-tuning LLMs who need to understand their model's long-context comprehension.
No commits in the last 6 months.
Use this if you are developing or deploying large language models and need a rigorous, length-adaptable benchmark to measure their ability to process and understand extensive textual inputs.
Not ideal if you are looking for an LLM for general use or a benchmark for short-context tasks, as this focuses specifically on challenging long-context comprehension.
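The per-length scoring the description implies can be sketched in a few lines: build a shuffled-segment case, ask the model for the permutation that restores the original order, and count only exact matches. Everything below (function names, segment counts, the scoring rule) is an illustrative assumption, not the repository's actual API.

import random

def make_ordering_case(document: str, n_segments: int = 4):
    # Split a document into roughly equal segments and shuffle them.
    # The gold label is the permutation that restores the original order.
    step = max(1, len(document) // n_segments)
    segments = [document[i:i + step] for i in range(0, len(document), step)][:n_segments]
    order = list(range(len(segments)))
    random.shuffle(order)
    shuffled = [segments[i] for i in order]
    gold = sorted(range(len(order)), key=order.__getitem__)  # inverse permutation
    return shuffled, gold

def exact_match_accuracy(preds, golds):
    # A prediction scores only if the whole permutation is correct; this strict
    # criterion is what makes accuracy fall off sharply as inputs grow longer.
    if not golds:
        return 0.0
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

Running the same loop at increasing document lengths (say 2k, 8k, 32k tokens) yields the per-length accuracy breakdown described above.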
Stars: 56
Forks: 3
Language: Python
License: —
Category: —
Last pushed: May 22, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/open-compass/Ada-LEval"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
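The same record can be fetched from a script. The snippet below assumes only what the curl line shows (an unauthenticated GET endpoint returning JSON); the response schema and the mechanism for supplying an API key are not documented here.

import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/open-compass/Ada-LEval"

resp = requests.get(URL, timeout=10)  # anonymous tier: 100 requests/day
resp.raise_for_status()               # surface HTTP errors (e.g. rate limiting)
data = resp.json()                    # field names are undocumented; inspect before relying on them
print(data)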
Higher-rated alternatives
eth-sri/matharena
Evaluation of LLMs on latest math competitions
tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality,...
HPAI-BSC/TuRTLe
TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025)
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
haesleinhuepf/human-eval-bia
Benchmarking Large Language Models for Bio-Image Analysis Code Generation