THUDM/LongBench
LongBench v2 and LongBench (ACL '25 & '24)
LongBench v2 helps AI researchers and developers evaluate how well their large language models (LLMs) understand and reason over very long texts, ranging from 8,000 to 2 million words. It provides a standardized dataset of challenging multiple-choice questions across diverse tasks, and the output is an accuracy score showing how well your LLM handles complex, real-world long-context scenarios compared to human experts and other models.
1,113 stars. No commits in the last 6 months.
Use this if you are developing or fine-tuning large language models and need a robust, challenging benchmark to assess their deep understanding and reasoning abilities on extremely long, diverse contexts.
Not ideal if you are looking for a simple, quick evaluation for short-context LLM tasks or do not have the infrastructure to deploy and test large models with extensive context windows.
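To get a feel for how an evaluation run looks, below is a minimal sketch that loads the benchmark data and frames one example as a multiple-choice prompt. It assumes the dataset is published on Hugging Face as THUDM/LongBench-v2 and that examples carry question, choice_A through choice_D, context, and answer fields; check the repository's README for the exact identifiers.

# Minimal sketch: load LongBench v2 and build one multiple-choice prompt.
# The dataset ID and field names below are assumptions; verify them against the repo's README.
from datasets import load_dataset

dataset = load_dataset("THUDM/LongBench-v2", split="train")
example = dataset[0]

prompt = (
    f"{example['context']}\n\n"
    f"Question: {example['question']}\n"
    f"A. {example['choice_A']}\n"
    f"B. {example['choice_B']}\n"
    f"C. {example['choice_C']}\n"
    f"D. {example['choice_D']}\n"
    "Answer with a single letter (A, B, C, or D)."
)
# Send `prompt` to your long-context model, compare its letter to example["answer"],
# and average over the dataset to get an accuracy score.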
Stars: 1,113
Forks: 120
Language: Python
License: MIT
Category:
Last pushed: Jan 15, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/THUDM/LongBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
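If you would rather consume the endpoint from Python than curl, a small sketch using only the standard library is below. The JSON schema of the response is not documented on this page, so the code just pretty-prints whatever comes back rather than assuming specific keys.

# Sketch: fetch this repository's quality record from the API and pretty-print it.
# The response field names are not documented here; inspect the printed JSON
# before relying on specific keys.
import json
import urllib.request

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/THUDM/LongBench"
with urllib.request.urlopen(url, timeout=30) as resp:
    data = json.load(resp)

print(json.dumps(data, indent=2))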
Higher-rated alternatives
stanfordnlp/axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
aidatatools/ollama-benchmark
LLM Benchmark for Throughput via Ollama (Local LLMs)
LarHope/ollama-benchmark
Ollama-based benchmark reporting detailed I/O tokens per second; Python, with a DeepSeek R1 example.
qcri/LLMeBench
Benchmarking Large Language Models
microsoft/LLF-Bench
A benchmark for evaluating learning agents based on just language feedback