THUDM/LongBench

LongBench v2 and LongBench (ACL '25 & '24)

Score: 45 / 100 (Emerging)

LongBench v2 helps AI researchers and developers evaluate how well their large language models (LLMs) understand and reason over very long texts, from 8,000 to 2 million words. It provides a standardized dataset of challenging multiple-choice questions across diverse tasks and reports a performance score showing how accurately your LLM handles complex, real-world long-context scenarios compared to human experts and other models.
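For a concrete picture of what evaluation looks like, here is a minimal Python sketch of scoring a model on the benchmark's multiple-choice questions. It assumes the dataset is published on the Hugging Face Hub as "THUDM/LongBench-v2" with fields such as context, question, choice_A through choice_D, and answer, and that you supply an answer_fn wrapping your model; those names are assumptions based on the description above, not guarantees from this page. Check the repository's README for the canonical loading instructions.

from datasets import load_dataset  # assumes the `datasets` package is installed

def evaluate(answer_fn, limit=10):
    # Assumed Hub ID, split, and field names; verify against the repo's README.
    ds = load_dataset("THUDM/LongBench-v2", split="train")
    correct = 0
    for sample in ds.select(range(limit)):
        prompt = (
            f"{sample['context']}\n\n"
            f"Question: {sample['question']}\n"
            f"A. {sample['choice_A']}\n"
            f"B. {sample['choice_B']}\n"
            f"C. {sample['choice_C']}\n"
            f"D. {sample['choice_D']}\n"
            "Answer with a single letter (A, B, C, or D)."
        )
        # answer_fn is your (hypothetical) model wrapper: prompt in, letter out.
        if answer_fn(prompt).strip().upper().startswith(sample["answer"]):
            correct += 1
    return correct / limit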

1,113 stars. No commits in the last 6 months.

Use this if you are developing or fine-tuning large language models and need a robust, challenging benchmark to assess their deep understanding and reasoning abilities on extremely long, diverse contexts.

Not ideal if you are looking for a simple, quick evaluation for short-context LLM tasks or do not have the infrastructure to deploy and test large models with extensive context windows.

large-language-models LLM-evaluation natural-language-processing AI-research long-context-understanding
Stale (6m) · No Package · No Dependents
Maintenance 0 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 19 / 25
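The four subscores above sum to the overall score (0 + 10 + 16 + 19 = 45); that arithmetic is an observation from the numbers on this page, not a documented formula. A one-line check in Python:

subscores = {"Maintenance": 0, "Adoption": 10, "Maturity": 16, "Community": 19}
print(sum(subscores.values()))  # 45, matching the 45 / 100 overall score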


Stars: 1,113
Forks: 120
Language: Python
License: MIT
Last pushed: Jan 15, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/THUDM/LongBench"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
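If you prefer Python over curl, a minimal sketch using the requests library is below. The endpoint URL is taken verbatim from the curl example; the response schema is not documented on this page, so the code prints the raw JSON rather than assuming any key names.

import requests  # assumes the `requests` package is installed

resp = requests.get(
    "https://pt-edge.onrender.com/api/v1/quality/transformers/THUDM/LongBench",
    timeout=10,
)
resp.raise_for_status()  # fail loudly on 4xx/5xx (e.g. hitting the rate limit)
print(resp.json())  # inspect the payload; the exact keys are undocumented here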