THUDM/LongBench
LongBench v2 and LongBench (ACL '25 & '24)
LongBench v2 helps AI researchers and developers evaluate how well their large language models (LLMs) understand and reason over very long texts, ranging from 8,000 to 2 million words. It provides a standardized dataset of challenging multiple-choice questions across diverse tasks, and the output is an accuracy score showing how well your LLM handles complex, real-world long-context scenarios compared to human experts and other models.
1,113 stars. No commits in the last 6 months.
Use this if you are developing or fine-tuning large language models and need a robust, challenging benchmark to assess their deep understanding and reasoning abilities on extremely long, diverse contexts.
Not ideal if you are looking for a simple, quick evaluation for short-context LLM tasks or do not have the infrastructure to deploy and test large models with extensive context windows.
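To get a feel for how an evaluation run looks, below is a minimal sketch that loads the benchmark data and frames one example as a multiple-choice prompt. It assumes the dataset is published on Hugging Face as THUDM/LongBench-v2 and that examples carry question, choice_A through choice_D, context, and answer fields; check the repository's README for the exact identifiers.

# Minimal sketch: load LongBench v2 and build one multiple-choice prompt.
# The dataset ID and field names below are assumptions; verify them against the repo's README.
from datasets import load_dataset

dataset = load_dataset("THUDM/LongBench-v2", split="train")
example = dataset[0]

prompt = (
    f"{example['context']}\n\n"
    f"Question: {example['question']}\n"
    f"A. {example['choice_A']}\n"
    f"B. {example['choice_B']}\n"
    f"C. {example['choice_C']}\n"
    f"D. {example['choice_D']}\n"
    "Answer with a single letter (A, B, C, or D)."
)
# Send `prompt` to your long-context model, compare its letter to example["answer"],
# and average over the dataset to get an accuracy score.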
Stars: 1,113
Forks: 120
Language: Python
License: MIT
Category:
Last pushed: Jan 15, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/THUDM/LongBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
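If you would rather consume the endpoint from Python than curl, a small sketch using only the standard library is below. The JSON schema of the response is not documented on this page, so the code just pretty-prints whatever comes back rather than assuming specific keys.

# Sketch: fetch this repository's quality record from the API and pretty-print it.
# The response field names are not documented here; inspect the printed JSON
# before relying on specific keys.
import json
import urllib.request

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/THUDM/LongBench"
with urllib.request.urlopen(url, timeout=30) as resp:
    data = json.load(resp)

print(json.dumps(data, indent=2))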
Higher-rated alternatives
stanfordnlp/axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
aidatatools/ollama-benchmark
LLM Benchmark for Throughput via Ollama (Local LLMs)
LarHope/ollama-benchmark
Ollama-based benchmark reporting detailed I/O tokens per second; Python, with a DeepSeek R1 example.
qcri/LLMeBench
Benchmarking Large Language Models
microsoft/LLF-Bench
A benchmark for evaluating learning agents based on just language feedback