OpenBMB/InfiniteBench
Code for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718
InfiniteBench provides a specialized dataset and evaluation framework for testing how well large language models handle extremely long contexts of over 100,000 tokens. It takes lengthy inputs such as books, code, and dialogues, and evaluates a model's ability to summarize them, answer questions about them, debug code, and perform calculations over them. It is aimed primarily at AI researchers and developers who need to understand the limits of advanced language models under extended context.
378 stars. No commits in the last 6 months.
Use this if you are developing or evaluating a large language model and need to thoroughly test its ability to process and reason over very long documents, beyond what traditional benchmarks offer.
Not ideal if you are looking for a benchmark to evaluate standard language model tasks with typical context lengths, or if your primary interest is in fine-tuning existing models for shorter-context applications.
Stars: 378
Forks: 32
Language: Python
License: MIT
Category:
Last pushed: Sep 25, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/OpenBMB/InfiniteBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
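The same endpoint can be queried programmatically. A minimal Python sketch, assuming the API returns JSON (the response schema and the `API_BASE` constant below are inferred from the curl example, not documented here):

```python
import json
import urllib.request

# Base URL taken from the curl example above (assumption: it serves any repo).
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"


def quality_url(repo: str) -> str:
    """Build the quality-endpoint URL for a repo like 'OpenBMB/InfiniteBench'."""
    return f"{API_BASE}/{repo}"


def fetch_quality(repo: str) -> dict:
    """Fetch the quality data for a repo; assumes a JSON response body."""
    with urllib.request.urlopen(quality_url(repo), timeout=10) as resp:
        return json.load(resp)


# Example: construct the URL used in the curl command above.
print(quality_url("OpenBMB/InfiniteBench"))
```

Within the free tier, `fetch_quality` can be called up to 100 times per day without a key.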
Higher-rated alternatives
stanfordnlp/axbench: Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
aidatatools/ollama-benchmark: LLM benchmark for throughput via Ollama (local LLMs)
LarHope/ollama-benchmark: Ollama-based benchmark with detailed I/O tokens-per-second reporting; Python, with a DeepSeek R1 example
qcri/LLMeBench: Benchmarking large language models
THUDM/LongBench: LongBench v2 and LongBench (ACL '25 & '24)