OpenBMB/InfiniteBench
Code for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718
InfiniteBench provides a specialized dataset and evaluation framework for testing how well large language models handle extremely long contexts of over 100,000 tokens. It takes lengthy inputs such as books, code, and dialogues, and evaluates a model's ability to summarize them, answer questions about them, debug code, and perform calculations over them. It is aimed primarily at AI researchers and developers who need to understand the limits of advanced language models under extended context.
378 stars. No commits in the last 6 months.
Use this if you are developing or evaluating a large language model and need to thoroughly test its ability to process and reason over very long documents, beyond what traditional benchmarks offer.
Not ideal if you are looking for a benchmark to evaluate standard language model tasks with typical context lengths, or if your primary interest is in fine-tuning existing models for shorter-context applications.
Stars: 378
Forks: 32
Language: Python
License: MIT
Category:
Last pushed: Sep 25, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/OpenBMB/InfiniteBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
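The same endpoint can be queried programmatically. A minimal Python sketch, assuming the API returns JSON (the response schema and the `API_BASE` constant below are inferred from the curl example, not documented here):

```python
import json
import urllib.request

# Base URL taken from the curl example above (assumption: it serves any repo).
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"


def quality_url(repo: str) -> str:
    """Build the quality-endpoint URL for a repo like 'OpenBMB/InfiniteBench'."""
    return f"{API_BASE}/{repo}"


def fetch_quality(repo: str) -> dict:
    """Fetch the quality data for a repo; assumes a JSON response body."""
    with urllib.request.urlopen(quality_url(repo), timeout=10) as resp:
        return json.load(resp)


# Example: construct the URL used in the curl command above.
print(quality_url("OpenBMB/InfiniteBench"))
```

Within the free tier, `fetch_quality` can be called up to 100 times per day without a key.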
Higher-rated alternatives
stanfordnlp/axbench: Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
aidatatools/ollama-benchmark: LLM benchmark for throughput via Ollama (local LLMs)
LarHope/ollama-benchmark: Ollama-based benchmark with detailed I/O tokens-per-second reporting; Python, with a DeepSeek R1 example
qcri/LLMeBench: Benchmarking large language models
THUDM/LongBench: LongBench v2 and LongBench (ACL '25 & '24)