zchuz/TimeBench
The repository for the ACL 2024 paper "TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models"
TimeBench helps researchers and practitioners evaluate how well large language models (LLMs) understand and reason about time. You provide a set of LLMs and get back a detailed performance analysis across various temporal reasoning tasks, revealing their strengths and weaknesses in handling dates, sequences, and event durations. This is for anyone researching, developing, or deploying LLMs who needs to understand their temporal intelligence.
No commits in the last 6 months.
Use this if you need to rigorously test and compare different large language models' abilities to process and understand temporal information, from simple date arithmetic to complex event sequencing.
Not ideal if you are looking for a tool to train LLMs or apply them directly to a specific business problem, rather than to evaluate their fundamental temporal reasoning capabilities.
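To make the task concrete, here is a minimal sketch of the kind of check such a benchmark runs, covering the two ends of the range mentioned above (date arithmetic and event sequencing). The query_model function and both task items are hypothetical illustrations, not TimeBench's actual API or data.

from datetime import date, timedelta

def query_model(prompt: str) -> str:
    # Hypothetical placeholder: call your LLM client here and return its text answer.
    raise NotImplementedError

tasks = [
    # Date arithmetic: the gold answer is computed, not hand-written.
    {
        "prompt": "What date is 45 days after 2023-03-01? Answer as YYYY-MM-DD.",
        "gold": str(date(2023, 3, 1) + timedelta(days=45)),
    },
    # Event sequencing: infer order from partial temporal information.
    {
        "prompt": "Event A ended in 1969; Event B began in 1972. "
                  "Which event started earlier? Answer A or B.",
        "gold": "A",
    },
]

def accuracy() -> float:
    # Exact-match scoring over the task set.
    correct = sum(query_model(t["prompt"]).strip() == t["gold"] for t in tasks)
    return correct / len(tasks)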
Stars: 34
Forks: 2
Language: Python
License: MIT
Category: transformers
Last pushed: Jun 29, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/zchuz/TimeBench"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
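For programmatic use, here is a minimal Python sketch of the same request. The field names mentioned in the comments are assumptions about the JSON payload, not documented guarantees; inspect the actual response and adjust.

import requests

# Same endpoint as the curl example above.
url = "https://pt-edge.onrender.com/api/v1/quality/transformers/zchuz/TimeBench"
resp = requests.get(url, timeout=10)
resp.raise_for_status()

data = resp.json()
print(data)  # inspect the full payload first
# Keys such as "stars" or "last_pushed" are assumptions about the schema;
# replace them with whatever the API actually returns, e.g.:
# print(data.get("stars"), data.get("last_pushed"))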
Higher-rated alternatives
stanfordnlp/axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
aidatatools/ollama-benchmark
LLM Benchmark for Throughput via Ollama (Local LLMs)
LarHope/ollama-benchmark
Ollama-based benchmark with detailed I/O tokens-per-second reporting; written in Python, with a DeepSeek R1 example.
qcri/LLMeBench
Benchmarking Large Language Models
THUDM/LongBench
LongBench v2 and LongBench (ACL '25 & '24)