zchuz/TimeBench
The repository for the ACL 2024 paper "TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models"
TimeBench helps researchers and practitioners evaluate how well large language models (LLMs) understand and reason about time. You provide a set of LLMs and get back a detailed performance analysis across various temporal reasoning tasks, revealing their strengths and weaknesses in handling dates, sequences, and event durations. This is for anyone researching, developing, or deploying LLMs who needs to understand their temporal intelligence.
No commits in the last 6 months.
Use this if you need to rigorously test and compare different large language models' abilities to process and understand temporal information, from simple date arithmetic to complex event sequencing.
Not ideal if you are looking for a tool to train LLMs or apply them directly to a specific business problem, rather than to evaluate their fundamental temporal reasoning capabilities.
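To make the task concrete, here is a minimal sketch of the kind of check such a benchmark runs, covering the two ends of the range mentioned above (date arithmetic and event sequencing). The query_model function and both task items are hypothetical illustrations, not TimeBench's actual API or data.

from datetime import date, timedelta

def query_model(prompt: str) -> str:
    # Hypothetical placeholder: call your LLM client here and return its text answer.
    raise NotImplementedError

tasks = [
    # Date arithmetic: the gold answer is computed, not hand-written.
    {
        "prompt": "What date is 45 days after 2023-03-01? Answer as YYYY-MM-DD.",
        "gold": str(date(2023, 3, 1) + timedelta(days=45)),
    },
    # Event sequencing: infer order from partial temporal information.
    {
        "prompt": "Event A ended in 1969; Event B began in 1972. "
                  "Which event started earlier? Answer A or B.",
        "gold": "A",
    },
]

def accuracy() -> float:
    # Exact-match scoring over the task set.
    correct = sum(query_model(t["prompt"]).strip() == t["gold"] for t in tasks)
    return correct / len(tasks)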
Stars: 34
Forks: 2
Language: Python
License: MIT
Category: transformers
Last pushed: Jun 29, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/zchuz/TimeBench"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
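For programmatic use, here is a minimal Python sketch of the same request. The field names mentioned in the comments are assumptions about the JSON payload, not documented guarantees; inspect the actual response and adjust.

import requests

# Same endpoint as the curl example above.
url = "https://pt-edge.onrender.com/api/v1/quality/transformers/zchuz/TimeBench"
resp = requests.get(url, timeout=10)
resp.raise_for_status()

data = resp.json()
print(data)  # inspect the full payload first
# Keys such as "stars" or "last_pushed" are assumptions about the schema;
# replace them with whatever the API actually returns, e.g.:
# print(data.get("stars"), data.get("last_pushed"))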
Higher-rated alternatives
stanfordnlp/axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
aidatatools/ollama-benchmark
LLM Benchmark for Throughput via Ollama (Local LLMs)
LarHope/ollama-benchmark
Ollama-based benchmark with detailed I/O tokens-per-second reporting; written in Python, with a DeepSeek R1 example.
qcri/LLMeBench
Benchmarking Large Language Models
THUDM/LongBench
LongBench v2 and LongBench (ACL '25 & '24)