EternityYW/TRAM-Benchmark
TRAM: Benchmarking Temporal Reasoning for Large Language Models (Findings of ACL 2024)
This project provides a comprehensive benchmark for evaluating how well large language models (LLMs) understand and reason about time in natural language. It offers a collection of over half a million multiple-choice questions across ten diverse temporal tasks. Researchers and developers working on LLMs can use this to assess and compare the temporal reasoning capabilities of different models.
No commits in the last 6 months.
Use this if you are developing or fine-tuning large language models and need to rigorously test their ability to handle time-related information and questions.
Not ideal if you are looking for a general-purpose natural language processing tool for non-temporal tasks or for direct integration into an application.
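As a rough illustration of the intended evaluation loop, the minimal Python sketch below formats one TRAM-style multiple-choice question and scores a model's lettered answer. The example question, the four-option layout, and the ask_model stub are hypothetical placeholders, not the repository's actual data format or API.

# Minimal sketch: score an LLM on one TRAM-style multiple-choice
# question. The question, options, and ask_model stub are hypothetical
# illustrations, not the repository's actual data or API.

def format_prompt(question, options):
    """Render a question plus lettered options as a single prompt."""
    lines = [question]
    for letter, option in zip("ABCD", options):
        lines.append(f"({letter}) {option}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

def ask_model(prompt):
    # Placeholder: swap in a real LLM call (API client, local model, etc.).
    return "B"

example = {
    "question": ("Jane finished lunch at 1:30 pm and her meeting began "
                 "45 minutes later. When did the meeting begin?"),
    "options": ["1:45 pm", "2:15 pm", "2:30 pm", "3:15 pm"],
    "answer": "B",
}

prompt = format_prompt(example["question"], example["options"])
prediction = ask_model(prompt).strip().upper()[:1]
print("correct" if prediction == example["answer"] else "incorrect")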
Stars: 26
Forks: 2
Language: Jupyter Notebook
License: MIT
Category:
Last pushed: Jun 21, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/EternityYW/TRAM-Benchmark"
Open to everyone: 100 requests/day with no key needed; a free key raises the limit to 1,000 requests/day.
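For scripted access, a short Python sketch like the one below wraps the same endpoint using only the standard library. It assumes the endpoint returns JSON; the response schema is not documented here, so the script simply pretty-prints whatever comes back rather than reading specific fields.

# Minimal sketch: fetch the quality data for this repo via the API.
# Assumes a JSON response; the schema is an assumption, so we just
# pretty-print the payload.
import json
import urllib.request

URL = ("https://pt-edge.onrender.com/api/v1/quality/"
       "transformers/EternityYW/TRAM-Benchmark")

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

print(json.dumps(data, indent=2))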
Higher-rated alternatives
stanfordnlp/axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
aidatatools/ollama-benchmark
LLM Benchmark for Throughput via Ollama (Local LLMs)
LarHope/ollama-benchmark
Ollama-based benchmark reporting detailed I/O tokens per second; written in Python, with a DeepSeek R1 example.
qcri/LLMeBench
Benchmarking Large Language Models
THUDM/LongBench
LongBench v2 and LongBench (ACL '25 & '24)