EternityYW/TRAM-Benchmark
TRAM: Benchmarking Temporal Reasoning for Large Language Models (Findings of ACL 2024)
This project provides a comprehensive benchmark for evaluating how well large language models (LLMs) understand and reason about time in natural language. It offers a collection of over half a million multiple-choice questions across ten diverse temporal tasks. Researchers and developers working on LLMs can use this to assess and compare the temporal reasoning capabilities of different models.
No commits in the last 6 months.
Use this if you are developing or fine-tuning large language models and need to rigorously test their ability to handle time-related information and questions.
Not ideal if you are looking for a general-purpose natural language processing tool for non-temporal tasks or for direct integration into an application.
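As a rough illustration of the intended evaluation loop, the minimal Python sketch below formats one TRAM-style multiple-choice question and scores a model's lettered answer. The example question, the four-option layout, and the ask_model stub are hypothetical placeholders, not the repository's actual data format or API.

# Minimal sketch: score an LLM on one TRAM-style multiple-choice
# question. The question, options, and ask_model stub are hypothetical
# illustrations, not the repository's actual data or API.

def format_prompt(question, options):
    """Render a question plus lettered options as a single prompt."""
    lines = [question]
    for letter, option in zip("ABCD", options):
        lines.append(f"({letter}) {option}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

def ask_model(prompt):
    # Placeholder: swap in a real LLM call (API client, local model, etc.).
    return "B"

example = {
    "question": ("Jane finished lunch at 1:30 pm and her meeting began "
                 "45 minutes later. When did the meeting begin?"),
    "options": ["1:45 pm", "2:15 pm", "2:30 pm", "3:15 pm"],
    "answer": "B",
}

prompt = format_prompt(example["question"], example["options"])
prediction = ask_model(prompt).strip().upper()[:1]
print("correct" if prediction == example["answer"] else "incorrect")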
Stars: 26
Forks: 2
Language: Jupyter Notebook
License: MIT
Category:
Last pushed: Jun 21, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/EternityYW/TRAM-Benchmark"
Open to everyone: 100 requests/day with no key needed; a free key raises the limit to 1,000 requests/day.
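For scripted access, a short Python sketch like the one below wraps the same endpoint using only the standard library. It assumes the endpoint returns JSON; the response schema is not documented here, so the script simply pretty-prints whatever comes back rather than reading specific fields.

# Minimal sketch: fetch the quality data for this repo via the API.
# Assumes a JSON response; the schema is an assumption, so we just
# pretty-print the payload.
import json
import urllib.request

URL = ("https://pt-edge.onrender.com/api/v1/quality/"
       "transformers/EternityYW/TRAM-Benchmark")

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

print(json.dumps(data, indent=2))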
Higher-rated alternatives
stanfordnlp/axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
aidatatools/ollama-benchmark
LLM Benchmark for Throughput via Ollama (Local LLMs)
LarHope/ollama-benchmark
Ollama-based benchmark reporting detailed I/O tokens per second; written in Python, with a DeepSeek R1 example.
qcri/LLMeBench
Benchmarking Large Language Models
THUDM/LongBench
LongBench v2 and LongBench (ACL '25 & '24)