sylvain-wei/TIME

[NeurIPS 2025 D&B (Spotlight🌟)] TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenario

Score: 16 / 100 (Experimental)

This project offers a specialized benchmark dataset and evaluation tools to assess how well large language models (LLMs) understand and reason about time in real-world situations. It takes text data from Wikipedia, news articles, and dialogues as input and provides detailed scores on an LLM's ability to handle intensive temporal information, fast-changing events, and complex social interactions. Researchers and developers working on improving LLM capabilities will find this useful.

No commits in the last 6 months.

Use this if you are developing or evaluating large language models and need a comprehensive way to test their temporal reasoning skills across various real-world data types and specific temporal tasks.

Not ideal if you are an end-user looking to apply an LLM to a specific business problem, rather than developing or benchmarking the LLM itself.

Tags: LLM evaluation, natural language understanding, temporal reasoning, AI research, model benchmarking
Status: No License, Stale 6m, No Package, No Dependents
Maintenance 2 / 25
Adoption 7 / 25
Maturity 7 / 25
Community 0 / 25
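
The four category scores above add up to the 16 / 100 figure shown at the top. The page does not state the formula, so the following is only a sketch that assumes the overall score is the plain sum of the four categories, each capped at 25; the variable names are illustrative.

```python
# Category scores as listed on this page, each out of 25.
category_scores = {
    "Maintenance": 2,
    "Adoption": 7,
    "Maturity": 7,
    "Community": 0,
}

# Assumption: the overall score is the unweighted sum of the categories.
total = sum(category_scores.values())
max_total = 25 * len(category_scores)

print(f"{total} / {max_total}")
```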

Stars: 30
Forks:
Language: Python
License:
Last pushed: Oct 05, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/rag/sylvain-wei/TIME"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
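
The same request can be made from Python using only the standard library. This is a minimal sketch, not an official client: it assumes the endpoint returns JSON and that other repositories follow the same owner/repo path pattern as the URL above. The helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

# Base path copied from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/rag"


def quality_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository (path pattern is assumed)."""
    return f"{BASE_URL}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the quality record; assumes the response body is JSON."""
    with urllib.request.urlopen(quality_url(owner, repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("sylvain-wei", "TIME")` performs the same GET request as the curl command. With the keyless limit of 100 requests/day, it is worth caching results rather than polling.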