sylvain-wei/TIME

[NeurIPS 2025 D&B (Spotlight🌟)] TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenario


Experimental

This project provides a benchmark dataset and evaluation tools for assessing how well large language models (LLMs) understand and reason about time in real-world scenarios. It draws on text from Wikipedia, news articles, and dialogues, and reports detailed scores on an LLM's ability to handle text dense with temporal information, fast-changing events, and complex social interactions. It is aimed at researchers and developers working to improve LLM capabilities.

No commits in the last 6 months.

Use this if you are developing or evaluating large language models and need a comprehensive way to test their temporal reasoning skills across various real-world data types and specific temporal tasks.

Not ideal if you are an end user who wants to apply an LLM to a specific business problem rather than develop or benchmark the LLM itself.

LLM evaluation, natural language understanding, temporal reasoning, AI research, model benchmarking

No License, Stale 6m, No Package, No Dependents

Maintenance 2 / 25

Adoption 7 / 25

Maturity 7 / 25

Community 0 / 25


Stars

Forks

—

Language

Python

License

—

Featured in

Agent Memory in 2026: What Actually Works for Persistent AI

We Audited crewAI's AI Dependencies: Here's What the Data Says

Higher-rated alternatives

MemoriLabs/Memori

SQL Native Memory Layer for LLMs, AI Agents & Multi-Agent Systems

volcengine/OpenViking

OpenViking is an open-source context database designed specifically for AI Agents(such as...

mem0ai/mem0

Universal memory layer for AI Agents

zjunlp/LightMem

[ICLR 2026] LightMem: Lightweight and Efficient Memory-Augmented Generation

MemTensor/MemOS

AI memory OS for LLM and Agent systems(moltbot,clawdbot,openclaw), enabling persistent Skill...
