sylvain-wei/TIME
[NeurIPS 2025 D&B (Spotlight🌟)] TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenario
TIME is a benchmark dataset with evaluation tools for assessing how well large language models (LLMs) understand and reason about time in real-world scenarios. It draws on text from Wikipedia, news articles, and dialogues, and reports detailed scores on an LLM's ability to handle intensive temporal information, fast-changing events, and complex social interactions. It is aimed at researchers and developers working to improve LLM capabilities.
No commits in the last 6 months.
Use this if you are developing or evaluating large language models and need a comprehensive way to test their temporal reasoning skills across various real-world data types and specific temporal tasks.
Not ideal if you are an end-user looking to apply an LLM to a specific business problem, rather than developing or benchmarking the LLM itself.
Stars: 30
Forks: —
Language: Python
License: —
Last pushed: Oct 05, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/sylvain-wei/TIME"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
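The curl call above can also be made from Python. The following is a minimal sketch using only the standard library. The helper names `build_url` and `fetch_quality` are hypothetical, and because this page does not document the response format, the body is returned as raw text rather than parsed. How an API key is passed is not stated here either, so the sketch covers only the keyless request.

```python
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/rag"


def build_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for a repository (hypothetical helper)."""
    # Percent-encode each path segment so unusual names stay valid in a URL.
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the raw response body as text (hypothetical helper).

    The response format is not documented on this page, so no parsing
    is attempted; inspect the text before deciding how to decode it.
    """
    with urllib.request.urlopen(build_url(owner, repo), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `fetch_quality("sylvain-wei", "TIME")` issues the same GET request as the curl command. With the keyless limit of 100 requests/day, caching the result locally is a reasonable precaution when polling several repositories.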
Higher-rated alternatives
MemoriLabs/Memori
SQL Native Memory Layer for LLMs, AI Agents & Multi-Agent Systems
volcengine/OpenViking
OpenViking is an open-source context database designed specifically for AI Agents(such as...
mem0ai/mem0
Universal memory layer for AI Agents
zjunlp/LightMem
[ICLR 2026] LightMem: Lightweight and Efficient Memory-Augmented Generation
MemTensor/MemOS
AI memory OS for LLM and Agent systems(moltbot,clawdbot,openclaw), enabling persistent Skill...