GoodAI/goodai-ltm-benchmark
A library for benchmarking the long-term memory and continual-learning capabilities of LLM-based agents, with all the tests and code you need to evaluate your own agents. See more in the blog post:
This tool helps AI researchers and developers evaluate how well their conversational AI agents remember information over very long interactions. You feed it your LLM-based agent, choose a benchmark configuration, and it produces a detailed HTML report showing how accurately the agent retrieves and integrates information across extended conversations and a variety of memory tasks. It is aimed at AI developers, researchers, and anyone building advanced LLM agents.
No commits in the last 6 months.
Use this if you are building conversational AI agents and need to systematically test and compare their long-term memory over extensive dialogues and complex information-recall scenarios.
Not ideal if you are looking for a simple API to integrate an LLM into an application, or if your primary focus is short-term conversational quality rather than long-term memory.
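To make the memory tasks concrete, here is a tiny, framework-agnostic Python sketch of the pattern such benchmarks automate at scale: plant a fact early in a long conversation, bury it under distractor turns, then check recall. The agent callable and the probe_recall helper are hypothetical placeholders for illustration, not this repository's actual API; see the repo's configurations and runner scripts for the real interface.

    # Hypothetical sketch of a long-term-memory probe. The `agent`
    # callable is a stand-in for whatever chat interface you test;
    # it is NOT the goodai-ltm-benchmark API.
    from typing import Callable

    def probe_recall(agent: Callable[[str], str], distractor_turns: int = 200) -> bool:
        """Plant a fact, bury it under filler turns, then test recall."""
        agent("Remember this: the launch code is 7-4-1-9.")
        for i in range(distractor_turns):
            # Distractor turns push the fact far back in the conversation.
            agent(f"Unrelated question #{i}: what is a synonym for 'quick'?")
        answer = agent("What was the launch code I told you earlier?")
        return "7-4-1-9" in answer

The actual benchmark runs many such scenarios across different task types and aggregates the scores into the HTML report described above.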
Stars: 84
Forks: 12
Language: HTML
License: —
Category:
Last pushed: Dec 17, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/GoodAI/goodai-ltm-benchmark"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
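For scripted access, here is a minimal Python sketch calling the same endpoint as the curl command above. It assumes the response body is JSON; the exact field names are not documented on this page, so it simply pretty-prints whatever comes back.

    import json
    import urllib.request

    # Same endpoint as the curl command above; no API key is needed
    # within the free 100-requests/day tier.
    URL = ("https://pt-edge.onrender.com/api/v1/quality/"
           "llm-tools/GoodAI/goodai-ltm-benchmark")

    with urllib.request.urlopen(URL) as resp:
        data = json.load(resp)  # assumes a JSON response body

    print(json.dumps(data, indent=2))  # field names are undocumented here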
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems