HUST-AI-HYZ/MemoryAgentBench
Open-source code for the ICLR 2026 paper: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
This project helps AI developers and researchers evaluate how well their large language model (LLM) agents remember information over extended, multi-turn conversations. It takes an LLM agent and a dataset of questions and scenarios as input, then outputs performance metrics across key memory competencies like accurate retrieval and conflict resolution. This is for anyone building or researching AI assistants that need to maintain context and learn across many interactions.
Use this if you are developing or studying LLM agents and need a standardized way to measure their long-term memory capabilities through realistic, incremental conversations.
Not ideal if you are an end-user looking for an AI agent to solve a specific problem, rather than evaluating the memory performance of such agents.
Stars: 253
Forks: 41
Language: Python
License: —
Category: —
Last pushed: Jan 27, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/HUST-AI-HYZ/MemoryAgentBench"
Open to everyone: 100 requests/day with no key required. Get a free key for 1,000 requests/day.
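The endpoint above follows a simple path pattern. A minimal Python sketch for building the URL for an arbitrary repository slug, assuming the `llm-tools` path segment shown in the curl example applies to other repositories as well:

```python
def quality_api_url(owner_repo: str) -> str:
    """Build the quality-data endpoint URL for a GitHub owner/repo slug.

    The path pattern is taken from the curl example above; whether it
    generalizes to every repository is an assumption.
    """
    return f"https://pt-edge.onrender.com/api/v1/quality/llm-tools/{owner_repo}"

print(quality_api_url("HUST-AI-HYZ/MemoryAgentBench"))
```

Fetch the returned JSON with `curl` as shown above, or pass the built URL to `urllib.request.urlopen` or `requests.get` in Python.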
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)