HUST-AI-HYZ/MemoryAgentBench
Open-source code for the ICLR 2026 paper: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
This project helps AI developers and researchers evaluate how well their large language model (LLM) agents remember information over extended, multi-turn conversations. It takes an LLM agent and a dataset of questions and scenarios as input, then outputs performance metrics across key memory competencies like accurate retrieval and conflict resolution. This is for anyone building or researching AI assistants that need to maintain context and learn across many interactions.
Use this if you are developing or studying LLM agents and need a standardized way to measure their long-term memory capabilities through realistic, incremental conversations.
Not ideal if you are an end-user looking for an AI agent to solve a specific problem, rather than evaluating the memory performance of such agents.
Stars: 253
Forks: 41
Language: Python
License: —
Category: —
Last pushed: Jan 27, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/HUST-AI-HYZ/MemoryAgentBench"
Open to everyone: 100 requests/day with no key required. Get a free key for 1,000 requests/day.
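The endpoint above follows a simple path pattern. A minimal Python sketch for building the URL for an arbitrary repository slug, assuming the `llm-tools` path segment shown in the curl example applies to other repositories as well:

```python
def quality_api_url(owner_repo: str) -> str:
    """Build the quality-data endpoint URL for a GitHub owner/repo slug.

    The path pattern is taken from the curl example above; whether it
    generalizes to every repository is an assumption.
    """
    return f"https://pt-edge.onrender.com/api/v1/quality/llm-tools/{owner_repo}"

print(quality_api_url("HUST-AI-HYZ/MemoryAgentBench"))
```

Fetch the returned JSON with `curl` as shown above, or pass the built URL to `urllib.request.urlopen` or `requests.get` in Python.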
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)