AgentBench and MemoryAgentBench

These two tools are complementary: AgentBench provides a broad benchmark for evaluating LLMs as agents, while MemoryAgentBench extends it with a specialized evaluation of memory capabilities through incremental multi-turn interactions.

Metric          AgentBench            MemoryAgentBench
Overall score   55 (Established)      46 (Emerging)
Maintenance     10/25                 10/25
Adoption        10/25                 10/25
Maturity        16/25                 7/25
Community       19/25                 19/25
Stars           3,234                 253
Forks           241                   41
Downloads       —                     —
Commits (30d)   0                     0
Language        Python                Python
License         Apache-2.0            (none)
Package         none (no dependents)  none (no dependents)

About AgentBench

THUDM/AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

This project helps developers and researchers evaluate how well large language models (LLMs) can act as autonomous 'agents' in various real-world scenarios. It takes an LLM as input and runs it through a standardized set of tasks, like interacting with an operating system, using a database, or shopping online. The output is a performance score, showing how effectively the LLM completes these multi-step, interactive tasks.
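To make the input/output shape concrete, here is a minimal sketch of an AgentBench-style evaluation loop. All names (`Task`, `evaluate_agent`, the "done" success criterion) are illustrative assumptions, not the project's actual API: the idea is simply that an agent is run against multi-step tasks and scored on completion.

```python
# Hypothetical sketch of a benchmark harness loop (NOT AgentBench's real API):
# an LLM agent is run against a set of multi-step tasks and scored per task.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    name: str
    goal: str
    max_steps: int

def evaluate_agent(agent: Callable[[str], str], tasks: list[Task]) -> dict[str, float]:
    """Run the agent on each task; report per-task success (0.0 or 1.0)."""
    results = {}
    for task in tasks:
        observation = f"Goal: {task.goal}"
        solved = False
        for _ in range(task.max_steps):
            action = agent(observation)   # agent proposes the next action
            if action == "done":          # toy success criterion
                solved = True
                break
            observation = f"After '{action}': goal is still {task.goal}"
        results[task.name] = 1.0 if solved else 0.0
    return results

# Toy agent that declares success immediately.
tasks = [Task("os-interaction", "list files", 5), Task("db-query", "count rows", 5)]
scores = evaluate_agent(lambda obs: "done", tasks)
print(scores)  # {'os-interaction': 1.0, 'db-query': 1.0}
```

A real harness would replace the toy agent with LLM calls and the string check with environment-specific success conditions, but the loop structure (task in, score out) is the same.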

Tags: LLM evaluation, agentic AI, AI research, model benchmarking, autonomous systems

About MemoryAgentBench

HUST-AI-HYZ/MemoryAgentBench

Open-source code for the ICLR 2026 paper "Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions"

This project helps AI developers and researchers evaluate how well their large language model (LLM) agents remember information over extended, multi-turn conversations. It takes an LLM agent and a dataset of questions and scenarios as input, then outputs performance metrics across key memory competencies like accurate retrieval and conflict resolution. This is for anyone building or researching AI assistants that need to maintain context and learn across many interactions.
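The evaluation pattern described above can be sketched in a few lines. Everything here (`ToyMemoryAgent`, `retrieval_accuracy`, the key/value fact format) is a hypothetical illustration, not MemoryAgentBench's real API: facts arrive incrementally over turns, later turns may overwrite earlier ones, and the agent is scored on accurate retrieval afterward.

```python
# Hypothetical sketch of a memory-retrieval check (NOT the benchmark's real API):
# facts are injected incrementally across turns, then the agent is queried
# and scored on how many it recalls correctly.

class ToyMemoryAgent:
    """Minimal agent with a dict-backed memory store."""
    def __init__(self):
        self.memory = {}

    def observe(self, key: str, value: str) -> None:
        self.memory[key] = value  # later turns overwrite (toy conflict resolution)

    def answer(self, key: str) -> str:
        return self.memory.get(key, "unknown")

def retrieval_accuracy(agent, turns, questions) -> float:
    for key, value in turns:      # incremental multi-turn injection
        agent.observe(key, value)
    correct = sum(agent.answer(k) == v for k, v in questions)
    return correct / len(questions)

agent = ToyMemoryAgent()
turns = [("capital:France", "Paris"), ("user:name", "Ada"), ("user:name", "Grace")]
questions = [("capital:France", "Paris"), ("user:name", "Grace")]  # latest value wins
print(retrieval_accuracy(agent, turns, questions))  # 1.0
```

A real LLM agent would hold memory in context or an external store rather than a dict, but the scoring shape (inject turns, query, measure recall) matches the competencies the benchmark describes, such as accurate retrieval and conflict resolution.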

Tags: LLM-development, AI-agent-evaluation, conversational-AI, natural-language-processing, machine-learning-research

Scores updated daily from GitHub, PyPI, and npm data.