caixd-220529/LifelongAgentBench
Code repo for "LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners"
This project helps AI researchers and developers evaluate how well large language model (LLM) agents learn continuously over time. It takes various LLM agents and scenarios (like database interactions or operating system tasks) as input, and outputs performance metrics that show whether agents can retain knowledge and adapt to new information. It is aimed at anyone building more robust, adaptable AI agents for real-world applications.
No commits in the last 6 months.
Use this if you need to systematically test and benchmark the 'lifelong learning' capabilities of your LLM agents across diverse environments and tasks.
Not ideal if you are looking for a tool to build or deploy LLM agents for specific applications rather than to evaluate their learning capabilities.
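To make the evaluation loop concrete, here is a conceptual sketch of the idea described above. It is not LifelongAgentBench's actual API; all names (Task, Metrics, evaluate_lifelong) are hypothetical. The point it illustrates is the input/output shape: an agent runs over an ordered stream of tasks, and per-skill success rates are tracked so that later re-tests of earlier skills reveal retention or forgetting.

```python
# Conceptual sketch only -- not LifelongAgentBench's real interfaces.
from dataclasses import dataclass, field

@dataclass
class Task:
    skill: str       # e.g. "database" or "operating_system"
    prompt: str
    expected: str    # expected answer for a simple exact-match check

@dataclass
class Metrics:
    attempts: dict = field(default_factory=dict)   # skill -> tries
    successes: dict = field(default_factory=dict)  # skill -> passes

    def record(self, skill: str, success: bool) -> None:
        self.attempts[skill] = self.attempts.get(skill, 0) + 1
        if success:
            self.successes[skill] = self.successes.get(skill, 0) + 1

    def success_rate(self, skill: str) -> float:
        return self.successes.get(skill, 0) / max(self.attempts.get(skill, 0), 1)

def evaluate_lifelong(agent, task_stream: list[Task]) -> Metrics:
    """Run tasks in order; agent is any callable that maps a prompt string
    to an answer string. Re-checking early skills later in the stream is
    what surfaces forgetting."""
    metrics = Metrics()
    for task in task_stream:
        answer = agent(task.prompt)
        metrics.record(task.skill, answer.strip() == task.expected)
    return metrics
```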
Stars: 80
Forks: 5
Language: Python
License: —
Category: —
Last pushed: May 30, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/caixd-220529/LifelongAgentBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
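The same data can be fetched from Python using only the standard library. The sketch below assumes the endpoint returns JSON; since the response schema is not documented here, it simply prints the full payload for inspection.

```python
# Fetch the listing data from the API endpoint shown above.
import json
import urllib.request

URL = ("https://pt-edge.onrender.com/api/v1/quality/"
       "llm-tools/caixd-220529/LifelongAgentBench")

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)  # assumes a JSON response body

print(json.dumps(data, indent=2))  # inspect the payload before relying on fields
```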
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems