LLM-Evaluation-s-Always-Fatiguing/leaf-playground

A framework for building scenario simulation projects in which both human and LLM-based agents can participate, with a user-friendly web UI to visualize simulations and support for automatic evaluation at the agent-action level.

Score: 23 / 100 · Experimental

Evaluating how well large language models (LLMs) perform in specific scenarios or tasks can be time-consuming and complex. This framework helps you define complex scenarios where human and LLM agents interact, then automatically evaluates the LLMs' actions and visualizes the results. This is ideal for AI researchers, product managers, or developers who need to rigorously test and compare different LLM agent behaviors.

No commits in the last 6 months.

Use this if you need to set up realistic, interactive simulations to benchmark and understand the performance of LLM-based agents, with built-in visualization and automated evaluation.

Not ideal if you are looking for a simple, single-metric evaluation tool for basic LLM prompts, rather than agent behavior in complex, multi-turn scenarios.

LLM evaluation · AI agent simulation · conversational AI testing · LLM application development · human-AI interaction
Stale (6m) · No Package · No Dependents
Maintenance 0 / 25
Adoption 7 / 25
Maturity 16 / 25
Community 0 / 25


Stars: 27
Forks:
Language: Python
License: MIT
Last pushed: Jun 18, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/LLM-Evaluation-s-Always-Fatiguing/leaf-playground"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
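The same data can be fetched programmatically. Below is a minimal Python sketch using only the standard library; it assumes the endpoint returns JSON, and the field layout of the response is not documented here, so the snippet simply prints whatever comes back.

import json
import urllib.request

# Fetch the quality record for a repo from the API endpoint shown above.
# Assumption: the endpoint responds with a JSON document.
def fetch_quality(repo: str) -> dict:
    url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/" + repo
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

if __name__ == "__main__":
    data = fetch_quality("LLM-Evaluation-s-Always-Fatiguing/leaf-playground")
    print(json.dumps(data, indent=2))  # inspect whatever fields the API returns

Within the free tier (100 requests/day) no authentication is needed; how an API key is supplied for the higher limit is not specified here, so that part is left out of the sketch.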