strands-agents/evals
A comprehensive evaluation framework for AI agents and LLM applications.
This framework helps AI developers and ML engineers assess the performance of their AI agents and large language model (LLM) applications. It takes in test cases (inputs and expected outcomes) along with the agent's responses, then produces detailed reports on how well the AI performs against predefined criteria and simulated real-world scenarios. It is aimed at anyone building or improving AI systems who needs to ensure those systems are accurate, reliable, and helpful.
Use this if you are developing AI agents or LLM applications and need a systematic way to measure their output quality, analyze their decision-making processes, or simulate user interactions to identify areas for improvement.
Not ideal if you are a business user looking for a no-code solution to evaluate existing off-the-shelf AI products, as this framework requires programming knowledge to set up and integrate.
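To make the test-cases-in, report-out workflow concrete, here is a minimal sketch of that pattern in Python. All names here (TestCase, evaluate, the exact-match criterion) are hypothetical illustrations of the general approach, not the framework's actual API.

from dataclasses import dataclass

@dataclass
class TestCase:
    prompt: str    # input sent to the agent
    expected: str  # expected outcome to score against

def evaluate(agent, cases):
    # Run every test case through the agent and score each response.
    results = []
    for case in cases:
        response = agent(case.prompt)
        # Hypothetical criterion: exact string match. A real framework
        # would also support richer judges, rubrics, or trajectory checks.
        passed = response.strip() == case.expected.strip()
        results.append({"prompt": case.prompt, "response": response, "passed": passed})
    accuracy = sum(r["passed"] for r in results) / len(results)
    return {"accuracy": accuracy, "results": results}

if __name__ == "__main__":
    cases = [TestCase(prompt="What is 2 + 2?", expected="4")]
    report = evaluate(lambda prompt: "4", cases)  # stand-in agent for demonstration
    print(f"accuracy: {report['accuracy']:.0%}")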
Stars: 82
Forks: 21
Language: Python
License: Apache-2.0
Last pushed: Mar 13, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/agents/strands-agents/evals"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
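The same data can be fetched from Python. A minimal sketch using the requests library, assuming the endpoint returns a JSON payload with the fields shown above:

import requests

# Public endpoint shown above; no API key needed up to 100 requests/day.
url = "https://pt-edge.onrender.com/api/v1/quality/agents/strands-agents/evals"
resp = requests.get(url, timeout=10)
resp.raise_for_status()  # fail loudly on HTTP errors
data = resp.json()       # assumption: the API responds with JSON
print(data)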
Related agents
eve-mas/eve-parity
Equilibrium Verification Environment (EVE) is a formal verification tool for the automated...
usestrix/benchmarks
Evaluation harness for Strix agent
KazKozDev/murmur
A Mix of Agents Orchestration System for Distributed LLM Processing
tanvirbhachu/ai-bench
A CLI benchmark runner for testing AI Models quickly.
davidset13/intelligence_eval
This will allow any agent to use LLM evaluation benchmarks. Currently, this only supports the...