strands-agents/evals
A comprehensive evaluation framework for AI agents and LLM applications.
This framework helps AI developers and ML engineers assess the performance of their AI agents and large language model (LLM) applications. It takes in test cases (inputs and expected outcomes) along with the agent's responses, then produces detailed reports on how well the AI performs against predefined criteria and simulated real-world scenarios. It is aimed at anyone building or improving AI systems who needs to ensure those systems are accurate, reliable, and helpful.
Use this if you are developing AI agents or LLM applications and need a systematic way to measure their output quality, analyze their decision-making processes, or simulate user interactions to identify areas for improvement.
Not ideal if you are a business user looking for a no-code solution to evaluate existing off-the-shelf AI products, as this framework requires programming knowledge to set up and integrate.
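To make the test-cases-in, report-out workflow concrete, here is a minimal sketch of that pattern in Python. All names here (TestCase, evaluate, the exact-match criterion) are hypothetical illustrations of the general approach, not the framework's actual API.

from dataclasses import dataclass

@dataclass
class TestCase:
    prompt: str    # input sent to the agent
    expected: str  # expected outcome to score against

def evaluate(agent, cases):
    # Run every test case through the agent and score each response.
    results = []
    for case in cases:
        response = agent(case.prompt)
        # Hypothetical criterion: exact string match. A real framework
        # would also support richer judges, rubrics, or trajectory checks.
        passed = response.strip() == case.expected.strip()
        results.append({"prompt": case.prompt, "response": response, "passed": passed})
    accuracy = sum(r["passed"] for r in results) / len(results)
    return {"accuracy": accuracy, "results": results}

if __name__ == "__main__":
    cases = [TestCase(prompt="What is 2 + 2?", expected="4")]
    report = evaluate(lambda prompt: "4", cases)  # stand-in agent for demonstration
    print(f"accuracy: {report['accuracy']:.0%}")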
Stars: 82
Forks: 21
Language: Python
License: Apache-2.0
Last pushed: Mar 13, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/agents/strands-agents/evals"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
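The same data can be fetched from Python. A minimal sketch using the requests library, assuming the endpoint returns a JSON payload with the fields shown above:

import requests

# Public endpoint shown above; no API key needed up to 100 requests/day.
url = "https://pt-edge.onrender.com/api/v1/quality/agents/strands-agents/evals"
resp = requests.get(url, timeout=10)
resp.raise_for_status()  # fail loudly on HTTP errors
data = resp.json()       # assumption: the API responds with JSON
print(data)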
Related agents
eve-mas/eve-parity
Equilibrium Verification Environment (EVE) is a formal verification tool for the automated...
usestrix/benchmarks
Evaluation harness for Strix agent
KazKozDev/murmur
A Mix of Agents Orchestration System for Distributed LLM Processing
tanvirbhachu/ai-bench
A CLI benchmark runner for testing AI Models quickly.
davidset13/intelligence_eval
This will allow any agent to use LLM evaluation benchmarks. Currently, this only supports the...