LLM Evaluation Benchmarking AI Agents

There are 9 LLM evaluation benchmarking agents tracked. One scores above 50 (the established tier). The highest-rated is strands-agents/evals at 53/100, with 82 stars.

Get all 9 projects as JSON:

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=agents&subcategory=llm-evaluation-benchmarking&limit=20"
```

The API is open to everyone: 100 requests/day with no key needed, or 1,000/day with a free key.
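The same query can be built programmatically. A minimal Python sketch, assuming only the endpoint and query parameters shown in the curl example above (the `build_quality_url` helper name is illustrative, not part of the API):

```python
from urllib.parse import urlencode

# Endpoint from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_quality_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Assemble the dataset query URL with properly encoded parameters."""
    params = {"domain": domain, "subcategory": subcategory, "limit": limit}
    return f"{BASE}?{urlencode(params)}"

url = build_quality_url("agents", "llm-evaluation-benchmarking")
# Pass this URL to any HTTP client (requests, urllib.request, curl, ...).
print(url)
```

Keeping URL construction in one helper avoids manual string concatenation mistakes when the parameters change (for example, a different `subcategory` or a higher `limit`).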

| # | Agent | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | strands-agents/evals | A comprehensive evaluation framework for AI agents and LLM applications. | 53 | Established |
| 2 | eve-mas/eve-parity | Equilibrium Verification Environment (EVE) is a formal verification tool for... | 37 | Emerging |
| 3 | usestrix/benchmarks | Evaluation harness for Strix agent | 34 | Emerging |
| 4 | KazKozDev/murmur | A Mix of Agents Orchestration System for Distributed LLM Processing | 21 | Experimental |
| 5 | tanvirbhachu/ai-bench | A CLI benchmark runner for testing AI models quickly. | 20 | Experimental |
| 6 | davidset13/intelligence_eval | This will allow any agent to use LLM evaluation benchmarks. Currently, this... | 19 | Experimental |
| 7 | VarshVishwakarma/stackbench | STACKBENCH is a multi-agent AI research copilot that evaluates developer... | 17 | Experimental |
| 8 | Software-Engineering-Arena/SWE-Model-Arena | Compare tool-calling models pairwise via multi-round evaluations for SE tasks. | 14 | Experimental |
| 9 | 42olver/ai-agent-benchmark-compendium | 🛠️ Discover and explore over 50 benchmarks for AI agents across key... | 14 | Experimental |