LLM Evaluation Benchmarking AI Agents

There are 9 LLM evaluation benchmarking agents tracked. One scores above 50 (the established tier). The highest-rated is strands-agents/evals at 53/100, with 82 stars.

Get all 9 projects as JSON:

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=agents&subcategory=llm-evaluation-benchmarking&limit=20"
```

The API is open to everyone: 100 requests/day with no key needed, or 1,000/day with a free key.
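The same query can be built programmatically. A minimal Python sketch, assuming only the endpoint and query parameters shown in the curl example above (the `build_quality_url` helper name is illustrative, not part of the API):

```python
from urllib.parse import urlencode

# Endpoint from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_quality_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Assemble the dataset query URL with properly encoded parameters."""
    params = {"domain": domain, "subcategory": subcategory, "limit": limit}
    return f"{BASE}?{urlencode(params)}"

url = build_quality_url("agents", "llm-evaluation-benchmarking")
# Pass this URL to any HTTP client (requests, urllib.request, curl, ...).
print(url)
```

Keeping URL construction in one helper avoids manual string concatenation mistakes when the parameters change (for example, a different `subcategory` or a higher `limit`).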

| # | Agent | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | strands-agents/evals | A comprehensive evaluation framework for AI agents and LLM applications. | 53 | Established |
| 2 | eve-mas/eve-parity | Equilibrium Verification Environment (EVE) is a formal verification tool for... | 37 | Emerging |
| 3 | usestrix/benchmarks | Evaluation harness for Strix agent | 34 | Emerging |
| 4 | KazKozDev/murmur | A Mix of Agents Orchestration System for Distributed LLM Processing | 21 | Experimental |
| 5 | tanvirbhachu/ai-bench | A CLI benchmark runner for testing AI models quickly. | 20 | Experimental |
| 6 | davidset13/intelligence_eval | This will allow any agent to use LLM evaluation benchmarks. Currently, this... | 19 | Experimental |
| 7 | VarshVishwakarma/stackbench | STACKBENCH is a multi-agent AI research copilot that evaluates developer... | 17 | Experimental |
| 8 | Software-Engineering-Arena/SWE-Model-Arena | Compare tool-calling models pairwise via multi-round evaluations for SE tasks. | 14 | Experimental |
| 9 | 42olver/ai-agent-benchmark-compendium | 🛠️ Discover and explore over 50 benchmarks for AI agents across key... | 14 | Experimental |