rungalileo/agent-leaderboard
Ranking LLMs on agentic tasks
This project helps businesses rigorously evaluate AI customer service agents for specific industry needs before deployment. It takes detailed descriptions of business scenarios, user goals, and domain-specific tools, then simulates multi-turn conversations to assess how well an AI agent completes tasks and selects the right tools. Operations managers, product teams, and enterprise architects would use this to validate AI agent performance.
Use this if you need to determine whether an AI agent can reliably handle complex, real-world customer interactions and multi-step tasks in a specific business domain such as banking, healthcare, or insurance.
Not ideal if you are looking for a simple benchmark of an LLM's general tool-calling ability rather than its performance in specific enterprise scenarios.
Stars: 217
Forks: 23
Language: Jupyter Notebook
License: MIT
Category:
Last pushed: Nov 18, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/agents/rungalileo/agent-leaderboard"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
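The same endpoint can be called programmatically. Below is a minimal Python sketch assuming the endpoint returns JSON and that an unauthenticated request stays within the 100-request daily limit; the response schema is not documented here, so the result is simply pretty-printed.

import json
import requests

# Endpoint from the curl example above; no API key needed for up to 100 requests/day.
URL = "https://pt-edge.onrender.com/api/v1/quality/agents/rungalileo/agent-leaderboard"

response = requests.get(URL, timeout=10)
response.raise_for_status()  # fail loudly on rate limiting or server errors

# Response body is assumed to be JSON; print it since the schema is not documented here.
data = response.json()
print(json.dumps(data, indent=2))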
Higher-rated alternatives
StonyBrookNLP/appworld
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and...
qualifire-dev/rogue
AI Agent Evaluator & Red Team Platform
microsoft/WindowsAgentArena
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of...
future-agi/ai-evaluation
Evaluation Framework for all your AI related Workflows
RouteWorks/RouterArena
RouterArena: An open framework for evaluating LLM routers with standardized datasets, metrics,...