rungalileo/agent-leaderboard

Ranking LLMs on agentic tasks

Score: 47 / 100 (Emerging)

This project helps businesses rigorously evaluate AI customer service agents for specific industry needs before deployment. It takes detailed descriptions of business scenarios, user goals, and domain-specific tools, then simulates multi-turn conversations to assess how well an AI agent completes tasks and selects the right tools. Operations managers, product teams, and enterprise architects would use this to validate AI agent performance.


Use this if you need to determine whether an AI agent can reliably handle complex, real-world customer interactions and multi-step tasks within a specific business domain such as banking, healthcare, or insurance.

Not ideal if you are looking for a simple benchmark of an LLM's general tool-calling ability rather than its performance in specific enterprise scenarios.

Tags: AI agent evaluation · customer service automation · enterprise AI deployment · business process simulation · AI performance validation
No package · No dependents
Maintenance 6 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 15 / 25


Stars: 217
Forks: 23
Language: Jupyter Notebook
License: MIT
Last pushed: Nov 18, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/agents/rungalileo/agent-leaderboard"

Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000 requests/day.
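The same endpoint can be called from Python. A minimal sketch using only the standard library, assuming the endpoint returns JSON as the curl example suggests; the response field names are not documented here, so none are assumed, and `fetch_quality` is illustrative:

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality/agents"

def quality_url(owner: str, repo: str) -> str:
    """Build the per-repository quality endpoint URL."""
    return f"{BASE}/{owner}/{repo}"

def fetch_quality(owner: str, repo: str) -> dict:
    """GET the endpoint and parse the JSON body (illustrative)."""
    with urllib.request.urlopen(quality_url(owner, repo), timeout=10) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Matches the curl example above.
    print(quality_url("rungalileo", "agent-leaderboard"))
```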