om-ai-lab/open-agent-leaderboard

Reproducible Language Agent Research

Overall score: 23 / 100
Experimental

This project helps AI researchers and developers compare the performance of language agents across benchmarks and large language models (LLMs). Given an agent algorithm (such as Chain-of-Thought or ReAct) and an LLM, it produces a comparable performance score on datasets such as GSM8K or MATH-500. It is aimed at anyone developing, evaluating, or selecting language agents for problem-solving tasks.
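For illustration, the sketch below shows the kind of agent-versus-agent comparison such a leaderboard reports. It is a generic, hypothetical example: the agent functions, dataset items, and scoring are placeholders and are not this project's code or API.

# Hypothetical sketch of a leaderboard-style comparison: score several
# agent algorithms against a small set of exact-answer problems.
# None of these names come from the project; they are illustrative only.

def cot_agent(question: str) -> str:
    # Placeholder "Chain-of-Thought" agent: a real agent would call an LLM.
    return "42"

def react_agent(question: str) -> str:
    # Placeholder "ReAct" agent.
    return "7"

# Tiny GSM8K-style items: (question, gold answer).
problems = [
    ("If you have 40 apples and buy 2 more, how many do you have?", "42"),
    ("What is 3 + 4?", "7"),
]

agents = {"CoT": cot_agent, "ReAct": react_agent}

for name, agent in agents.items():
    correct = sum(agent(q).strip() == gold for q, gold in problems)
    accuracy = correct / len(problems)
    print(f"{name}: {accuracy:.0%} accuracy on {len(problems)} problems")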

No commits in the last 6 months.

Use this if you need to rigorously compare how different language agent algorithms and LLMs perform on common reasoning and mathematical tasks.

Not ideal if you are a general user looking for a ready-to-use application rather than a tool for agent research and evaluation.

AI-research language-agent-evaluation LLM-benchmarking natural-language-processing algorithm-comparison
No License · Stale (6m) · No Package · No Dependents
Maintenance 2 / 25
Adoption 7 / 25
Maturity 8 / 25
Community 6 / 25

Stars: 34
Forks: 2
Language: Python
License: None
Last pushed: Jun 25, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/om-ai-lab/open-agent-leaderboard"

Open to everyone: 100 requests/day with no API key. A free key raises the limit to 1,000 requests/day.
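The same data can also be fetched from Python. The sketch below assumes only the URL shown in the curl command above and that the endpoint returns JSON; the response schema is not documented on this page, so it simply pretty-prints whatever comes back.

import json
import requests

# Quality-score endpoint for this repository (URL taken from the curl example above).
URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/om-ai-lab/open-agent-leaderboard"

def fetch_quality_report(url: str = URL) -> dict:
    """Fetch the quality report; assumes the endpoint returns JSON."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    report = fetch_quality_report()
    # Schema is undocumented here, so just pretty-print the full response.
    print(json.dumps(report, indent=2))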