om-ai-lab/open-agent-leaderboard

Reproducible Language Agent Research

Overall score: 23 / 100
Experimental

This project helps AI researchers and developers compare the performance of language agents across benchmarks and large language models (LLMs). Given an agent algorithm (such as Chain-of-Thought or ReAct) and an LLM, it produces a comparable performance score on datasets such as GSM8K or MATH-500. It is aimed at anyone developing, evaluating, or selecting language agents for problem-solving tasks.
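For illustration, the sketch below shows the kind of agent-versus-agent comparison such a leaderboard reports. It is a generic, hypothetical example: the agent functions, dataset items, and scoring are placeholders and are not this project's code or API.

# Hypothetical sketch of a leaderboard-style comparison: score several
# agent algorithms against a small set of exact-answer problems.
# None of these names come from the project; they are illustrative only.

def cot_agent(question: str) -> str:
    # Placeholder "Chain-of-Thought" agent: a real agent would call an LLM.
    return "42"

def react_agent(question: str) -> str:
    # Placeholder "ReAct" agent.
    return "7"

# Tiny GSM8K-style items: (question, gold answer).
problems = [
    ("If you have 40 apples and buy 2 more, how many do you have?", "42"),
    ("What is 3 + 4?", "7"),
]

agents = {"CoT": cot_agent, "ReAct": react_agent}

for name, agent in agents.items():
    correct = sum(agent(q).strip() == gold for q, gold in problems)
    accuracy = correct / len(problems)
    print(f"{name}: {accuracy:.0%} accuracy on {len(problems)} problems")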

No commits in the last 6 months.

Use this if you need to rigorously compare how different language agent algorithms and LLMs perform on common reasoning and mathematical tasks.

Not ideal if you are a general user looking for a ready-to-use application rather than a tool for agent research and evaluation.

AI-research language-agent-evaluation LLM-benchmarking natural-language-processing algorithm-comparison
No License · Stale (6m) · No Package · No Dependents
Maintenance 2 / 25
Adoption 7 / 25
Maturity 8 / 25
Community 6 / 25

Stars: 34
Forks: 2
Language: Python
License: None
Last pushed: Jun 25, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/om-ai-lab/open-agent-leaderboard"

Open to everyone: 100 requests/day with no API key. A free key raises the limit to 1,000 requests/day.
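The same data can also be fetched from Python. The sketch below assumes only the URL shown in the curl command above and that the endpoint returns JSON; the response schema is not documented on this page, so it simply pretty-prints whatever comes back.

import json
import requests

# Quality-score endpoint for this repository (URL taken from the curl example above).
URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/om-ai-lab/open-agent-leaderboard"

def fetch_quality_report(url: str = URL) -> dict:
    """Fetch the quality report; assumes the endpoint returns JSON."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    report = fetch_quality_report()
    # Schema is undocumented here, so just pretty-print the full response.
    print(json.dumps(report, indent=2))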