lemon07r/SanityBoard
Home of the SanityHarness Leaderboard website.
This tool provides a centralized hub to track and compare the performance of AI coding agents. It takes in structured evaluation data from agent runs, such as scores and pass rates for individual coding tasks, and displays them on a browsable leaderboard. This is useful for researchers, developers, or evaluators who need to assess and benchmark AI coding agents.
Use this if you need a clear, interactive way to visualize and compare the evaluation results of various AI coding agents.
Not ideal if you're looking for a tool to run the AI agent evaluations themselves, as this focuses solely on displaying pre-existing results.
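The exact input format SanityBoard expects is not documented on this page; as a rough illustration only, a per-task evaluation record from one agent run might look something like the sketch below (all field names are assumptions, not SanityBoard's actual schema).

# Python sketch: hypothetical shape of one agent-run evaluation record.
# Every field name here is an illustrative assumption, not SanityBoard's real format.
example_run = {
    "agent": "example-coding-agent-v1",   # assumed agent identifier
    "task_id": "task-042",                # assumed coding-task identifier
    "passed": True,                       # whether the task's checks passed
    "score": 0.87,                        # assumed normalized score in [0, 1]
}

# A leaderboard entry would aggregate many such records, e.g. an overall pass rate:
runs = [example_run]
pass_rate = sum(r["passed"] for r in runs) / len(runs)
print(f"pass rate: {pass_rate:.0%}")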
Stars: 14
Forks: —
Language: HTML
License: —
Category: —
Last pushed: Feb 28, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/lemon07r/SanityBoard"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
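For programmatic use beyond curl, the same endpoint can be queried from a script. A minimal Python sketch using the requests library, assuming the endpoint returns JSON (the response fields and the header used for keyed access are not documented here, so no key is sent):

import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/lemon07r/SanityBoard"

# Plain unauthenticated GET; the page states no key is needed for up to 100 requests/day.
response = requests.get(URL, timeout=10)
response.raise_for_status()

# The response is assumed to be JSON; print it as returned.
data = response.json()
print(data)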
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems