TheDuckAI/arb
Advanced Reasoning Benchmark Dataset for LLMs
This project lets AI researchers and developers rigorously test how well large language models (LLMs) understand complex text and apply expert knowledge. Given an LLM, it evaluates the model's ability to answer challenging questions across scientific and legal domains and reports a performance score indicating how capable the model is at advanced reasoning tasks.
No commits in the last 6 months.
Use this if you are developing or evaluating large language models and need a robust way to benchmark their deep comprehension and expert reasoning capabilities across various challenging subjects.
Not ideal if you are looking for a tool to train an LLM or to use one for general knowledge retrieval; its primary purpose is rigorous evaluation.
Stars: 47
Forks: 3
Language: TypeScript
License: MIT
Category:
Last pushed: Nov 19, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/TheDuckAI/arb"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000 requests/day.
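A minimal sketch of calling the same endpoint from TypeScript, using the global fetch available in Node 18+ and modern browsers. The RepoQuality interface and its fields are assumptions for illustration; the actual response schema is not documented on this page.

// Sketch: query the pt-edge quality API for a repo.
// RepoQuality is an assumed shape, not the documented schema.
interface RepoQuality {
  stars?: number;
  forks?: number;
  language?: string;
  license?: string;
  lastPushed?: string;
}

async function fetchRepoQuality(repo: string): Promise<RepoQuality> {
  const url = `https://pt-edge.onrender.com/api/v1/quality/llm-tools/${repo}`;
  const res = await fetch(url); // no API key needed up to 100 requests/day
  if (!res.ok) {
    throw new Error(`Request failed: ${res.status} ${res.statusText}`);
  }
  return (await res.json()) as RepoQuality;
}

// Usage: fetchRepoQuality("TheDuckAI/arb").then(console.log);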
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems