AgentBench and LawBench

These two tools are complements within the domain-specific benchmarks category. LawBench evaluates LLMs specifically on their legal knowledge, while AgentBench provides a broader framework for evaluating LLMs as agents across various domains. LawBench could therefore serve as the legal-domain test suite within an AgentBench-style evaluation.

AgentBench — Score 55 (Established)
  Maintenance 10/25 · Adoption 10/25 · Maturity 16/25 · Community 19/25
  Stars: 3,234 · Forks: 241 · Downloads: · Commits (30d): 0
  Language: Python · License: Apache-2.0
  Flags: No Package, No Dependents

LawBench — Score 48 (Emerging)
  Maintenance 0/25 · Adoption 10/25 · Maturity 16/25 · Community 22/25
  Stars: 406 · Forks: 70 · Downloads: · Commits (30d): 0
  Language: Python · License: Apache-2.0
  Flags: Stale 6m, No Package, No Dependents

About AgentBench

THUDM/AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

This project helps developers and researchers evaluate how well large language models (LLMs) can act as autonomous 'agents' in various real-world scenarios. It takes an LLM as input and runs it through a standardized set of tasks, like interacting with an operating system, using a database, or shopping online. The output is a performance score, showing how effectively the LLM completes these multi-step, interactive tasks.
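The evaluation loop described above can be sketched in miniature. This is a hypothetical illustration of the pattern (an agent acting step by step in a task environment until success or a step limit), not AgentBench's actual API; all names here are invented for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Task:
    """One interactive task: an environment the agent acts in, plus a goal check."""
    name: str
    max_steps: int
    step: Callable[[str], str]       # environment: agent action -> new observation
    is_done: Callable[[str], bool]   # does the observation satisfy the task goal?

def evaluate_agent(agent: Callable[[str], str], tasks: List[Task]) -> float:
    """Run the agent on each multi-step task; return the fraction completed."""
    successes = 0
    for task in tasks:
        observation = "start"
        for _ in range(task.max_steps):
            action = agent(observation)          # LLM proposes the next action
            observation = task.step(action)      # environment responds
            if task.is_done(observation):
                successes += 1
                break
    return successes / len(tasks)
```

The real benchmark's environments (OS shell, database, web shopping) are far richer, but the shape is the same: the score is an aggregate over many such interactive episodes.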

Tags: LLM evaluation, agentic AI, AI research, model benchmarking, autonomous systems

About LawBench

open-compass/LawBench

Benchmarking Legal Knowledge of Large Language Models

This tool helps legal professionals, researchers, or anyone evaluating AI understand how well large language models (LLMs) perform on various legal tasks specific to the Chinese legal system. It takes an LLM's responses to legal queries and scenarios, then outputs a detailed assessment of its legal knowledge, comprehension, and application abilities. Legal domain experts can use this to gauge an AI's readiness for real-world legal applications.
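The assessment described above boils down to aggregating per-question correctness into per-ability scores (knowledge, comprehension, application). A hypothetical sketch of that aggregation, with invented names rather than LawBench's actual interface:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def score_by_ability(results: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate graded answers into per-ability accuracy.

    `results` is a list of (ability, correct) pairs, e.g.
    [("knowledge", True), ("comprehension", False), ("application", True)].
    Returns a mapping from ability to accuracy in [0, 1].
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for ability, correct in results:
        totals[ability] += 1
        hits[ability] += int(correct)
    return {ability: hits[ability] / totals[ability] for ability in totals}
```

A breakdown like this is what lets a reviewer see, for instance, that a model recalls statutes well but applies them poorly.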

Tags: legal-tech, AI-evaluation, Chinese-law, legal-research, compliance-automation

Scores updated daily from GitHub, PyPI, and npm data.