AgentBench and LawBench
These two tools complement each other within the domain-specific benchmarks category: LawBench evaluates LLMs specifically on legal knowledge, while AgentBench provides a broader framework for evaluating LLMs as agents across a variety of domains. In principle, LawBench could serve as a legal-domain test suite within the AgentBench framework.
About AgentBench
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
This project helps developers and researchers evaluate how well large language models (LLMs) can act as autonomous agents in realistic scenarios. It takes an LLM as input and runs it through a standardized set of interactive tasks, such as operating a shell, querying a database, or shopping online. The output is a performance score showing how effectively the LLM completes these multi-step, interactive tasks.
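To make the "LLM as agent" idea concrete, here is a minimal sketch of an agent-environment evaluation loop of the kind such a benchmark runs. It is illustrative only and does not use AgentBench's actual API; `Environment`, `ask_llm`, and `run_episode` are hypothetical names standing in for a task harness and the model under test.

```python
# Illustrative sketch (not AgentBench's real API): a toy multi-step task where
# the agent observes state, chooses an action, and is scored on completion.
from dataclasses import dataclass


@dataclass
class Environment:
    """Hypothetical task: the agent must raise `value` to `target`."""
    target: int = 3
    value: int = 0

    def observe(self) -> str:
        # Observation text the model would see at each step.
        return f"value={self.value}, target={self.target}"

    def act(self, action: str) -> bool:
        # Apply the model's action; return True once the task is solved.
        if action == "increment":
            self.value += 1
        return self.value == self.target


def ask_llm(observation: str) -> str:
    """Placeholder for a call to the model under evaluation."""
    return "increment"


def run_episode(env: Environment, max_steps: int = 10) -> float:
    """Score one interactive episode: 1.0 if solved within the step budget."""
    for _ in range(max_steps):
        if env.act(ask_llm(env.observe())):
            return 1.0
    return 0.0


if __name__ == "__main__":
    print(f"episode score: {run_episode(Environment())}")
```

The real benchmark replaces the toy environment with full task suites (OS, database, web shopping, and so on) and averages episode scores per environment, but the observe-act-score loop above captures the shape of the evaluation.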
About LawBench
open-compass/LawBench
Benchmarking Legal Knowledge of Large Language Models
This tool helps legal professionals, researchers, and anyone evaluating AI understand how well large language models (LLMs) perform on legal tasks specific to the Chinese legal system. It takes an LLM's responses to legal queries and scenarios and outputs a detailed assessment of the model's legal knowledge, comprehension, and application abilities. Legal domain experts can use this to gauge an AI's readiness for real-world legal applications.
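The following sketch shows what aggregating such an assessment might look like: per-task exact-match accuracy computed from model answers and gold answers. It is a simplified illustration, not LawBench's actual scoring code, and the task names in the sample data are hypothetical.

```python
# Illustrative sketch (not LawBench's real scorer): aggregates per-task
# exact-match accuracy from (task, model_answer, gold_answer) records.
from collections import defaultdict


def score_responses(records):
    """Return a dict mapping each task name to its exact-match accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, answer, gold in records:
        total[task] += 1
        if answer.strip() == gold.strip():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


if __name__ == "__main__":
    # Hypothetical sample records for two legal tasks.
    sample = [
        ("statute_prediction", "Article 264", "Article 264"),
        ("statute_prediction", "Article 232", "Article 234"),
        ("legal_qa", "Yes", "Yes"),
    ]
    for task, acc in score_responses(sample).items():
        print(f"{task}: {acc:.2f}")
```

In practice the benchmark also uses softer metrics (for example ROUGE-style overlap for generative tasks), but the per-task aggregation pattern is the same.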