parameterlab/MASEval
Multi-Agent LLM Evaluation
MASEval is for AI researchers and developers who need to compare how well different multi-agent LLM systems perform. It takes existing agent implementations (from frameworks such as AutoGen or LangChain) and runs them through standard benchmarks or your own custom evaluation tasks, and the resulting scores show which agent architectures and configurations are most effective for specific challenges.
Used by 1 other package. Available on PyPI.
Use this if you need to objectively evaluate and compare the performance of various multi-agent LLM systems or individual agents using standardized benchmarks.
Not ideal if you're looking for a tool that helps you build or design multi-agent systems, define communication protocols, or turn LLMs into agents.
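To make the workflow concrete, here is a self-contained Python sketch of benchmark-driven agent comparison. It does not use MASEval's actual API; the Task, exact_match, and evaluate names below are illustrative only. It just shows the shape of the job: run each agent configuration over the same task set and compare aggregate scores.

from dataclasses import dataclass
from typing import Callable

# Illustrative only; MASEval's real interfaces may differ.
@dataclass
class Task:
    prompt: str
    expected: str  # reference answer used for scoring

def exact_match(answer: str, expected: str) -> float:
    # Toy metric: 1.0 if the answer matches the reference, else 0.0.
    return float(answer.strip().lower() == expected.strip().lower())

def evaluate(agent: Callable[[str], str], tasks: list[Task]) -> float:
    # Run one agent (or multi-agent pipeline) over every task and average the scores.
    scores = [exact_match(agent(t.prompt), t.expected) for t in tasks]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    tasks = [
        Task("What is 2 + 2?", "4"),
        Task("Name the capital of France.", "Paris"),
    ]
    # Stand-ins for two multi-agent configurations you want to compare.
    agent_a = lambda prompt: "4" if "2 + 2" in prompt else "Paris"
    agent_b = lambda prompt: "I don't know"
    print("config A:", evaluate(agent_a, tasks))
    print("config B:", evaluate(agent_b, tasks))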
Stars: 18
Forks: 7
Language: Python
License: MIT
Category:
Last pushed: Mar 12, 2026
Commits (30d): 0
Dependencies: 4
Reverse dependents: 1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/parameterlab/MASEval"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
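The same endpoint can be queried from Python with the requests library; the response is assumed to be JSON, so the example simply prints whatever the API returns.

import requests

# Same endpoint as the curl command above; no API key is needed for the
# free tier (up to 100 requests/day).
URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/parameterlab/MASEval"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()
print(resp.json())  # assumed JSON payload; inspect the keys you need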
Related tools
EvolvingLMMs-Lab/lmms-eval: One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
vibrantlabsai/ragas: Supercharge Your LLM Application Evaluations 🚀
open-compass/VLMEvalKit: Open-source evaluation toolkit of large multi-modality models (LMMs), support 220+ LMMs, 80+ benchmarks
EuroEval/EuroEval: The robust European language model benchmark.
Giskard-AI/giskard-oss: 🐢 Open-Source Evaluation & Testing library for LLM Agents