modelscope/evalscope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
This tool helps AI model developers and researchers objectively assess how well large language models (LLMs), vision-language models (VLMs), and other generative AI models perform. You point it at one or more models and evaluation datasets, and it produces detailed comparison reports and performance metrics, including stress-test results and interactive visualizations, so you can see each model's strengths and weaknesses across tasks and benchmarks.
2,501 stars. Used by 1 other package. Actively maintained with 34 commits in the last 30 days. Available on PyPI.
Use this if you need to thoroughly benchmark and compare multiple large AI models (LLMs, VLMs, AIGC) against standard datasets or custom criteria to determine their effectiveness for specific applications.
Not ideal if you are a casual user looking for a simple API to integrate a pre-trained model without needing deep performance analysis or custom evaluation.
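For a quick sense of the workflow, here is a minimal Python sketch of the TaskConfig/run_task entry point described in the project's README; the model ID, dataset names, and the limit value are illustrative placeholders, and parameter names may differ between releases.

from evalscope import TaskConfig, run_task

# Evaluate one chat model on two standard benchmarks.
# The model ID and dataset names below are placeholders, not recommendations.
task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',
    datasets=['gsm8k', 'arc'],
    limit=10,  # only score the first 10 samples per dataset (smoke test)
)

run_task(task_cfg=task_cfg)  # runs the evaluation and produces the reports and metrics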
Stars: 2,501
Forks: 285
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 11, 2026
Commits (30d): 34
Dependencies: 38
Reverse dependents: 1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/modelscope/evalscope"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
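If you would rather pull the data from a script than from curl, here is a minimal Python sketch; it assumes the endpoint returns JSON and that the keyless tier needs no extra headers (the header name for an API key is not documented here, so authenticated use is omitted).

import requests  # third-party HTTP client: pip install requests

URL = "https://pt-edge.onrender.com/api/v1/quality/rag/modelscope/evalscope"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()   # surfaces HTTP errors, e.g. hitting the 100 requests/day limit
data = resp.json()        # assumption: the endpoint responds with a JSON document
print(data)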
Related tools
izam-mohammed/ragrank
🎯 Your free LLM evaluation toolkit helps you assess the accuracy of facts, how well it...
Kareem-Rashed/rubric-eval
Independent framework to test, benchmark, and evaluate LLMs & AI agents locally.
justplus/llm-eval
An evaluation platform for large language models, supporting multiple evaluation benchmarks, custom datasets, and performance testing. Also supports RAG evaluation on custom datasets.
relari-ai/continuous-eval
Data-Driven Evaluation for LLM-Powered Applications
cleanlab/tlm
Score the trustworthiness of outputs from any LLM in real-time