evalscope and llm-eval-bench

These projects are competitors: both provide evaluation frameworks for LLMs and RAG systems, so they are alternative choices rather than tools designed to work together. evalscope offers broader coverage (LLM, VLM, and AIGC evaluation), while llm-eval-bench focuses specifically on prompts and structured outputs.

| Metric | evalscope | llm-eval-bench |
|---|---|---|
| Overall score | 77 (Verified) | 22 (Experimental) |
| Maintenance | 20/25 | 13/25 |
| Adoption | 11/25 | 0/25 |
| Maturity | 25/25 | 9/25 |
| Community | 21/25 | 0/25 |
| Stars | 2,501 | |
| Forks | 285 | |
| Downloads | | |
| Commits (30d) | 34 | 0 |
| Language | Python | Python |
| License | Apache-2.0 | MIT |
| Risk flags | None | No package, no dependents |

About evalscope

modelscope/evalscope

A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.

This tool helps AI model developers and researchers objectively assess how well large language models (LLMs), vision-language models (VLMs), and other generative AI models perform. You supply the models and datasets, and it produces detailed comparison reports and performance metrics, including stress-test results and interactive visualizations, so you can see a model's strengths and weaknesses across different tasks and benchmarks.

Tags: AI model benchmarking, Generative AI evaluation, Large model comparison, AI performance testing, Model quality assurance
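As a concrete illustration of the workflow described above, the sketch below runs a small benchmark through evalscope's Python entry point. It assumes the `TaskConfig`/`run_task` API and the `gsm8k` dataset name as shown in the project's documentation; exact names and options may differ across versions, and the model identifier is only an example.

```python
# Minimal evalscope run: evaluate one model on one benchmark subset.
# Assumes evalscope exposes TaskConfig and run_task (as in its README);
# adjust names and options to match the installed version.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # example model ID; swap in your own
    datasets=["gsm8k"],                  # benchmark dataset(s) to evaluate on
    limit=20,                            # cap samples per dataset for a quick smoke test
)

run_task(task_cfg=task_cfg)  # writes reports and metrics to the default output directory
```

The project also documents an equivalent CLI (`evalscope eval ...`) that covers the same evaluate-and-report flow.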

About llm-eval-bench

piog/llm-eval-bench

Evaluation harness for prompts, structured outputs, and RAG workflows
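llm-eval-bench's own API is not shown here. Purely to illustrate what a structured-output check in such a harness typically involves, the sketch below validates a model's JSON reply against an expected schema using the standard `jsonschema` package; all names are hypothetical and unrelated to llm-eval-bench's actual interfaces.

```python
# Illustrative structured-output check (hypothetical, not llm-eval-bench's API):
# parse the model's raw reply as JSON and validate it against an expected schema.
import json
from jsonschema import validate, ValidationError

ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["answer", "confidence"],
}

def check_structured_output(raw_reply: str) -> bool:
    """Return True if the reply is valid JSON that matches ANSWER_SCHEMA."""
    try:
        payload = json.loads(raw_reply)
        validate(instance=payload, schema=ANSWER_SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

# Example: a passing and a failing reply.
print(check_structured_output('{"answer": "42", "confidence": 0.9}'))  # True
print(check_structured_output('{"answer": "42"}'))                     # False (missing confidence)
```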

Scores are updated daily from GitHub, PyPI, and npm data.