evalscope and llm-eval-bench

These projects are competitors: both provide evaluation frameworks for LLMs and RAG systems, so they are alternative choices rather than tools designed to work together. evalscope offers broader coverage (LLM, VLM, and AIGC evaluation), while llm-eval-bench focuses specifically on prompts and structured outputs.

| Metric | evalscope | llm-eval-bench |
|---|---|---|
| Overall score | 77 (Verified) | 22 (Experimental) |
| Maintenance | 20/25 | 13/25 |
| Adoption | 11/25 | 0/25 |
| Maturity | 25/25 | 9/25 |
| Community | 21/25 | 0/25 |
| Stars | 2,501 | |
| Forks | 285 | |
| Downloads | | |
| Commits (30d) | 34 | 0 |
| Language | Python | Python |
| License | Apache-2.0 | MIT |
| Risk flags | None | No package, no dependents |

About evalscope

modelscope/evalscope

A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.

This tool helps AI model developers and researchers objectively assess how well large language models (LLMs), vision-language models (VLMs), and other generative AI models perform. You supply the models and datasets, and it produces detailed comparison reports and performance metrics, including stress-test results and interactive visualizations, so you can see a model's strengths and weaknesses across different tasks and benchmarks.

Tags: AI model benchmarking, Generative AI evaluation, Large model comparison, AI performance testing, Model quality assurance
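As a concrete illustration of the workflow described above, the sketch below runs a small benchmark through evalscope's Python entry point. It assumes the `TaskConfig`/`run_task` API and the `gsm8k` dataset name as shown in the project's documentation; exact names and options may differ across versions, and the model identifier is only an example.

```python
# Minimal evalscope run: evaluate one model on one benchmark subset.
# Assumes evalscope exposes TaskConfig and run_task (as in its README);
# adjust names and options to match the installed version.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # example model ID; swap in your own
    datasets=["gsm8k"],                  # benchmark dataset(s) to evaluate on
    limit=20,                            # cap samples per dataset for a quick smoke test
)

run_task(task_cfg=task_cfg)  # writes reports and metrics to the default output directory
```

The project also documents an equivalent CLI (`evalscope eval ...`) that covers the same evaluate-and-report flow.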

About llm-eval-bench

piog/llm-eval-bench

Evaluation harness for prompts, structured outputs, and RAG workflows
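llm-eval-bench's own API is not shown here. Purely to illustrate what a structured-output check in such a harness typically involves, the sketch below validates a model's JSON reply against an expected schema using the standard `jsonschema` package; all names are hypothetical and unrelated to llm-eval-bench's actual interfaces.

```python
# Illustrative structured-output check (hypothetical, not llm-eval-bench's API):
# parse the model's raw reply as JSON and validate it against an expected schema.
import json
from jsonschema import validate, ValidationError

ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["answer", "confidence"],
}

def check_structured_output(raw_reply: str) -> bool:
    """Return True if the reply is valid JSON that matches ANSWER_SCHEMA."""
    try:
        payload = json.loads(raw_reply)
        validate(instance=payload, schema=ANSWER_SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

# Example: a passing and a failing reply.
print(check_structured_output('{"answer": "42", "confidence": 0.9}'))  # True
print(check_structured_output('{"answer": "42"}'))                     # False (missing confidence)
```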

Scores are updated daily from GitHub, PyPI, and npm data.