evalscope and continuous-eval

These tools are complementary: evalscope provides a broad evaluation framework covering multiple model types (LLMs, VLMs, AIGC), while continuous-eval focuses on data-driven, production-oriented evaluation metrics for LLM-powered applications. Teams can use both, applying each at a different stage of the evaluation workflow.

Metric           evalscope        continuous-eval
Overall score    77 (Verified)    41 (Emerging)
Maintenance      20/25            0/25
Adoption         11/25            10/25
Maturity         25/25            16/25
Community        21/25            15/25
Stars            2,501            516
Forks            285              37
Downloads        -                -
Commits (30d)    34               0
Language         Python           Python
License          Apache-2.0       Apache-2.0
Risk flags       None             Stale 6m, No package, No dependents

About evalscope

modelscope/evalscope

A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.

This tool helps AI model developers and researchers objectively assess how well large language models (LLMs), vision-language models (VLMs), and other generative AI models perform. You provide models and datasets, and it generates detailed comparison reports and performance metrics, including stress test results and interactive visualizations. The reports highlight each model's strengths and weaknesses across different tasks and benchmarks.

Tags: AI model benchmarking, Generative AI evaluation, Large model comparison, AI performance testing, Model quality assurance
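
To make the workflow concrete, here is a minimal sketch of running an evalscope benchmark from Python. It follows the pattern shown in the project's documentation, but the exact import path, model identifier, and dataset names here are assumptions and may differ between releases; treat it as illustrative rather than definitive.

    # Minimal evalscope sketch: evaluate one model on a couple of benchmark
    # datasets. Import paths and argument names follow the project's documented
    # examples but may vary between versions.
    from evalscope import TaskConfig, run_task  # assumed top-level exports

    task_cfg = TaskConfig(
        model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model identifier
        datasets=["gsm8k", "arc"],           # placeholder benchmark names
        limit=5,                             # evaluate only a few samples per dataset
    )

    # Runs the evaluation and writes reports and metrics to the output directory.
    run_task(task_cfg=task_cfg)

The project also documents a command-line entry point (evalscope eval ...) for driving the same kind of run, which is convenient for scripted or CI-style benchmarking.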

About continuous-eval

relari-ai/continuous-eval

Data-Driven Evaluation for LLM-Powered Applications

This tool helps AI engineers and MLOps professionals rigorously test and refine their Large Language Model (LLM) applications. It takes in datasets of questions, retrieved contexts, and generated answers, then outputs comprehensive performance metrics. You'd use this to understand how well your LLM application is performing across different stages, like retrieval or generation, and identify areas for improvement.

Tags: LLM-development, AI-evaluation, MLOps, RAG-systems, AI-testing
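
As an illustration of the data-driven workflow described above, the sketch below scores a single retrieval record with continuous-eval. The module path, metric class, and field names follow the project's documented examples, but they may differ across versions, so treat them as assumptions.

    # Minimal continuous-eval sketch: compute retrieval precision/recall/F1 for
    # one question/context/answer record. Module paths and field names follow
    # the project's documented examples but may vary between releases.
    from continuous_eval.metrics.retrieval import PrecisionRecallF1

    datum = {
        "question": "What is the capital of France?",
        "retrieved_context": [
            "Paris is the capital of France and its largest city.",
            "Lyon is a major city in France.",
        ],
        "ground_truth_context": ["Paris is the capital of France."],
        "answer": "Paris",
        "ground_truths": ["Paris"],
    }

    metric = PrecisionRecallF1()
    print(metric(**datum))  # dict of precision/recall/F1 scores for the retrieved contexts

In practice you would run this over a full dataset of records and aggregate the metrics per pipeline stage (retrieval, generation) to see where the application is underperforming.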

Scores updated daily from GitHub, PyPI, and npm data.