evalscope and ragrank

These tools are complementary: evalscope provides a general-purpose LLM evaluation framework, while ragrank specializes in RAG-specific metrics (factual accuracy, context understanding, tone), so the two can be used together for comprehensive RAG system evaluation.

                   evalscope           ragrank
Score              77 (Verified)       52 (Established)
Maintenance        20/25               10/25
Adoption           11/25               8/25
Maturity           25/25               16/25
Community          21/25               18/25
Stars              2,501               45
Forks              285                 14
Downloads
Commits (30d)      34                  0
Language           Python              Python
License            Apache-2.0          Apache-2.0
Risk flags         None                No Package, No Dependents

About evalscope

modelscope/evalscope

A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.

This tool helps AI model developers and researchers objectively assess how well large language models (LLMs), vision-language models (VLMs), and other generative AI models perform. You provide models and datasets, and it generates detailed comparison reports and performance metrics, including stress-test results and interactive visualizations, so you can understand a model's strengths and weaknesses across different tasks and benchmarks.
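
For example, a typical run points the framework at a model and one or more benchmark datasets. A minimal sketch, assuming evalscope's documented TaskConfig/run_task Python entry points (exact argument names and dataset identifiers may vary by version):

    # Minimal evalscope benchmark run (sketch; API names assumed from the project docs).
    from evalscope import TaskConfig, run_task

    task_cfg = TaskConfig(
        model="Qwen/Qwen2.5-0.5B-Instruct",  # example model ID to evaluate
        datasets=["gsm8k", "arc"],           # benchmark datasets to run
        limit=10,                            # small sample count for a quick smoke test
    )

    run_task(task_cfg=task_cfg)  # produces a report with per-benchmark scores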

AI model benchmarking · Generative AI evaluation · Large model comparison · AI performance testing · Model quality assurance

About ragrank

izam-mohammed/ragrank

🎯 A free LLM evaluation toolkit for assessing factual accuracy, context understanding, tone, and more, so you can see how well your LLM applications perform.

This toolkit helps you assess the performance of your Retrieval-Augmented Generation (RAG) applications. You provide the questions posed to your RAG pipeline, the contexts it retrieves, and the responses it generates, and the toolkit returns metrics for factual accuracy, context understanding, and tone. It is aimed at AI/ML engineers, data scientists, and product managers who build and deploy LLM applications and need to ensure their RAG systems deliver high-quality, reliable outputs.
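
As an illustration, each evaluation item pairs a question with the contexts your retriever returned and the response your system generated. A minimal sketch, assuming the evaluate/from_dict helpers shown in ragrank's quickstart (helper names and defaults may differ by version):

    # Minimal ragrank evaluation (sketch; helper names assumed from the quickstart).
    from ragrank import evaluate
    from ragrank.dataset import from_dict

    data = from_dict({
        "question": "What is the capital of France?",                           # question asked
        "context": ["Paris is the capital and most populous city of France."],  # retrieved contexts
        "response": "The capital of France is Paris.",                          # generated answer
    })

    result = evaluate(data)  # scores the item with the default metrics
    print(result)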

LLM application development · RAG system evaluation · AI model quality assurance · Natural Language Processing · Generative AI

Scores updated daily from GitHub, PyPI, and npm data.