RAG Evaluation Frameworks RAG Tools

Tools and benchmarks for assessing RAG system performance across metrics like retrieval quality, generation accuracy, and end-to-end pipeline evaluation. Does NOT include RAG implementations themselves, embedding model comparisons, or domain-specific applications.

There are 86 RAG evaluation framework tools tracked. Five score above 50 (Established tier). The top-ranked is vectara/open-rag-eval at 53/100 with 347 stars.

Get all 86 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=rag-evaluation-frameworks&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
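For scripted access, the same request can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper names `build_url` and `fetch_projects` are hypothetical, the query parameters are copied from the curl command above, and the response is assumed to be JSON whose schema is not documented on this page.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken verbatim from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="rag", subcategory="rag-evaluation-frameworks", limit=20):
    """Build the same query string used in the curl command (hypothetical helper)."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """GET the URL and decode the body, assuming the endpoint returns JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_projects(build_url())` performs the network request and counts against the daily limit. The decoded value is returned as-is because the response fields are not described here.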

#	Tool	Score	Tier	Stars	Language
1	vectara/open-rag-eval RAG evaluation without the need for "golden answers"	53	Established	347	Python
2	DocAILab/XRAG XRAG: eXamining the Core - Benchmarking Foundational Component Modules in...	53	Established	120	Python
3	HZYAI/RagScore ⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in...	53	Established	30	Python
4	AIAnytime/rag-evaluator A library for evaluating Retrieval-Augmented Generation (RAG) systems (The...	52	Established	42	Python
5	microsoft/benchmark-qed Automated benchmarking of Retrieval-Augmented Generation (RAG) systems	52	Established	78	Python
6	2501Pr0ject/RAGnarok-AI Local-first RAG evaluation framework for LLM applications. 100% local, no...	46	Emerging	13	Python
7	nuclia/nuclia-eval Library for evaluating RAG using Nuclia's models	43	Emerging	18	Python
8	TonicAI/tonic_validate Metrics to evaluate the quality of responses of your Retrieval Augmented...	43	Emerging	324	Python
9	syy12335/rag-eval-scaffold Lightweight, decoupled RAG evaluation scaffold (dataset → vector store → RAG...	40	Emerging	17	Python
10	AQ-MedAI/PRGB [AAAI 2026] RAG, Benchmark, robust RAG generation	39	Emerging	34	Python
11	SciPhi-AI/RAG-Performance Measuring RAG solutions throughput and latency	38	Emerging	19	Python
12	avnlp/rag-pipelines Advanced RAG Pipelines and Evaluation	38	Emerging	10	Python
13	vectara/mirage-bench Repository for Multilingual Generation, RAG evaluations, and surrogate...	37	Emerging	10	Python
14	gomate-community/rageval Evaluation tools for Retrieval-augmented Generation (RAG) methods.	36	Emerging	170	Python
15	AQ-MedAI/RagQALeaderboard RAG-QA Leaderboard	36	Emerging	25	Python
16	RulinShao/RAG-evaluation-harnesses An evaluation suite for Retrieval-Augmented Generation (RAG).	35	Emerging	23	Python
17	RUC-NLPIR/OmniEval Open source code of the paper: "OmniEval: An Omnidirectional and Automatic...	35	Emerging	82	Python
18	IAAR-Shanghai/CRUD_RAG CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented...	34	Emerging	362	Python
19	GURPREETKAURJETHRA/RAG-Evaluator A library for evaluating Retrieval-Augmented Generation (RAG) systems	33	Emerging	4	Python
20	TonicAI/tvallogging A tool for evaluating and tracking your RAG experiments. This repo contains...	33	Emerging	8	Python
21	RagView/RagView We believe that every SOTA result is only valid on its own dataset. RAGView...	32	Emerging	79	—
22	gomate-community/rag-bench RAG-Bench is to summarize all datasets used to evaluate RAG, from document...	30	Emerging	2	—
23	sitta07/RAGScope A lightweight observability tool for visualizing and comparing RAG retrieval...	29	Experimental	2	Python
24	amazon-science/MEMERAG MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval...	29	Experimental	4	Python
25	amazon-science/GaRAGe [ACL 2025] GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation.	28	Experimental	12	—
26	antgroup/ravig-bench Official implementation of "RAViG-Bench: A Benchmark for Retrieval-Augmented...	28	Experimental	10	Python
27	christopherkormpos/ragret Lightweight evaluation framework for Retrieval Augmented Generation systems,...	28	Experimental	3	Python
28	foreai-co/fore The fore client package	27	Experimental	13	Python
29	oztrkoguz/RAG-Framework-Evaluation This project aims to compare different Retrieval-Augmented Generation (RAG)...	27	Experimental	14	Python
30	ALucek/custom-rag-evals Applying domain specific evaluations to RAG chunking and embedding functions	26	Experimental	18	Jupyter Notebook
31	chu2bard/ragcraft End-to-end RAG pipeline with built-in evaluation metrics	26	Experimental	11	Python
32	Edouard-Legoupil/rag_extraction A tutorial on how to build Summary Brief from Evaluation Report - Offline+Open Source	25	Experimental	5	HTML
33	tarekmasryo/rag-qa-logs-corpus-data Synthetic multi-table RAG QA telemetry benchmark...	25	Experimental	2	Python
34	Abanoubr/rag-eval-toolkit Open-source Python toolkit for evaluating RAG pipelines. LLM-as-judge for...	25	Experimental	5	Python
35	rodolfboctor/rag-eval-toolkit Open-source Python toolkit for evaluating RAG pipelines. LLM-as-judge for...	25	Experimental	5	Python
36	utkuakbay/RAG_Benchmark Benchmark LLMs for your RAG system - Compare Gemini, GPT, Claude & local...	24	Experimental	4	Python
37	Ziqing110/rag-evidence-attack-lab Scientific QA robustness evaluation pipeline for evidence-missing RAG...	23	Experimental	1	Python
38	Sabyasachig/ragtrace DevTools for RAG pipelines	23	Experimental	1	Python
39	ssisOneTeam/Korean-Embedding-Model-Performance-Benchmark-for-Retriever Korean Sentence Embedding Model Performance Benchmark for RAG	23	Experimental	50	Jupyter Notebook
40	Monke1/ragcraft 📚 Build and evaluate RAG pipelines to ingest, embed, retrieve, and answer...	22	Experimental	—	Python
41	hari-sherith/bayesian-rag-uncertainty RAG system with Bayesian uncertainty quantification using Beta priors and...	22	Experimental	1	Jupyter Notebook
42	nblomerus/rag-bench RAG system for asking questions about AI/ML research papers	22	Experimental	—	Python
43	wigtn/wigtnOCR-v1 A research framework to evaluate how document parsing...	22	Experimental	—	—
44	OpenSymbolicAI/benchmark-py-MultiHopRAG MultiHop-RAG Benchmark using GoalSeeking pattern from opensymbolicai-core	22	Experimental	—	Python
45	Miro96/nova-rag-benchmark Benchmark for Code RAG MCP Servers — measure how well RAG helps AI find the...	22	Experimental	—	Python
46	EmmanuelleB985/mmeval-vrag Evaluation Framework for Multimodal RAG Systems	22	Experimental	—	Python
47	clouatre-labs/rag-reranking-benchmarks Supplementary benchmarks for Making Legacy Knowledge Searchable with RAG	22	Experimental	1	Python
48	Aamirofficiall/rag-playbook Stop guessing which RAG pattern to use. Compare all 8 patterns with real...	22	Experimental	1	Python
49	Eustema-S-p-A/SCARF SCARF (System for Comprehensive Assessment of RAG Frameworks) is a modular...	22	Experimental	7	Python
50	fkapsahili/EntRAG EntRAG - Enterprise RAG Benchmark	22	Experimental	5	Python
51	srivsr/evalkit QA-grade RAG evaluation framework diagnosing retrieval, grounding,...	21	Experimental	—	Python
52	anita-builds/aurora-rag-evaluation Policy-grounded assistant notes: RAG and evaluation approach	21	Experimental	—	—
53	infrixo-systems/rag-evaluation-starter Minimal Python script to evaluate your RAG pipeline against a golden set. No...	21	Experimental	—	Python
54	itamaker/ragcheck Score retrieval runs with Precision@k, Recall@k, HitRate@k, and MRR@k.	21	Experimental	—	Go
55	rajantripathi/soas-rag-evaluation Bilingual retrieval benchmark for culturally grounded QA in English and Uzbek	21	Experimental	—	Python
56	sunilp/enterprise-rag-bench Production RAG patterns for enterprise: chunking strategies, retrieval...	21	Experimental	—	Python
57	amitk741/RAGnarok-AI 🛠️ Evaluate and benchmark your RAG pipelines locally with RAGnarok-AI—no API...	21	Experimental	—	Python
58	SURESHBEEKHANI/LLMops-beginner-to-advanced RAG evaluation suite for AI Engineering Report	21	Experimental	—	Jupyter Notebook
59	antdragiotis/rag-evaluation-framework-II An evaluation example for Retrieval-Augmented Generation (RAG) that provides...	21	Experimental	—	Jupyter Notebook
60	AKIVA-AI/toolkit-rag-quality Deterministic RAG evaluation toolkit -- retrieval metrics (recall,...	20	Experimental	1	Python
61	shaadclt/EvalRAG A comprehensive evaluation toolkit for assessing Retrieval-Augmented...	20	Experimental	4	Python
62	NamaWho/pyterrier-nuggetizer Nuggetizer: A PyTerrier Open-Source Framework for Evaluating...	20	Experimental	2	Python
63	tsdata/ranx-k Korean-optimized RAG evaluation toolkit with Kiwi tokenizer, ROUGE metrics, ...	19	Experimental	2	Python
64	dbhavery/ragtest RAG evaluation suite — benchmark retrieval accuracy, generation quality, and...	19	Experimental	—	Python
65	JhaAyush01/SEMALEX A comprehensive RAG Evaluation Metric designed to measure the weighted...	18	Experimental	2	Python
66	Mizokuiam/rag-eval-kit A lightweight, modular Python toolkit for evaluating and benchmarking...	18	Experimental	2	Python
67	nidhip1611/GroundedGeo A Benchmark for Citation-Grounded Geographic QA	17	Experimental	—	TeX
68	daniel-e-alarcon/rag-explorer Local-first RAG application with retrieval evaluation (hit@k, MRR) and...	17	Experimental	—	Python
69	iom/evaluation_knowledge A module to turn Evaluation Reports into AI knowledge	16	Experimental	—	HTML
70	yashk1103/Enhanced-Multi-Turn-RAG-Benchmark-Framework Comprehensive benchmarking framework for evaluating 13+ embedding models on...	15	Experimental	—	Python
71	c21051997/ragscope 🏆 An open-source library for the comprehensive, end-to-end evaluation of RAG...	15	Experimental	2	Python
72	rubsj/ai-rag-evaluation-framework RAG pipeline evaluation framework with RAGAS metrics and statistical bias correction	14	Experimental	—	Python
73	Hyeongseob91/research-vlm-based-document-parsing A research framework to evaluate how document parsing...	14	Experimental	—	Python
74	alp-oz/rag-metrics RAG-Metrics: A modular framework for evaluating Retrieval-Augmented...	13	Experimental	—	Python
75	hideyuki001/research-rag-instruction-pack Research & Education oriented LangChain RAG framework (5P Principles + EUQS...	13	Experimental	—	Python
76	chandana999/retrieval-evaluation-api RAG retrieval evaluation tool with RAGAS. Compare 6 retriever strategies...	13	Experimental	—	Jupyter Notebook
77	sumit9000/Deep-Evaluation_Rag The Deep Evaluation notebook helps you understand how well your machine...	13	Experimental	—	Jupyter Notebook
78	labofone/rag-eval Reference-free evaluation of Retrieval-Augmented Generation (RAG) pipelines.	13	Experimental	—	Python
79	beingdutta/Self-Refining-Lecture-RAG-For-Educational-Videos Lecture-RAG is a grounding-aware Video-RAG framework that reduces...	13	Experimental	—	Jupyter Notebook
80	ash-hun/BERGEN-UP E2E Evaluation Pipeline for ONLY RAG. Benchmark to BERGEN from NAVER Labs...	12	Experimental	1	Python
81	i-partalas/industrial-rag-qna-benchmark Benchmarking the performance of proprietary vs open-source LLMs in...	12	Experimental	1	Python
82	JhaAyush01/RAG-Evaluation Different approaches to evaluate RAG !!!	11	Experimental	—	Jupyter Notebook
83	Faridghr/Evaluate-RAG-Pipeline Calculating and reporting the performance metrics of RAG pipeline	11	Experimental	—	Jupyter Notebook
84	amruta33/Rag_evaluation Rag Evaluation using ragas	11	Experimental	—	Jupyter Notebook
85	heisguyy/RAG-Benchmark This is a very simplified version of Benchmarking Large Language Models in...	10	Experimental	2	Python
86	a-romero/qevals Evaluation Framework for GenAI applications	10	Experimental	2	Python
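The table above is tab-separated, so it can be loaded into records for counting or filtering. The following is a minimal sketch under stated assumptions: each row has exactly six tab-separated fields in the header's order, the repository name is the first space-delimited token of the Tool column, and "—" marks a missing value. The names `parse_row` and `SAMPLE_ROWS` are hypothetical, and the sample uses three rows copied from the table.

```python
from collections import Counter

# Three rows copied from the table above (fields separated by tabs).
SAMPLE_ROWS = """\
1\tvectara/open-rag-eval RAG evaluation without the need for "golden answers"\t53\tEstablished\t347\tPython
6\t2501Pr0ject/RAGnarok-AI Local-first RAG evaluation framework for LLM applications. 100% local, no...\t46\tEmerging\t13\tPython
54\titamaker/ragcheck Score retrieval runs with Precision@k, Recall@k, HitRate@k, and MRR@k.\t21\tExperimental\t—\tGo
"""

MISSING = "—"  # placeholder the table uses for an absent value


def parse_row(line):
    """Split one table row into a dict (assumes six tab-separated fields)."""
    rank, tool, score, tier, stars, language = line.split("\t")
    # The Tool column fuses "owner/repo" with its description; split on the first space.
    repo, _, description = tool.partition(" ")
    return {
        "rank": int(rank),
        "repo": repo,
        "description": description,
        "score": int(score),
        "tier": tier,
        "stars": None if stars == MISSING else int(stars),
        "language": None if language == MISSING else language,
    }


rows = [parse_row(line) for line in SAMPLE_ROWS.splitlines() if line.strip()]
tier_counts = Counter(row["tier"] for row in rows)
```

Run over the full table, `tier_counts` gives the number of tools per tier, and a filter such as `[r for r in rows if r["score"] > 50]` selects the tools described above as scoring above 50.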

Comparisons in this category

open-rag-eval and rageval (53 vs 36)
open-rag-eval and rag-evaluator (53 vs 52)
open-rag-eval and RAG-evaluation-harnesses (53 vs 35)
open-rag-eval and RAG-Evaluator (53 vs 33)
XRAG and rageval (53 vs 36)
rag-evaluator and rageval (52 vs 36)
rageval and RAG-evaluation-harnesses (36 vs 35)
rageval and RAG-Evaluator (36 vs 33)
XRAG and RAG-evaluation-harnesses (53 vs 35)
XRAG and RAG-Performance (53 vs 38)