RAG Evaluation Frameworks RAG Tools

Tools and benchmarks for assessing RAG system performance, covering retrieval quality, generation accuracy, and end-to-end pipeline evaluation. This category does not include RAG implementations themselves, embedding model comparisons, or domain-specific applications.

There are 86 RAG evaluation framework tools tracked. Five score above 50 (Established tier). The highest-rated is vectara/open-rag-eval at 53/100 with 347 stars.
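The tier labels in the listing that follows appear to track score bands: every Established entry scores above 50, the lowest Emerging entry scores 30, and the highest Experimental entry scores 29. The sketch below encodes that reading. The cutoffs and the function name `tier_for_score` are assumptions inferred from the listed scores, not documented rules of the site.

```python
def tier_for_score(score: int) -> str:
    """Map a 0-100 score to a tier label.

    Cutoffs are assumptions inferred from the listing:
    above 50 -> Established, 30 to 50 -> Emerging, below 30 -> Experimental.
    """
    if score > 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"


if __name__ == "__main__":
    # Scores taken from entries in the listing.
    for score in (53, 46, 30, 29, 10):
        print(score, tier_for_score(score))
```

No listed tool scores between 47 and 51, so the exact boundary between Emerging and Established cannot be read off this page.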

Get all 86 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=rag-evaluation-frameworks&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
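The same request can be made from Python with only the standard library. This is a minimal sketch: the endpoint and the query parameters are copied from the curl command above, while the helper names `build_url` and `fetch_dataset` are hypothetical. The response schema is not described on this page, so the code decodes the JSON without assuming any field names. Note that the curl call passes `limit=20`; whether a larger value returns all 86 projects is not stated here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="rag", subcategory="rag-evaluation-frameworks", limit=20):
    """Return the dataset URL with the same query parameters as the curl call."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_dataset(url, timeout=30):
    """GET the URL and decode the body as JSON (no schema is assumed)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    data = fetch_dataset(build_url())
    # Inspect the top-level shape before relying on any particular keys.
    print(type(data).__name__)
```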

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | vectara/open-rag-eval | RAG evaluation without the need for "golden answers" | 53 | Established |
| 2 | DocAILab/XRAG | XRAG: eXamining the Core - Benchmarking Foundational Component Modules in... | 53 | Established |
| 3 | HZYAI/RagScore | ⚡️ The "1-Minute RAG Audit": Generate QA datasets & evaluate RAG systems in... | 53 | Established |
| 4 | AIAnytime/rag-evaluator | A library for evaluating Retrieval-Augmented Generation (RAG) systems (The... | 52 | Established |
| 5 | microsoft/benchmark-qed | Automated benchmarking of Retrieval-Augmented Generation (RAG) systems | 52 | Established |
| 6 | 2501Pr0ject/RAGnarok-AI | Local-first RAG evaluation framework for LLM applications. 100% local, no... | 46 | Emerging |
| 7 | nuclia/nuclia-eval | Library for evaluating RAG using Nuclia's models | 43 | Emerging |
| 8 | TonicAI/tonic_validate | Metrics to evaluate the quality of responses of your Retrieval Augmented... | 43 | Emerging |
| 9 | syy12335/rag-eval-scaffold | Lightweight, decoupled RAG evaluation scaffold (dataset → vector store → RAG... | 40 | Emerging |
| 10 | AQ-MedAI/PRGB | [AAAI 2026] RAG, Benchmark, robust RAG generation | 39 | Emerging |
| 11 | SciPhi-AI/RAG-Performance | Measuring RAG solutions throughput and latency | 38 | Emerging |
| 12 | avnlp/rag-pipelines | Advanced RAG Pipelines and Evaluation | 38 | Emerging |
| 13 | vectara/mirage-bench | Repository for Multilingual Generation, RAG evaluations, and surrogate... | 37 | Emerging |
| 14 | gomate-community/rageval | Evaluation tools for Retrieval-augmented Generation (RAG) methods. | 36 | Emerging |
| 15 | AQ-MedAI/RagQALeaderboard | RAG-QA Leaderboard | 36 | Emerging |
| 16 | RulinShao/RAG-evaluation-harnesses | An evaluation suite for Retrieval-Augmented Generation (RAG). | 35 | Emerging |
| 17 | RUC-NLPIR/OmniEval | Open source code of the paper: "OmniEval: An Omnidirectional and Automatic... | 35 | Emerging |
| 18 | IAAR-Shanghai/CRUD_RAG | CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented... | 34 | Emerging |
| 19 | GURPREETKAURJETHRA/RAG-Evaluator | A library for evaluating Retrieval-Augmented Generation (RAG) systems | 33 | Emerging |
| 20 | TonicAI/tvallogging | A tool for evaluating and tracking your RAG experiments. This repo contains... | 33 | Emerging |
| 21 | RagView/RagView | We believe that every SOTA result is only valid on its own dataset. RAGView... | 32 | Emerging |
| 22 | gomate-community/rag-bench | RAG-Bench is to summarize all datasets used to evaluate RAG, from document... | 30 | Emerging |
| 23 | sitta07/RAGScope | A lightweight observability tool for visualizing and comparing RAG retrieval... | 29 | Experimental |
| 24 | amazon-science/MEMERAG | MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval... | 29 | Experimental |
| 25 | amazon-science/GaRAGe | [ACL 2025] GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation. | 28 | Experimental |
| 26 | antgroup/ravig-bench | Official implementation of "RAViG-Bench: A Benchmark for Retrieval-Augmented... | 28 | Experimental |
| 27 | christopherkormpos/ragret | Lightweight evaluation framework for Retrieval Augmented Generation systems,... | 28 | Experimental |
| 28 | foreai-co/fore | The fore client package | 27 | Experimental |
| 29 | oztrkoguz/RAG-Framework-Evaluation | This project aims to compare different Retrieval-Augmented Generation (RAG)... | 27 | Experimental |
| 30 | ALucek/custom-rag-evals | Applying domain specific evaluations to RAG chunking and embedding functions | 26 | Experimental |
| 31 | chu2bard/ragcraft | End-to-end RAG pipeline with built-in evaluation metrics | 26 | Experimental |
| 32 | Edouard-Legoupil/rag_extraction | A tutorial on how to build Summary Brief from Evaluation Report - Offline+Open Source | 25 | Experimental |
| 33 | tarekmasryo/rag-qa-logs-corpus-data | Synthetic multi-table RAG QA telemetry benchmark... | 25 | Experimental |
| 34 | Abanoubr/rag-eval-toolkit | Open-source Python toolkit for evaluating RAG pipelines. LLM-as-judge for... | 25 | Experimental |
| 35 | rodolfboctor/rag-eval-toolkit | Open-source Python toolkit for evaluating RAG pipelines. LLM-as-judge for... | 25 | Experimental |
| 36 | utkuakbay/RAG_Benchmark | Benchmark LLMs for your RAG system - Compare Gemini, GPT, Claude & local... | 24 | Experimental |
| 37 | Ziqing110/rag-evidence-attack-lab | Scientific QA robustness evaluation pipeline for evidence-missing RAG... | 23 | Experimental |
| 38 | Sabyasachig/ragtrace | DevTools for RAG pipelines | 23 | Experimental |
| 39 | ssisOneTeam/Korean-Embedding-Model-Performance-Benchmark-for-Retriever | Korean Sentence Embedding Model Performance Benchmark for RAG | 23 | Experimental |
| 40 | Monke1/ragcraft | 📚 Build and evaluate RAG pipelines to ingest, embed, retrieve, and answer... | 22 | Experimental |
| 41 | hari-sherith/bayesian-rag-uncertainty | RAG system with Bayesian uncertainty quantification using Beta priors and... | 22 | Experimental |
| 42 | nblomerus/rag-bench | RAG system for asking questions about AI/ML research papers | 22 | Experimental |
| 43 | wigtn/wigtnOCR-v1 | A research framework to evaluate how document parsing... | 22 | Experimental |
| 44 | OpenSymbolicAI/benchmark-py-MultiHopRAG | MultiHop-RAG Benchmark using GoalSeeking pattern from opensymbolicai-core | 22 | Experimental |
| 45 | Miro96/nova-rag-benchmark | Benchmark for Code RAG MCP Servers: measure how well RAG helps AI find the... | 22 | Experimental |
| 46 | EmmanuelleB985/mmeval-vrag | Evaluation Framework for Multimodal RAG Systems | 22 | Experimental |
| 47 | clouatre-labs/rag-reranking-benchmarks | Supplementary benchmarks for Making Legacy Knowledge Searchable with RAG | 22 | Experimental |
| 48 | Aamirofficiall/rag-playbook | Stop guessing which RAG pattern to use. Compare all 8 patterns with real... | 22 | Experimental |
| 49 | Eustema-S-p-A/SCARF | SCARF (System for Comprehensive Assessment of RAG Frameworks) is a modular... | 22 | Experimental |
| 50 | fkapsahili/EntRAG | EntRAG - Enterprise RAG Benchmark | 22 | Experimental |
| 51 | srivsr/evalkit | QA-grade RAG evaluation framework diagnosing retrieval, grounding,... | 21 | Experimental |
| 52 | anita-builds/aurora-rag-evaluation | Policy-grounded assistant notes: RAG and evaluation approach | 21 | Experimental |
| 53 | infrixo-systems/rag-evaluation-starter | Minimal Python script to evaluate your RAG pipeline against a golden set. No... | 21 | Experimental |
| 54 | itamaker/ragcheck | Score retrieval runs with Precision@k, Recall@k, HitRate@k, and MRR@k. | 21 | Experimental |
| 55 | rajantripathi/soas-rag-evaluation | Bilingual retrieval benchmark for culturally grounded QA in English and Uzbek | 21 | Experimental |
| 56 | sunilp/enterprise-rag-bench | Production RAG patterns for enterprise: chunking strategies, retrieval... | 21 | Experimental |
| 57 | amitk741/RAGnarok-AI | 🛠️ Evaluate and benchmark your RAG pipelines locally with RAGnarok-AI, no API... | 21 | Experimental |
| 58 | SURESHBEEKHANI/LLMops-beginner-to-advanced | Short description: RAG evaluation suite for AI Engineering Report | 21 | Experimental |
| 59 | antdragiotis/rag-evaluation-framework-II | An evaluation example for Retrieval-Augmented Generation (RAG) that provides... | 21 | Experimental |
| 60 | AKIVA-AI/toolkit-rag-quality | Deterministic RAG evaluation toolkit -- retrieval metrics (recall,... | 20 | Experimental |
| 61 | shaadclt/EvalRAG | A comprehensive evaluation toolkit for assessing Retrieval-Augmented... | 20 | Experimental |
| 62 | NamaWho/pyterrier-nuggetizer | Nuggetizer: A PyTerrier Open-Source Framework for Evaluating... | 20 | Experimental |
| 63 | tsdata/ranx-k | Korean-optimized RAG evaluation toolkit with Kiwi tokenizer, ROUGE metrics, ... | 19 | Experimental |
| 64 | dbhavery/ragtest | RAG evaluation suite: benchmark retrieval accuracy, generation quality, and... | 19 | Experimental |
| 65 | JhaAyush01/SEMALEX | A comprehensive RAG Evaluation Metric designed to measure the weighted... | 18 | Experimental |
| 66 | Mizokuiam/rag-eval-kit | A lightweight, modular Python toolkit for evaluating and benchmarking... | 18 | Experimental |
| 67 | nidhip1611/GroundedGeo | A Benchmark for Citation-Grounded Geographic QA | 17 | Experimental |
| 68 | daniel-e-alarcon/rag-explorer | Local-first RAG application with retrieval evaluation (hit@k, MRR) and... | 17 | Experimental |
| 69 | iom/evaluation_knowledge | A module to turn Evaluation Reports into AI knowledge | 16 | Experimental |
| 70 | yashk1103/Enhanced-Multi-Turn-RAG-Benchmark-Framework | Comprehensive benchmarking framework for evaluating 13+ embedding models on... | 15 | Experimental |
| 71 | c21051997/ragscope | 🏆 An open-source library for the comprehensive, end-to-end evaluation of RAG... | 15 | Experimental |
| 72 | rubsj/ai-rag-evaluation-framework | RAG pipeline evaluation framework with RAGAS metrics and statistical bias correction | 14 | Experimental |
| 73 | Hyeongseob91/research-vlm-based-document-parsing | A research framework to evaluate how document parsing... | 14 | Experimental |
| 74 | alp-oz/rag-metrics | RAG-Metrics: A modular framework for evaluating Retrieval-Augmented... | 13 | Experimental |
| 75 | hideyuki001/research-rag-instruction-pack | Research & Education oriented LangChain RAG framework (5P Principles + EUQS... | 13 | Experimental |
| 76 | chandana999/retrieval-evaluation-api | RAG retrieval evaluation tool with RAGAS. Compare 6 retriever strategies... | 13 | Experimental |
| 77 | sumit9000/Deep-Evaluation_Rag | The Deep Evaluation notebook helps you understand how well your machine... | 13 | Experimental |
| 78 | labofone/rag-eval | Reference-free evaluation of Retrieval-Augmented Generation (RAG) pipelines. | 13 | Experimental |
| 79 | beingdutta/Self-Refining-Lecture-RAG-For-Educational-Videos | Lecture-RAG is a grounding-aware Video-RAG framework that reduces... | 13 | Experimental |
| 80 | ash-hun/BERGEN-UP | E2E Evaluation Pipeline for ONLY RAG. Benchmark to BERGEN from NAVER Labs... | 12 | Experimental |
| 81 | i-partalas/industrial-rag-qna-benchmark | Benchmarking the performance of proprietary vs open-source LLMs in... | 12 | Experimental |
| 82 | JhaAyush01/RAG-Evaluation | Different approaches to evaluate RAG !!! | 11 | Experimental |
| 83 | Faridghr/Evaluate-RAG-Pipeline | Calculating and reporting the performance metrics of RAG pipeline | 11 | Experimental |
| 84 | amruta33/Rag_evaluation | Rag Evaluation using ragas | 11 | Experimental |
| 85 | heisguyy/RAG-Benchmark | This is a very simplified version of Benchmarking Large Language Models in... | 10 | Experimental |
| 86 | a-romero/qevals | Evaluation Framework for GenAI applications | 10 | Experimental |
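Several entries in the listing name retrieval metrics such as Precision@k, Recall@k, HitRate@k, and MRR@k. For readers comparing tools, the sketch below shows the usual per-query definitions of those four metrics. It is an illustrative implementation and not the code of any listed project. The function names and the toy document IDs are hypothetical, and the ranked list is assumed to contain no duplicate IDs.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def hit_rate_at_k(ranked, relevant, k):
    """1.0 if at least one relevant item appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def reciprocal_rank_at_k(ranked, relevant, k):
    """1 / rank of the first relevant item within the top k, else 0.0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


if __name__ == "__main__":
    ranked = ["d3", "d1", "d7", "d2"]   # retriever output, best first (toy data)
    relevant = {"d1", "d2"}             # ground-truth relevant IDs (toy data)
    k = 3
    print(precision_at_k(ranked, relevant, k))
    print(recall_at_k(ranked, relevant, k))
    print(hit_rate_at_k(ranked, relevant, k))
    print(reciprocal_rank_at_k(ranked, relevant, k))
```

Each function scores a single query. MRR@k and the other reported figures are typically the mean of these per-query values over an evaluation set.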