microsoft/benchmark-qed

Automated benchmarking of Retrieval-Augmented Generation (RAG) systems

Score: 52 / 100 (Established)

This tool helps AI developers and researchers thoroughly test their Retrieval-Augmented Generation (RAG) systems. It generates standardized queries from your datasets, collects your RAG system's answers, and uses an LLM to score those answers against a set of quality metrics. This enables consistent, scalable benchmarking of how well your RAG system performs.

Use this if you are developing or evaluating RAG systems and need an automated, reproducible way to test their accuracy, relevance, and overall performance across different types of queries and datasets.

Not ideal if you are a non-technical end user looking for a pre-built RAG application rather than a tool for benchmarking RAG system development.
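The evaluation loop described above follows the common LLM-as-judge pattern: for each generated query, the RAG system's answer is graded by a language model against a named metric. The sketch below illustrates that pattern generically; the function and prompt shapes are hypothetical and are not taken from the benchmark-qed API.

```python
# Illustrative sketch of the LLM-as-judge pattern that tools like
# benchmark-qed automate. All names here are hypothetical, not the
# actual benchmark-qed API.

def build_judge_prompt(query: str, answer: str, metric: str) -> str:
    """Assemble a grading prompt asking an LLM to score one answer
    on one quality metric (e.g. relevance) from 1 to 5."""
    return (
        f"You are grading a RAG system's answer on {metric}.\n"
        f"Query: {query}\n"
        f"Answer: {answer}\n"
        "Reply with a single integer score from 1 (poor) to 5 (excellent)."
    )


def parse_score(llm_reply: str) -> int:
    """Extract the first integer score 1-5 from the judge model's reply."""
    for token in llm_reply.split():
        cleaned = token.strip(".,:;")
        if cleaned in {"1", "2", "3", "4", "5"}:
            return int(cleaned)
    raise ValueError(f"no score found in reply: {llm_reply!r}")
```

Usage: send `build_judge_prompt(...)` to the judge LLM of your choice, then pass its text reply to `parse_score` to obtain a comparable number per metric; averaging those scores over many standardized queries yields the kind of benchmark the tool produces.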

AI-development LLM-benchmarking RAG-evaluation natural-language-processing model-testing
No Package · No Dependents
Maintenance 10 / 25
Adoption 9 / 25
Maturity 16 / 25
Community 17 / 25

Stars: 78
Forks: 14
Language: Python
License: MIT
Last pushed: Mar 04, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/rag/microsoft/benchmark-qed"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
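The same endpoint shown in the curl command can be called from Python with only the standard library. This is a minimal sketch: the URL is copied from the curl example above, but the shape of the JSON response is not documented here, so `fetch_quality` simply returns the decoded payload as-is.

```python
# Minimal sketch of querying the listing's public API using only the
# Python standard library. The base URL comes from the curl example;
# the response schema is an assumption (JSON is assumed but not
# documented on this page).
import json
from urllib.request import urlopen

BASE = "https://pt-edge.onrender.com/api/v1/quality/rag"


def quality_url(owner: str, repo: str) -> str:
    """Build the API URL for a given GitHub repository."""
    return f"{BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality record for owner/repo (network required)."""
    with urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)


# Example (performs a live request, subject to the 100 requests/day limit):
# data = fetch_quality("microsoft", "benchmark-qed")
```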