microsoft/benchmark-qed
Automated benchmarking of Retrieval-Augmented Generation (RAG) systems
This tool helps AI developers and researchers benchmark their Retrieval-Augmented Generation (RAG) systems. From your datasets it generates standardized test queries, then uses an LLM judge to score your RAG system's answers to those queries against a set of quality metrics, giving you consistent, scalable benchmarks of how well your system performs.
Use this if you are developing or evaluating RAG systems and need an automated, reproducible way to test their accuracy, relevance, and overall performance across different types of queries and datasets.
Not ideal if you are a non-technical end user looking for a pre-built RAG application; this is a benchmarking tool for RAG system development, not an end-user product.
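To make the workflow concrete, here is a minimal sketch of the generate-queries, answer, LLM-judge loop described above. Every name in it is hypothetical; this is not benchmark-qed's actual API, just the shape of the process.

from dataclasses import dataclass

@dataclass
class Judgment:
    query: str
    answer: str
    scores: dict  # metric name -> score assigned by the LLM judge

def generate_queries(documents: list[str], n: int = 10) -> list[str]:
    # Stand-in for a query synthesizer; a real one would derive standardized
    # queries from the source documents. Here we emit numbered placeholders.
    return [f"Standardized query #{i} over the corpus" for i in range(n)]

def llm_judge(query: str, answer: str, metrics: list[str]) -> dict:
    # Stand-in for an LLM call that scores an answer on each metric.
    # A real implementation would prompt a judge model and parse its output.
    return {m: 0.0 for m in metrics}

def benchmark(documents: list[str], rag_answer_fn, metrics=("relevance", "correctness")):
    results = []
    for q in generate_queries(documents):
        a = rag_answer_fn(q)  # the RAG system under test
        results.append(Judgment(q, a, llm_judge(q, a, list(metrics))))
    return results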
Stars: 78
Forks: 14
Language: Python
License: MIT
Last pushed: Mar 04, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/microsoft/benchmark-qed"
Open to everyone: 100 requests/day with no key required. Get a free key to raise the limit to 1,000 requests/day.
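If you'd rather fetch the same record from Python, here is a minimal sketch using the requests library. The endpoint URL is taken from the curl example above; the shape of the JSON response is an assumption, as it is not documented here.

import requests

# Same endpoint as the curl example; no API key needed for up to 100 requests/day.
url = "https://pt-edge.onrender.com/api/v1/quality/rag/microsoft/benchmark-qed"

resp = requests.get(url, timeout=10)
resp.raise_for_status()  # fail loudly on 4xx/5xx (e.g. rate-limit errors)
data = resp.json()       # assumed to be a JSON object describing the repo
print(data)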
Related tools
vectara/open-rag-eval
RAG evaluation without the need for "golden answers"
DocAILab/XRAG
XRAG: eXamining the Core - Benchmarking Foundational Component Modules in Advanced...
HZYAI/RagScore
⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or...
AIAnytime/rag-evaluator
A library for evaluating Retrieval-Augmented Generation (RAG) systems (The traditional ways).
2501Pr0ject/RAGnarok-AI
Local-first RAG evaluation framework for LLM applications. 100% local, no API keys required.