microsoft/benchmark-qed

Automated benchmarking of Retrieval-Augmented Generation (RAG) systems

Score: 52 / 100 (Established)

This tool helps AI developers and researchers thoroughly test their Retrieval-Augmented Generation (RAG) systems. It generates standardized queries from your datasets, collects your RAG system's answers, and uses an LLM to score those answers against a set of quality metrics. This enables consistent, scalable benchmarking of how well your RAG system performs.

Use this if you are developing or evaluating RAG systems and need an automated, reproducible way to test their accuracy, relevance, and overall performance across different types of queries and datasets.

Not ideal if you are a non-technical end user looking for a pre-built RAG application rather than a tool for benchmarking RAG system development.
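The evaluation loop described above follows the common LLM-as-judge pattern: for each generated query, the RAG system's answer is graded by a language model against a named metric. The sketch below illustrates that pattern generically; the function and prompt shapes are hypothetical and are not taken from the benchmark-qed API.

```python
# Illustrative sketch of the LLM-as-judge pattern that tools like
# benchmark-qed automate. All names here are hypothetical, not the
# actual benchmark-qed API.

def build_judge_prompt(query: str, answer: str, metric: str) -> str:
    """Assemble a grading prompt asking an LLM to score one answer
    on one quality metric (e.g. relevance) from 1 to 5."""
    return (
        f"You are grading a RAG system's answer on {metric}.\n"
        f"Query: {query}\n"
        f"Answer: {answer}\n"
        "Reply with a single integer score from 1 (poor) to 5 (excellent)."
    )


def parse_score(llm_reply: str) -> int:
    """Extract the first integer score 1-5 from the judge model's reply."""
    for token in llm_reply.split():
        cleaned = token.strip(".,:;")
        if cleaned in {"1", "2", "3", "4", "5"}:
            return int(cleaned)
    raise ValueError(f"no score found in reply: {llm_reply!r}")
```

Usage: send `build_judge_prompt(...)` to the judge LLM of your choice, then pass its text reply to `parse_score` to obtain a comparable number per metric; averaging those scores over many standardized queries yields the kind of benchmark the tool produces.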

AI-development LLM-benchmarking RAG-evaluation natural-language-processing model-testing
No Package · No Dependents
Maintenance 10 / 25
Adoption 9 / 25
Maturity 16 / 25
Community 17 / 25

Stars: 78
Forks: 14
Language: Python
License: MIT
Last pushed: Mar 04, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/rag/microsoft/benchmark-qed"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
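The same endpoint shown in the curl command can be called from Python with only the standard library. This is a minimal sketch: the URL is copied from the curl example above, but the shape of the JSON response is not documented here, so `fetch_quality` simply returns the decoded payload as-is.

```python
# Minimal sketch of querying the listing's public API using only the
# Python standard library. The base URL comes from the curl example;
# the response schema is an assumption (JSON is assumed but not
# documented on this page).
import json
from urllib.request import urlopen

BASE = "https://pt-edge.onrender.com/api/v1/quality/rag"


def quality_url(owner: str, repo: str) -> str:
    """Build the API URL for a given GitHub repository."""
    return f"{BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality record for owner/repo (network required)."""
    with urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)


# Example (performs a live request, subject to the 100 requests/day limit):
# data = fetch_quality("microsoft", "benchmark-qed")
```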