ibm-self-serve-assets/JudgeIt-LLM-as-a-Judge
Automation framework that uses LLM-as-a-judge to evaluate Agentic AI, RAG, and Text2SQL pipelines at scale, serving as a good proxy for human judgement.
This framework helps AI engineers rapidly and accurately evaluate the performance of Generative AI pipelines like RAG, multi-turn conversations, and text-to-SQL. It takes datasets of AI-generated text and expected "golden" text, then uses an LLM to automatically judge their quality. This enables AI engineers to test many more pipeline versions faster than manual human review.
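The core LLM-as-a-judge idea can be sketched as follows. This is a minimal illustration, not JudgeIt's actual API: the prompt wording, the `call_llm` callable, and the CORRECT/INCORRECT scoring scheme are all assumptions for demonstration.

```python
from typing import Callable, Iterable, Tuple

# Illustrative judge prompt; JudgeIt's real prompts and rubric may differ.
JUDGE_PROMPT = (
    "You are a strict grader.\n"
    "Golden answer:\n{golden}\n\n"
    "Candidate answer:\n{candidate}\n\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)

def judge_batch(
    pairs: Iterable[Tuple[str, str]],
    call_llm: Callable[[str], str],
) -> float:
    """Return the fraction of (golden, candidate) pairs judged correct.

    `call_llm` is a placeholder for any chat-completion client that takes
    a prompt string and returns the model's reply as a string.
    """
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = 0
    for golden, candidate in pairs:
        verdict = call_llm(JUDGE_PROMPT.format(golden=golden, candidate=candidate))
        if verdict.strip().upper().startswith("CORRECT"):
            correct += 1
    return correct / len(pairs)
```

Because the judge is just a function of (golden, candidate) pairs, the same loop can score hundreds of pipeline variants in a batch run, which is what makes this faster than manual human review.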
No commits in the last 6 months.
Use this if you need to quickly and reliably assess the quality of outputs from your Generative AI applications at scale, replacing slow and expensive human evaluators.
Not ideal if you only need a quick, one-off evaluation or if your Generative AI outputs are not text-based.
Stars: 34
Forks: 6
Language: Python
License: Apache-2.0
Category:
Last pushed: Oct 09, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/ibm-self-serve-assets/JudgeIt-LLM-as-a-Judge"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
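The same data can be fetched programmatically. A minimal Python sketch, assuming only the endpoint shown in the curl command above (the JSON response schema is not documented here, so inspect the raw response before relying on specific fields):

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality/rag"

def build_url(owner_repo: str) -> str:
    """Build the quality-API URL for a given 'owner/repo' string."""
    return f"{API_BASE}/{owner_repo}"

def fetch_quality(owner_repo: str) -> dict:
    """Fetch and parse the JSON quality data for a repository.

    The response is assumed to be a JSON object; field names are not
    documented on this page, so print the result to discover them.
    """
    with urllib.request.urlopen(build_url(owner_repo), timeout=10) as resp:
        return json.load(resp)
```

For example, `fetch_quality("ibm-self-serve-assets/JudgeIt-LLM-as-a-Judge")` mirrors the curl command above. Unauthenticated calls are limited to 100 requests/day.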
Higher-rated alternatives
amazon-science/auto-rag-eval
Code repo for the ICML 2024 paper "Automated Evaluation of Retrieval-Augmented Language Models...
explore-de/rage4j
Evaluate your LLM-based Java apps
mit-ll-ai-technology/llm-sandbox
Large language model evaluation framework for logic and open-ended Q&A with a variety of RAG and...
nl4opt/ORQA
[AAAI 2025] ORQA is a new QA benchmark designed to assess the reasoning capabilities of LLMs in...