ibm-self-serve-assets/JudgeIt-LLM-as-a-Judge
Automation framework that uses LLM-as-a-judge to evaluate Agentic AI, RAG, and Text2SQL pipelines at scale, serving as a good proxy for human judgement.
This framework helps AI engineers rapidly and accurately evaluate the performance of Generative AI pipelines like RAG, multi-turn conversations, and text-to-SQL. It takes datasets of AI-generated text and expected "golden" text, then uses an LLM to automatically judge their quality. This enables AI engineers to test many more pipeline versions faster than manual human review.
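The core LLM-as-a-judge idea can be sketched as follows. This is a minimal illustration, not JudgeIt's actual API: the prompt wording, the `call_llm` callable, and the CORRECT/INCORRECT scoring scheme are all assumptions for demonstration.

```python
from typing import Callable, Iterable, Tuple

# Illustrative judge prompt; JudgeIt's real prompts and rubric may differ.
JUDGE_PROMPT = (
    "You are a strict grader.\n"
    "Golden answer:\n{golden}\n\n"
    "Candidate answer:\n{candidate}\n\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)

def judge_batch(
    pairs: Iterable[Tuple[str, str]],
    call_llm: Callable[[str], str],
) -> float:
    """Return the fraction of (golden, candidate) pairs judged correct.

    `call_llm` is a placeholder for any chat-completion client that takes
    a prompt string and returns the model's reply as a string.
    """
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = 0
    for golden, candidate in pairs:
        verdict = call_llm(JUDGE_PROMPT.format(golden=golden, candidate=candidate))
        if verdict.strip().upper().startswith("CORRECT"):
            correct += 1
    return correct / len(pairs)
```

Because the judge is just a function of (golden, candidate) pairs, the same loop can score hundreds of pipeline variants in a batch run, which is what makes this faster than manual human review.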
No commits in the last 6 months.
Use this if you need to quickly and reliably assess the quality of outputs from your Generative AI applications at scale, replacing slow and expensive human evaluators.
Not ideal if you only need a quick, one-off evaluation or if your Generative AI outputs are not text-based.
Stars: 34
Forks: 6
Language: Python
License: Apache-2.0
Category:
Last pushed: Oct 09, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/ibm-self-serve-assets/JudgeIt-LLM-as-a-Judge"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
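The same data can be fetched programmatically. A minimal Python sketch, assuming only the endpoint shown in the curl command above (the JSON response schema is not documented here, so inspect the raw response before relying on specific fields):

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality/rag"

def build_url(owner_repo: str) -> str:
    """Build the quality-API URL for a given 'owner/repo' string."""
    return f"{API_BASE}/{owner_repo}"

def fetch_quality(owner_repo: str) -> dict:
    """Fetch and parse the JSON quality data for a repository.

    The response is assumed to be a JSON object; field names are not
    documented on this page, so print the result to discover them.
    """
    with urllib.request.urlopen(build_url(owner_repo), timeout=10) as resp:
        return json.load(resp)
```

For example, `fetch_quality("ibm-self-serve-assets/JudgeIt-LLM-as-a-Judge")` mirrors the curl command above. Unauthenticated calls are limited to 100 requests/day.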
Higher-rated alternatives
amazon-science/auto-rag-eval
Code repo for the ICML 2024 paper "Automated Evaluation of Retrieval-Augmented Language Models...
explore-de/rage4j
Evaluate your LLM-based Java apps
mit-ll-ai-technology/llm-sandbox
Large language model evaluation framework for logic and open-ended Q&A with a variety of RAG and...
nl4opt/ORQA
[AAAI 2025] ORQA is a new QA benchmark designed to assess the reasoning capabilities of LLMs in...