ibm-self-serve-assets/JudgeIt-LLM-as-a-Judge

Automation framework using LLM-as-a-Judge to evaluate Agentic AI, RAG, and Text2SQL pipelines at scale; it serves as a good proxy for human judgement.

Overall score: 40 / 100 (Emerging)

This framework helps AI engineers rapidly and accurately evaluate the performance of Generative AI pipelines such as RAG, multi-turn conversation, and text-to-SQL. It takes datasets of AI-generated text paired with expected "golden" text, then uses an LLM to automatically judge their quality. This lets AI engineers test far more pipeline versions than manual human review would allow; a minimal sketch of the judging loop appears below.
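A minimal sketch of an LLM-as-a-judge loop, under stated assumptions: JUDGE_PROMPT and call_llm are hypothetical placeholders for illustration, not the repo's actual prompts or API.

# Hypothetical sketch of the pattern JudgeIt automates; not the repo's code.
JUDGE_PROMPT = (
    "You are a strict evaluator. Compare the candidate answer to the golden "
    "answer and reply with a single integer: 1 if equivalent, 0 otherwise.\n"
    "Golden: {golden}\nCandidate: {candidate}"
)

def call_llm(prompt: str) -> str:
    # Placeholder: plug in any chat-completion client here.
    raise NotImplementedError

def judge(pairs: list[tuple[str, str]]) -> float:
    # pairs holds (candidate, golden) text; returns fraction judged equivalent.
    hits = 0
    for candidate, golden in pairs:
        reply = call_llm(JUDGE_PROMPT.format(golden=golden, candidate=candidate))
        hits += int(reply.strip().startswith("1"))
    return hits / len(pairs)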

No commits in the last 6 months.

Use this if you need to quickly and reliably assess the quality of outputs from your Generative AI applications at scale, replacing slow and expensive human evaluators.

Not ideal if you only need a quick, one-off evaluation or if your Generative AI outputs are not text-based.

Tags: Generative AI, AI Evaluation, RAG, Agentic AI, AI Quality Assurance
Flags: Stale (6m), No Package, No Dependents
Maintenance: 2 / 25
Adoption: 7 / 25
Maturity: 16 / 25
Community: 15 / 25
(These four category scores sum to the 40 / 100 overall.)


Stars: 34
Forks: 6
Language: Python
License: Apache-2.0
Last pushed: Oct 09, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/rag/ibm-self-serve-assets/JudgeIt-LLM-as-a-Judge"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
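For scripted access, the same endpoint can be queried from Python. A minimal example (requires the third-party requests package; the response schema is not documented here, so the JSON payload is printed as-is):

import requests  # pip install requests

# Same public endpoint as the curl example above; subject to the 100/day limit.
url = ("https://pt-edge.onrender.com/api/v1/quality/rag/"
       "ibm-self-serve-assets/JudgeIt-LLM-as-a-Judge")
resp = requests.get(url, timeout=30)
resp.raise_for_status()
print(resp.json())  # no field names assumed; inspect the payload directly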