haizelabs/verdict

Inference-time scaling for LLM-as-a-judge systems.

Quality score: 55 / 100 (Established)

This project helps AI developers and researchers reliably evaluate the quality and safety of their applications, particularly those powered by large language models. It takes an LLM's output and runs it through a series of structured judgment steps (sketched below) to produce a consistent, accurate assessment, much as a panel of human experts might review content.
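To make the pattern concrete, here is a minimal, self-contained Python sketch of the layered LLM-as-a-judge idea that Verdict scales: several independent judge passes whose scores are pooled into one verdict. All names here (call_llm, judge_once, judge_ensemble) are hypothetical illustrations, not Verdict's actual API; consult the repository README for real usage.

from statistics import mean

def call_llm(prompt: str) -> str:
    # Stub for an LLM call; replace with your provider's client.
    return "4"  # placeholder score

def judge_once(question: str, answer: str, rubric: str) -> float:
    # One judge pass: ask the model to grade the answer on a 1-5 scale.
    prompt = (
        f"Rubric: {rubric}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Score the answer from 1 (worst) to 5 (best). Reply with a number."
    )
    return float(call_llm(prompt))

def judge_ensemble(question: str, answer: str, rubric: str, n: int = 3) -> float:
    # A layer of independent judges whose scores are mean-pooled,
    # spending extra inference-time compute for a more stable verdict.
    return mean(judge_once(question, answer, rubric) for _ in range(n))

score = judge_ensemble("What is the capital of France?", "Paris.", "Factual accuracy")
print(f"aggregate score: {score:.2f}")

Running more judges per layer raises cost linearly but damps the variance of any single judgment, which is the trade-off the project's "inference-time scaling" framing refers to.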

332 stars. Available on PyPI.

Use this if you need to build highly reliable, scalable, and fast automated evaluation systems for your LLM applications, especially for tasks like content moderation, hallucination detection, or fact-checking.

Not ideal if you only need a simple, single-prompt evaluation for basic or non-critical LLM outputs, or if you don't need to scale complex judging pipelines.

Tags: AI-evaluation, LLM-guardrails, content-moderation, AI-safety, prompt-engineering
Score breakdown (four subscores, summing to the 55/100 overall):
Maintenance 6 / 25
Adoption 10 / 25
Maturity 25 / 25
Community 14 / 25

Stars: 332
Forks: 24
Language: Jupyter Notebook
License: MIT
Last pushed: Nov 05, 2025
Commits (30d): 0
Dependencies: 19

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/haizelabs/verdict"

Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
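The same data can also be fetched from a script. Below is a short Python sketch using the requests library against the endpoint above; the JSON field names (score, stars, last_pushed) are assumptions based on the stats shown on this page, not a documented schema.

import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/haizelabs/verdict"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()
data = resp.json()

# Field names below are guesses; adjust to the actual payload.
print(data.get("score"), data.get("stars"), data.get("last_pushed"))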