haizelabs/verdict
Inference-time scaling for LLMs-as-a-judge.
This project helps developers and researchers reliably evaluate the quality and safety of their AI applications, particularly those powered by Large Language Models. It takes the output of an LLM and applies a series of structured judgment steps to produce a consistent, accurate assessment, much as a panel of human experts might review content (see the sketch below).
332 stars. Available on PyPI.
Use this if you need to build highly reliable, scalable, and fast automated evaluation systems for your LLM applications, especially for tasks like content moderation, hallucination detection, or fact-checking.
Not ideal if you are looking for a simple, single-prompt evaluation solution for basic or non-critical LLM outputs, or if you don't need to scale complex judgment pipelines.
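To make "inference-time scaling for LLMs-as-a-judge" concrete, here is a minimal conceptual sketch in Python: several independent judge calls on the same output are aggregated by majority vote. The names aggregate_judgments and judge_once are hypothetical stand-ins introduced for illustration; this shows the general idea, not Verdict's actual API.

from collections import Counter
from typing import Callable

def aggregate_judgments(
    output: str,
    judge_once: Callable[[str], str],  # hypothetical: one LLM judge call returning a label such as "pass" or "fail"
    n_samples: int = 5,
) -> str:
    # Run several independent judge calls and return the majority label.
    # Conceptual illustration of scaling judgment at inference time, not Verdict's interface.
    votes = Counter(judge_once(output) for _ in range(n_samples))
    label, _count = votes.most_common(1)[0]
    return label

# Example with a trivial stand-in judge; a real system would call an LLM here.
verdict = aggregate_judgments("The capital of France is Paris.", judge_once=lambda _: "pass")
print(verdict)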
Stars: 332
Forks: 24
Language: Jupyter Notebook
License: MIT
Category:
Last pushed: Nov 05, 2025
Commits (30d): 0
Dependencies: 19
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/haizelabs/verdict"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
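For programmatic use, a minimal Python sketch of the same request with the requests library. It uses the no-key tier described above and assumes the endpoint returns a JSON body (not confirmed on this page).

import requests

# Same endpoint as the curl example above; the free tier needs no API key.
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/haizelabs/verdict"

response = requests.get(URL, timeout=10)
response.raise_for_status()

# Assumption: the body is JSON containing the repository stats shown above.
data = response.json()
print(data)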
Related models
jncraton/languagemodels
Explore large language models in 512MB of RAM
microsoft/unilm
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
albertan017/LLM4Decompile
Reverse Engineering: Decompiling Binary Code with Large Language Models
bytedance/Sa2VA
Official Repo For Pixel-LLM Codebase
Cardinal-Operations/ORLM
ORLM: Training Large Language Models for Optimization Modeling