haizelabs/verdict
Inference-time scaling for LLMs-as-a-judge.
This project helps developers and researchers reliably evaluate the quality and safety of their AI applications, particularly those powered by Large Language Models. It takes the output of an LLM and applies a series of structured judgment steps to produce a consistent, accurate assessment, much as a panel of human experts might review content (see the sketch below).
332 stars. Available on PyPI.
Use this if you need to build highly reliable, scalable, and fast automated evaluation systems for your LLM applications, especially for tasks like content moderation, hallucination detection, or fact-checking.
Not ideal if you are looking for a simple, single-prompt evaluation solution for basic or non-critical LLM outputs, or if you don't need to scale complex judgment pipelines.
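To make "inference-time scaling for LLMs-as-a-judge" concrete, here is a minimal conceptual sketch in Python: several independent judge calls on the same output are aggregated by majority vote. The names aggregate_judgments and judge_once are hypothetical stand-ins introduced for illustration; this shows the general idea, not Verdict's actual API.

from collections import Counter
from typing import Callable

def aggregate_judgments(
    output: str,
    judge_once: Callable[[str], str],  # hypothetical: one LLM judge call returning a label such as "pass" or "fail"
    n_samples: int = 5,
) -> str:
    # Run several independent judge calls and return the majority label.
    # Conceptual illustration of scaling judgment at inference time, not Verdict's interface.
    votes = Counter(judge_once(output) for _ in range(n_samples))
    label, _count = votes.most_common(1)[0]
    return label

# Example with a trivial stand-in judge; a real system would call an LLM here.
verdict = aggregate_judgments("The capital of France is Paris.", judge_once=lambda _: "pass")
print(verdict)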
Stars: 332
Forks: 24
Language: Jupyter Notebook
License: MIT
Category:
Last pushed: Nov 05, 2025
Commits (30d): 0
Dependencies: 19
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/haizelabs/verdict"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
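For programmatic use, a minimal Python sketch of the same request with the requests library. It uses the no-key tier described above and assumes the endpoint returns a JSON body (not confirmed on this page).

import requests

# Same endpoint as the curl example above; the free tier needs no API key.
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/haizelabs/verdict"

response = requests.get(URL, timeout=10)
response.raise_for_status()

# Assumption: the body is JSON containing the repository stats shown above.
data = response.json()
print(data)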
Related models
jncraton/languagemodels
Explore large language models in 512MB of RAM
microsoft/unilm
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
albertan017/LLM4Decompile
Reverse Engineering: Decompiling Binary Code with Large Language Models
bytedance/Sa2VA
Official Repo For Pixel-LLM Codebase
Cardinal-Operations/ORLM
ORLM: Training Large Language Models for Optimization Modeling