LeonEricsson/llmjudge

Exploring limitations of LLM-as-a-judge

Score: 22 / 100 (Experimental)

This project helps people who evaluate Large Language Models (LLMs) understand how reliable those evaluations are, especially when another LLM does the judging. It takes in various prompt templates and LLM responses to a specific task (like identifying misspellings) and shows how accurately the judging LLM assigns scores. This is useful for anyone who builds or deploys LLM applications and needs to trust their model's performance metrics.
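To make the setup concrete, here is a minimal Python sketch of an LLM-as-a-judge accuracy check on a misspelling-detection task; the toy_judge function is a hypothetical stand-in for a real judge-LLM call, not the repository's actual code.

# Toy stand-in for a judge LLM: returns 1 if it thinks the word is
# misspelled, 0 otherwise. In practice this would be an LLM call
# driven by a prompt template.
KNOWN_WORDS = {"evaluation", "judge", "prompt", "response"}

def toy_judge(word: str) -> int:
    return 0 if word.lower() in KNOWN_WORDS else 1

# Gold labels: 1 = actually misspelled, 0 = correctly spelled.
samples = [("evaluation", 0), ("judgge", 1), ("prompt", 0), ("responnse", 1)]

hits = sum(toy_judge(word) == gold for word, gold in samples)
print(f"judge accuracy: {hits / len(samples):.2f}")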

No commits in the last 6 months.

Use this if you are using LLMs to evaluate other LLMs (LLM-as-a-judge) and want to understand the accuracy and limitations of different prompting strategies for scoring.

Not ideal if you are looking for a plug-and-play solution for general LLM evaluation or if your primary concern is traditional, human-centric evaluation methods.

LLM-evaluation AI-model-testing prompt-engineering natural-language-processing AI-research
No License · Stale (6m) · No Package · No Dependents
Maintenance 0 / 25
Adoption 6 / 25
Maturity 8 / 25
Community 8 / 25

How are scores calculated?
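The overall score appears to be the sum of the four category scores: 0 + 6 + 8 + 8 = 22 out of a possible 100.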

Stars: 20
Forks: 2
Language: Jupyter Notebook
License: None
Last pushed: Aug 17, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/LeonEricsson/llmjudge"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
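The same data can be fetched programmatically; a minimal Python sketch follows, assuming only that the endpoint returns JSON (the response schema is not documented here).

import requests

url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/LeonEricsson/llmjudge"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
# The schema isn't documented here, so just print whatever JSON comes back.
print(resp.json())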