LeonEricsson/llmjudge

Exploring limitations of LLM-as-a-judge

Score: 22 / 100 (Experimental)

This project helps people who evaluate Large Language Models (LLMs) understand how reliable those evaluations are, especially when another LLM does the judging. It takes in various prompt templates and LLM responses to a specific task (like identifying misspellings) and shows how accurately the judging LLM assigns scores. This is useful for anyone who builds or deploys LLM applications and needs to trust their model's performance metrics.
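To make the setup concrete, here is a minimal Python sketch of an LLM-as-a-judge accuracy check on a misspelling-detection task; the toy_judge function is a hypothetical stand-in for a real judge-LLM call, not the repository's actual code.

# Toy stand-in for a judge LLM: returns 1 if it thinks the word is
# misspelled, 0 otherwise. In practice this would be an LLM call
# driven by a prompt template.
KNOWN_WORDS = {"evaluation", "judge", "prompt", "response"}

def toy_judge(word: str) -> int:
    return 0 if word.lower() in KNOWN_WORDS else 1

# Gold labels: 1 = actually misspelled, 0 = correctly spelled.
samples = [("evaluation", 0), ("judgge", 1), ("prompt", 0), ("responnse", 1)]

hits = sum(toy_judge(word) == gold for word, gold in samples)
print(f"judge accuracy: {hits / len(samples):.2f}")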

No commits in the last 6 months.

Use this if you are using LLMs to evaluate other LLMs (LLM-as-a-judge) and want to understand the accuracy and limitations of different prompting strategies for scoring.

Not ideal if you are looking for a plug-and-play solution for general LLM evaluation or if your primary concern is traditional, human-centric evaluation methods.

LLM-evaluation AI-model-testing prompt-engineering natural-language-processing AI-research
No License · Stale (6m) · No Package · No Dependents
Maintenance 0 / 25
Adoption 6 / 25
Maturity 8 / 25
Community 8 / 25

How are scores calculated?
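The overall score appears to be the sum of the four category scores: 0 + 6 + 8 + 8 = 22 out of a possible 100.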

Stars: 20
Forks: 2
Language: Jupyter Notebook
License: None
Last pushed: Aug 17, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/LeonEricsson/llmjudge"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
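The same data can be fetched programmatically; a minimal Python sketch follows, assuming only that the endpoint returns JSON (the response schema is not documented here).

import requests

url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/LeonEricsson/llmjudge"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
# The schema isn't documented here, so just print whatever JSON comes back.
print(resp.json())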