LeonEricsson/llmjudge
Exploring limitations of LLM-as-a-judge
This project helps people who evaluate Large Language Models (LLMs) understand how reliable those evaluations are, especially when another LLM does the judging. It takes various prompt templates and LLM responses to a specific task (such as identifying misspellings) and shows how accurately the judging LLM assigns scores. This is useful for anyone building or deploying LLM applications who needs to trust their model's performance metrics.
No commits in the last 6 months.
Use this if you are using LLMs to evaluate other LLMs (LLM-as-a-judge) and want to understand the accuracy and limitations of different prompting strategies for scoring.
Not ideal if you are looking for a plug-and-play solution for general LLM evaluation or if your primary concern is traditional, human-centric evaluation methods.
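To make the judging flow above concrete, here is a minimal sketch of the LLM-as-a-judge pattern the repo explores. The prompt template, judge model, and 1-to-5 scoring scale are illustrative assumptions, not code from the repository (the repo itself is a set of Jupyter notebooks):

```python
# Minimal sketch of the LLM-as-a-judge pattern this repo studies.
# The prompt template, model name, and scoring scale are illustrative
# assumptions, not the repo's actual code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = """You are grading a model's answer to a task.
Task: {task}
Model answer: {answer}
Reference answer: {reference}
Reply with a single integer score from 1 (wrong) to 5 (perfect)."""

def judge(task: str, answer: str, reference: str) -> int:
    """Ask a judge LLM to score one response; returns the parsed score."""
    prompt = JUDGE_TEMPLATE.format(task=task, answer=answer, reference=reference)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judging
    )
    # Sketch-level parsing: assumes the judge replies with a bare integer.
    return int(resp.choices[0].message.content.strip())

# Comparing judge scores against known-correct labels (as the repo does
# for its misspelling task) is what exposes how accurate a given prompt
# template actually is.
score = judge(
    task="Identify whether the word 'recieve' is misspelled.",
    answer="Yes, it should be spelled 'receive'.",
    reference="The word is misspelled; correct spelling is 'receive'.",
)
print(score)
```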
Stars: 20
Forks: 2
Language: Jupyter Notebook
License: —
Category: —
Last pushed: Aug 17, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/LeonEricsson/llmjudge"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
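For scripted access, the same endpoint can be queried from Python. A small sketch assuming the standard requests library; the response schema is not documented here, so the JSON is simply pretty-printed:

```python
# Hedged example of calling the endpoint shown in the curl command above.
import json
import requests

url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/LeonEricsson/llmjudge"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```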
Higher-rated alternatives
EvolvingLMMs-Lab/lmms-eval
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
vibrantlabsai/ragas
Supercharge Your LLM Application Evaluations 🚀
open-compass/VLMEvalKit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
EuroEval/EuroEval
The robust European language model benchmark.
Giskard-AI/giskard-oss
🐢 Open-Source Evaluation & Testing library for LLM Agents