deshwalmahesh/PHUDGE
Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without a custom rubric or reference answer, in absolute or relative mode, and more. It also collects available tools, methods, repos, and code for hallucination detection, LLM evaluation, grading, and related tasks.
This tool helps you objectively assess the quality of responses generated by your Large Language Models (LLMs), or even human-written answers. You provide a question and a response, and it returns a quality score from 1 to 5. It's ideal for anyone who needs to verify the accuracy and helpfulness of AI-generated or human-written content in customer support, content creation, or knowledge management.
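The core flow is simple enough to sketch. The snippet below illustrates absolute grading with a judge model loaded through Hugging Face transformers; the model ID, prompt wording, and rubric are placeholder assumptions, not PHUDGE's actual interface, so check the repo's notebooks for the exact prompt format.

# Illustrative sketch only: model ID, prompt, and rubric are assumptions.
# PHUDGE fine-tunes Phi-3 as a judge; any causal LM judge follows the same pattern.
from transformers import pipeline

judge = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct")  # placeholder model

question = "Explain what a confusion matrix is."
answer = "A confusion matrix summarises a classifier's predictions against the true labels."
rubric = "Score 1-5 for factual accuracy and completeness."

prompt = (
    "You are an impartial judge. Using the rubric, grade the answer on a 1-5 scale "
    "and end your feedback with 'Score: <n>'.\n"
    f"Rubric: {rubric}\nQuestion: {question}\nAnswer: {answer}\n"
)

out = judge(prompt, max_new_tokens=256, return_full_text=False)[0]["generated_text"]
print(out)  # feedback ending in e.g. "Score: 4"; parse the trailing integer as the grade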
No commits in the last 6 months.
Use this if you need a scalable and robust way to grade LLM or human responses, especially when you want to use custom scoring criteria or don't have a perfect reference answer available.
Not ideal if you are looking for a simple, out-of-the-box solution that doesn't require any technical setup or if you only need basic, qualitative feedback without numerical grading.
Stars: 52
Forks: 7
Language: Jupyter Notebook
License: —
Category: —
Last pushed: Jul 10, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/deshwalmahesh/PHUDGE"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
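A minimal Python equivalent of the curl call above, assuming the endpoint returns JSON; the fields mentioned in the comment are guesses, not a documented schema.

import requests

url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/deshwalmahesh/PHUDGE"
resp = requests.get(url, timeout=30)
resp.raise_for_status()
data = resp.json()
print(data)  # expected to include repo metadata such as stars, forks, and last-pushed date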
Higher-rated alternatives
EvolvingLMMs-Lab/lmms-eval
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
vibrantlabsai/ragas
Supercharge Your LLM Application Evaluations 🚀
open-compass/VLMEvalKit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
EuroEval/EuroEval
The robust European language model benchmark.
Giskard-AI/giskard-oss
🐢 Open-Source Evaluation & Testing library for LLM Agents