Contextualist/lone-arena
Self-hosted LLM chatbot arena, with yourself as the only judge
This tool helps you manually compare and evaluate responses from different fine-tuned language models. You supply your own prompts and model endpoints, and it presents pairs of responses for you to judge. It is designed for researchers and practitioners who need to assess LLM performance in specialized domains where automated benchmarks or third-party evaluations aren't suitable.
No commits in the last 6 months.
Use this if you need a confidential, customizable way to run human evaluation of multiple large language models on your own tasks and data.
Not ideal if you prefer fully automated benchmarking, or if your evaluation criteria are already covered by existing public benchmarks.
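To make the workflow concrete, here is a minimal sketch of the pairwise-judging loop described above, assuming two OpenAI-compatible chat endpoints. The endpoint URLs, model names, and prompts are placeholders; this illustrates the general idea, not lone-arena's actual code or configuration.

import random
import requests

# Hypothetical endpoints for two fine-tuned models (placeholders, not
# lone-arena's config format). Both are assumed to speak the
# OpenAI-compatible /v1/chat/completions protocol.
ENDPOINTS = {
    "model-a": "http://localhost:8001/v1/chat/completions",
    "model-b": "http://localhost:8002/v1/chat/completions",
}

PROMPTS = [
    "Summarize this clinical note in two sentences: ...",
    "Draft a polite reply declining the meeting request: ...",
]

def complete(url: str, prompt: str) -> str:
    """Request a single completion from an OpenAI-compatible endpoint."""
    resp = requests.post(
        url,
        json={
            "model": "local",
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

wins = {name: 0 for name in ENDPOINTS}
for prompt in PROMPTS:
    # Shuffle so the judge cannot tell which model produced which answer.
    order = random.sample(list(ENDPOINTS), k=2)
    answers = [complete(ENDPOINTS[name], prompt) for name in order]
    print(f"\nPROMPT: {prompt}")
    for label, answer in zip("12", answers):
        print(f"--- Response {label} ---\n{answer}")
    choice = input("Which is better? [1/2] ").strip()
    winner = order[0] if choice == "1" else order[1]
    wins[winner] += 1

print("Wins per model:", wins)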
Stars: 41
Forks: 5
Language: Python
License: MIT
Category: llm-tools
Last pushed: Feb 06, 2024
Commits (last 30 days): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/Contextualist/lone-arena"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
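If you'd rather call the endpoint from Python than curl, a short sketch follows. The response schema isn't documented here, so the fields hinted at in the comment (stars, forks, and so on) are assumptions based on the card above.

import requests

URL = (
    "https://pt-edge.onrender.com/api/v1/quality/"
    "llm-tools/Contextualist/lone-arena"
)

resp = requests.get(URL, timeout=10)
resp.raise_for_status()
data = resp.json()
# Likely mirrors the card above (stars, forks, license, last pushed,
# 30-day commits), but the exact field names are an assumption.
print(data)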
Higher-rated alternatives
- EvolvingLMMs-Lab/lmms-eval: One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
- vibrantlabsai/ragas: Supercharge Your LLM Application Evaluations 🚀
- open-compass/VLMEvalKit: Open-source evaluation toolkit for large multi-modality models (LMMs); supports 220+ LMMs and 80+ benchmarks
- EuroEval/EuroEval: The robust European language model benchmark.
- Giskard-AI/giskard-oss: 🐢 Open-Source Evaluation & Testing library for LLM Agents