kolenaIO/autoarena

Rank LLMs, RAG systems, and prompts using automated head-to-head evaluation

Score: 37 / 100 (Emerging)

This tool helps you objectively compare and rank the outputs of different Large Language Models (LLMs), RAG systems, or prompt designs. You feed in prompts and the corresponding responses from multiple models, and it automatically evaluates them head-to-head using AI judges. The outcome is a clear leaderboard showing which configuration performs best. This is ideal for AI engineers, data scientists, and product managers working on LLM-powered applications.

108 stars. No commits in the last 6 months.

Use this if you need to determine the best-performing LLM, RAG setup, or prompt by systematically comparing their outputs using automated judging.

Not ideal if you prefer manual human evaluation for every comparison, or if your primary need is to evaluate a single model in isolation rather than comparing multiple candidates against each other.

Tags: LLM-evaluation · prompt-engineering · RAG-system-optimization · AI-model-selection · chatbot-performance
Flags: Stale (6m) · No Package · No Dependents
Maintenance: 0 / 25
Adoption: 9 / 25
Maturity: 16 / 25
Community: 12 / 25


Stars: 108
Forks: 10
Language: TypeScript
License: Apache-2.0
Last pushed: Dec 16, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/rag/kolenaIO/autoarena"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
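The endpoint path above follows an owner/repo pattern, so the same call works for other repositories by swapping in a different name. A minimal sketch (the JSON shape of the response is not documented here, so this only builds the URL and fetches it):

```shell
# Build the API URL for any GitHub repo and fetch its quality report.
# OWNER/REPO are parameters; the URL pattern comes from the example above.
OWNER="kolenaIO"
REPO="autoarena"
URL="https://pt-edge.onrender.com/api/v1/quality/rag/${OWNER}/${REPO}"
echo "$URL"

# Fetch the report (JSON assumed; pipe through `jq .` to pretty-print):
# curl -s "$URL"
```

Within the free tier (100 requests/day without a key), the `curl -s "$URL"` line is all that is needed; the echo is only there to show the constructed URL.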