flashclub/ModelJudge
A multilingual AI model evaluation platform built with Next.js, supporting side-by-side comparison of multiple models with real-time streaming responses and a final consolidated judgment.
This platform helps AI developers and researchers evaluate different AI models: you enter a question, select up to three models to generate answers, and a fourth model rates the responses and produces a final answer. It is aimed at anyone working with AI models who needs to compare outputs and receive an objective judgment.
Use this if you need to quickly compare the responses of multiple AI models to a specific prompt and get a consolidated judgment.
Not ideal if you need to perform deep, statistical analysis of model performance or integrate evaluations into a larger automated pipeline.
Stars: 95
Forks: 5
Language: TypeScript
License: MIT
Category:
Last pushed: Dec 07, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/flashclub/ModelJudge"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
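The curl command above can also be wrapped in a small TypeScript helper, matching the repository's own language. This is a minimal sketch: the URL pattern comes from the endpoint shown above, but the response shape is not documented here, so the JSON is returned untyped.

```typescript
// Build the quality-API URL for a given owner/repo pair.
// The path pattern mirrors the curl example above.
function qualityUrl(owner: string, repo: string): string {
  return `https://pt-edge.onrender.com/api/v1/quality/llm-tools/${owner}/${repo}`;
}

// Fetch the quality data as untyped JSON (the response schema
// is an assumption; inspect it before relying on field names).
async function fetchQuality(owner: string, repo: string): Promise<unknown> {
  const res = await fetch(qualityUrl(owner, repo));
  if (!res.ok) {
    throw new Error(`Request failed: HTTP ${res.status}`);
  }
  return res.json();
}

// Usage:
// const data = await fetchQuality("flashclub", "ModelJudge");
```

Without a key this stays within the 100 requests/day anonymous limit; how a key is attached (header vs. query parameter) is not documented here, so it is omitted.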
Higher-rated alternatives
open-compass/opencompass
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral,...
IBM/unitxt
🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the...
lean-dojo/LeanDojo
Tool for data extraction and interacting with Lean programmatically.
GoodStartLabs/AI_Diplomacy
Frontier Models playing the board game Diplomacy.
google/litmus
Litmus is a comprehensive LLM testing and evaluation tool designed for GenAI Application...