Vvkmnn/awesome-ai-eval
☑️ A curated list of tools, methods & platforms for evaluating AI reliability in real applications.
This is a curated list of tools, methods, and platforms for verifying that AI systems such as large language models and autonomous agents behave reliably and do not produce undesirable outputs like hallucinations. The listed tools let you compare your AI's actual behavior against expected outcomes and established benchmarks. The resource is aimed at AI practitioners, machine learning engineers, and product managers who build, deploy, and maintain AI systems in real-world applications.
Use this if you need to thoroughly test, debug, and monitor the quality and reliability of your AI models and applications in production.
Not ideal if you are looking for general machine learning development resources that are not specifically focused on AI evaluation.
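To make "evaluation" concrete, here is a minimal sketch of the kind of check the listed tools automate: scoring a model's answers against expected outputs. This is only an illustration under assumed names; ask_model and the example cases are hypothetical placeholders, not part of any tool in this list.

# Minimal exact-match evaluation loop (illustrative sketch only).
def ask_model(prompt: str) -> str:
    # Hypothetical placeholder: replace with a call to your model or agent.
    return "4"

def evaluate(cases: list[tuple[str, str]]) -> float:
    """Return the fraction of cases where the model's answer matches the expected one."""
    passed = 0
    for prompt, expected in cases:
        answer = ask_model(prompt).strip().lower()
        if answer == expected.strip().lower():
            passed += 1
    return passed / len(cases) if cases else 0.0

# Hypothetical benchmark cases; real suites are far larger and often use
# semantic or model-graded scoring instead of exact string match.
cases = [("What is 2 + 2?", "4"), ("Capital of France?", "Paris")]
print(f"accuracy: {evaluate(cases):.0%}")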
Stars: 66
Forks: 7
Language: —
License: CC0-1.0
Category: —
Last pushed: Feb 12, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/Vvkmnn/awesome-ai-eval"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
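The same data can be fetched programmatically. Below is a minimal Python sketch using only the standard library; the response field name "stars" is an assumption for illustration, so inspect the returned JSON for the actual schema.

import json
import urllib.request

# Public endpoint from the curl example above; no API key required
# for up to 100 requests per day.
URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/Vvkmnn/awesome-ai-eval"

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.loads(resp.read().decode("utf-8"))

# "stars" is an assumed field name; print the full payload to see
# what the API actually returns.
print(data.get("stars"))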
Higher-rated alternatives
open-compass/opencompass
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral,...
IBM/unitxt
🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the...
lean-dojo/LeanDojo
Tool for data extraction and interacting with Lean programmatically.
GoodStartLabs/AI_Diplomacy
Frontier Models playing the board game Diplomacy.
google/litmus
Litmus is a comprehensive LLM testing and evaluation tool designed for GenAI Application...