Vvkmnn/awesome-ai-eval
☑️ A curated list of tools, methods & platforms for evaluating AI reliability in real applications.
This is a curated list of tools, methods, and platforms for verifying that AI systems such as large language models and autonomous agents behave reliably and do not produce undesirable outputs like hallucinations. The listed tools let you compare your AI's actual behavior against expected outcomes and established benchmarks. The resource is aimed at AI practitioners, machine learning engineers, and product managers who build, deploy, and maintain AI systems in real-world applications.
Use this if you need to thoroughly test, debug, and monitor the quality and reliability of your AI models and applications in production.
Not ideal if you are looking for general machine learning development resources that are not specifically focused on AI evaluation.
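To make "evaluation" concrete, here is a minimal sketch of the kind of check the listed tools automate: scoring a model's answers against expected outputs. This is only an illustration under assumed names; ask_model and the example cases are hypothetical placeholders, not part of any tool in this list.

# Minimal exact-match evaluation loop (illustrative sketch only).
def ask_model(prompt: str) -> str:
    # Hypothetical placeholder: replace with a call to your model or agent.
    return "4"

def evaluate(cases: list[tuple[str, str]]) -> float:
    """Return the fraction of cases where the model's answer matches the expected one."""
    passed = 0
    for prompt, expected in cases:
        answer = ask_model(prompt).strip().lower()
        if answer == expected.strip().lower():
            passed += 1
    return passed / len(cases) if cases else 0.0

# Hypothetical benchmark cases; real suites are far larger and often use
# semantic or model-graded scoring instead of exact string match.
cases = [("What is 2 + 2?", "4"), ("Capital of France?", "Paris")]
print(f"accuracy: {evaluate(cases):.0%}")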
Stars: 66
Forks: 7
Language: —
License: CC0-1.0
Category: —
Last pushed: Feb 12, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/Vvkmnn/awesome-ai-eval"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
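The same data can be fetched programmatically. Below is a minimal Python sketch using only the standard library; the response field name "stars" is an assumption for illustration, so inspect the returned JSON for the actual schema.

import json
import urllib.request

# Public endpoint from the curl example above; no API key required
# for up to 100 requests per day.
URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/Vvkmnn/awesome-ai-eval"

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.loads(resp.read().decode("utf-8"))

# "stars" is an assumed field name; print the full payload to see
# what the API actually returns.
print(data.get("stars"))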
Higher-rated alternatives
open-compass/opencompass
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral,...
IBM/unitxt
🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the...
lean-dojo/LeanDojo
Tool for data extraction and interacting with Lean programmatically.
GoodStartLabs/AI_Diplomacy
Frontier Models playing the board game Diplomacy.
google/litmus
Litmus is a comprehensive LLM testing and evaluation tool designed for GenAI Application...