LLM Evaluation Platforms Generative AI Tools

Tools for testing, evaluating, and monitoring LLM applications in production—including automated evaluation frameworks, A/B testing, observability, quality control, and performance tracking. Does NOT include general ML ops platforms, code generation tools, or domain-specific AI applications.

There are 119 LLM evaluation platform tools tracked. One scores above 70 (Verified tier). The highest-rated is openvinotoolkit/model_server at 71/100 with 836 stars. Two of the top 10 are actively maintained.

Get all 119 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=generative-ai&subcategory=llm-evaluation-platforms&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
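The same request can be made from Python. The following is a minimal sketch using only the standard library; the endpoint and the `domain`, `subcategory`, and `limit` query parameters are taken from the curl command above, while the function names `build_url` and `fetch_projects` are hypothetical. The shape of the JSON response is not documented here, so the sketch returns the parsed payload without assuming any field names.

```python
import json
import urllib.parse
import urllib.request

# Endpoint copied from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="generative-ai",
              subcategory="llm-evaluation-platforms",
              limit=20):
    """Assemble the request URL from the query parameters shown in the curl example."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Perform the GET request and return the decoded JSON payload as-is.

    No key is sent, which the page says is allowed for up to 100 requests/day.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (performs a network call, so it is left commented out):
# payload = fetch_projects(build_url())
# print(type(payload))
```

Whether the API accepts a `limit` large enough to return all 119 projects in one call is an assumption to verify against the service itself; the example on this page only shows `limit=20`.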

#	Tool	Score	Tier	Stars	Language
1	openvinotoolkit/model_server A scalable inference server for models optimized with OpenVINO™	71	Verified	836	C++
2	madroidmaq/mlx-omni-server MLX Omni Server is a local inference server powered by Apple's MLX...	64	Established	678	Python
3	NVIDIA-NeMo/Guardrails NeMo Guardrails is an open-source toolkit for easily adding programmable...	63	Established	5,772	Python
4	generative-computing/mellea Mellea is a library for writing generative programs.	58	Established	341	Python
5	rhesis-ai/rhesis Open-source platform & SDK for testing LLM and agentic apps. Define expected...	58	Established	296	Python
6	taco-group/OpenEMMA OpenEMMA, a permissively licensed open source "reproduction" of Waymo’s EMMA model.	57	Established	906	Python
7	cncf/llm-starter-pack 🤖 Get started with LLMs on your kind cluster, today!	53	Established	172	Python
8	modular/max-agentic-cookbook MAX Agentic Cookbook	49	Emerging	74	HTML
9	cuckoo-network/cuckoo Cuckoo is a Decentralized AI Model-Serving Platform, starting with...	48	Emerging	407	TypeScript
10	hichipli/vetting-python A Python implementation of the VETTING (Verification and Evaluation Tool for...	47	Emerging	10	Python
11	aws-samples/foundation-model-benchmarking-tool Foundation model benchmarking tool. Run any model on any AWS platform and...	46	Emerging	255	Jupyter Notebook
12	AMDResearch/intelliperf Automated bottleneck detection and solution orchestration	44	Emerging	19	Python
13	clearml/clearml-fractional-gpu ClearML Fractional GPU - Run multiple containers on the same GPU with driver...	44	Emerging	90	—
14	amazon-science/fmcore Running Foundation Models at every scale, on every modality. Includes...	43	Emerging	6	Python
15	sandner-art/ArtAgents Framework for LLM based captioning and prompt engineering	42	Emerging	14	Python
16	aimonlabs/aimon-python-sdk This repo hosts the Python SDK and related examples for AIMon, which is a...	42	Emerging	11	Python
17	Aaryanverma/trustifai TrustifAI: A Comprehensive Framework for AI Trustworthiness	42	Emerging	10	Python
18	jordanvolz/lolpop A software engineering framework to jump start your machine learning projects	42	Emerging	37	Python
19	vienneraphael/batchling Save 50% off GenAI costs in two lines of code	40	Emerging	17	Python
20	maximhq/maxim-cookbooks Maxim is an end-to-end AI evaluation and observability platform that...	40	Emerging	13	Jupyter Notebook
21	yankeexe/ollama-manager 🦙 Manage Ollama models from your CLI!	38	Emerging	16	Python
22	svilupp/Julia-LLM-Leaderboard Provides a platform for the Julia community to compare AI models' abilities...	38	Emerging	86	HTML
23	kstathou/llm-stack End-to-end tech stack for the LLM data flywheel	38	Emerging	3	Python
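24	soundstarrain/LLM-Filter-Probe A precise reverse-analysis tool for LLM input-side content filtering. Automatically locates NewAPI, OneAPI, and any API that applies dictionary-rule-based prompt filtering...	38	Emerging	3	Python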
25	autonomi-ai/nos ⚡️ A fast and flexible PyTorch inference server that runs locally, on any...	38	Emerging	147	Python
26	Finoptimize/agentaflow-sro-community Manage AI and Machine Learning workloads more efficiently with lower cost: ...	37	Emerging	2	Go
27	amazon-science/concurry Easy scaling for AI research and production workloads	37	Emerging	14	Python
28	sMiNT0S/AIBugBench From prompt to paste: evaluate AI / LLM output under a strict Python sandbox...	34	Emerging	1	Python
29	retkowsky/foundry-local Foundry Local is an on-device AI inference solution that you use to run AI...	33	Emerging	9	Jupyter Notebook
30	unit-mesh/devops-genius DevOpsGenius aims to reshape DevOps practices in software development by combining them with LLMs. Treating the LLM as the team's junior...	33	Emerging	92	Kotlin
31	llm-platform-security/gpt-data-exposure An In-Depth Investigation of Data Collection in LLM App Ecosystems	33	Emerging	3	Python
32	rpjayaraman/LLMxVLSI Generate, Simulate & Summarize Verilog Code with GenAI and Iverilog tool	32	Emerging	5	Python
33	Generative-Engine-Marketing/GEM-Bench First comprehensive benchmark for Generative Engine Marketing (GEM), an...	32	Emerging	15	Python
34	LLMConsent/llmconsent-standards LLMConsent is an open protocol that establishes standards for managing...	31	Emerging	2	—
35	hiamitabha/genai-bench Code to benchmark APIs available from LLM vendors and demonstrate how they work	31	Emerging	4	Python
36	djokester/groqeval Use groq for evaluations	31	Emerging	3	Python
37	fmind/mlops-digester A tool equipping Pydantic AI agents with the ability to digest and summarize...	31	Emerging	4	Python
38	iservicebus/lmaas LMaaS (Language Model as a Service) abstracts away complexities and enables...	30	Emerging	2	Python
39	nginH/llmforge One API, every AI model, instant switching. Change from GPT-4 to Gemini to...	30	Emerging	6	TypeScript
40	SAP-samples/llm-round-trip-correctness This repo provides code for evaluation of llm round-trip-correctness on text...	30	Emerging	6	Jupyter Notebook
41	verma-kunal/k8sGPT-tutorial This repo is dedicated for the K8sGPT tutorial on Kubesimplify's YT channel.	29	Experimental	1	—
42	evalops/eval2otel Library to convert AI evaluation results to OpenTelemetry GenAI semantic...	29	Experimental	3	TypeScript
43	danilop/llm-test-mate A simple testing framework to evaluate and validate LLM-generated content...	28	Experimental	10	Python
44	maharshijani05/CivicMind CivicMind is an AI-powered civic policy simulator where intelligent agents...	28	Experimental	3	Python
45	nyno-ai/nynoflow Production grade framework for LLM application development	27	Experimental	2	Python
46	robocorp/llmfoo Code with the flow of a river, refactor with the grace of a breeze, and...	27	Experimental	14	Python
47	demml/potatohead 🥔 Quality control for your potato heads (LLMs)	27	Experimental	1	Rust
48	Portkey-AI/helm-chart Kubernetes Configs for Portkey Gateway deployment	27	Experimental	3	Smarty
49	Yapakayala/cloudops-ai-monitor 🔍 Monitor cloud environments with AI-driven insights, anomaly detection, and...	26	Experimental	1	Python
50	noct-ml/noesis Noesis - A lightweight toolkit for inspecting transformer internals through...	26	Experimental	1	Python
51	paralleliq/piqc-knowledge-base Production-ready checklists and frameworks for deploying LLMs, GenAI models,...	25	Experimental	2	—
52	hipvlady/subzero Project SubZero: Zero Trust AI Gateway (ZTAG)	24	Experimental	3	Python
53	Tradunsky/3D-guardrails 3D content you can trust	24	Experimental	1	Python
54	AdityaPatange1/okesa Okesa: LLM-powered Natural Language Processing! 💬	24	Experimental	1	TypeScript
55	Ashik245-commits/LLM-Filter-Probe 🕵️‍♂️ Analyze and reverse engineer keyword filtering in large language models...	24	Experimental	1	Python
56	sugihAF/DomainBench LLM Benchmark and Comparison on Domain Specific Implementation	24	Experimental	1	Python
57	radlab-dev-group/llm-router-plugins A companion repository for llm-router containing a collection of...	24	Experimental	1	Python
58	krish567366/automl_self_improvement A next-gen toolkit for autonomous machine learning that automatically...	23	Experimental	1	Python
59	ozanunal0/Prometheus-Gateway An open-source, security-first LLM Gateway designed to provide a unified,...	22	Experimental	10	Python
60	josephlash10-svg/Glass-Box A Python-based framework for managing LLM drift and preventing model...	22	Experimental	1	Python
61	last9/python-ai-sdk OpenTelemetry extension for LLM observability - track conversations,...	22	Experimental	1	Python
62	valohai/valohai-llm Track and report LLM and GenAI evaluations to Valohai LLM	22	Experimental	1	Python
63	leaxer-ai/leaxer An engine for local AI inference, built on Elixir and the BEAM virtual machine.	22	Experimental	1	Elixir
64	SangiSI/llm-model-selection-lab Decision-centric evaluation lab for intelligent LLM model selection using...	22	Experimental	1	Python
65	eneagizzarelli/SYNAPSE SYNAPSE (SYNthetic AI Pot for Security Enhancement) and SYNAPSE-to-MITRE...	22	Experimental	16	Python
66	Mrdodo446/ModelForge Build and customize machine learning models efficiently with an open-source...	22	Experimental	—	TypeScript
67	mauryasameer/llm_eval SR 11-7 & EU AI Act compliant LLM validation framework for financial...	22	Experimental	—	Python
68	svilupp/Logfire.jl Observability for Julia LLM applications. Know what your AI is doing.	21	Experimental	2	Julia
69	hari7261/indus-llm-gateway Production-ready LLM gateway — unified OpenAI-compatible API for all...	21	Experimental	—	Go
70	adityonugrohoid/ollama-runtime Shared Ollama LLM runtime for the GenAI Portfolio Suite. GPU-accelerated...	21	Experimental	—	Python
71	korkridake/GenAIOps-OSS A unified handbook for building, deploying and understanding LLM agents and...	21	Experimental	—	Python
72	mkhomutskyi/illama Ollama-like LLM experience for Intel Arc GPUs (B50/A770/A750) using...	21	Experimental	—	Python
73	ravikirankrishnaprasad/multi-agent-hallucination-detection-and-correction Multi-agent framework for hallucination detection and correction in LLM...	21	Experimental	—	Python
74	umbertocicciaa/devopsfix Fix cicd pipeline using generative AI	21	Experimental	—	TypeScript
75	Lavaver/OpenVINO-GenAI-Toolkit This repository provides a post-installation utility suite for OpenVINO,...	21	Experimental	—	Vue
76	budgetguard-ai/budgetguard-core A FinOps control plane for AI APIs - Drop-in API gateway that enforces hard...	20	Experimental	4	TypeScript
77	Shyam-Sundar-Raju/Consensus CONSENSUS — A learning-aware generative AI system using a multi-agent LLM...	20	Experimental	1	JavaScript
78	cwest/ai-tokentrace ai-tokentrace is a Python library for GenAI cost observability. It helps...	20	Experimental	1	Python
79	BabarAli93/GAIKube [TCCN 24] GAIKube: Generative AI-based Proactive Kubernetes Container...	20	Experimental	2	Jupyter Notebook
80	infinitum-nihil/otel-genai-safety-semconv Proposed OpenTelemetry semantic conventions for GenAI safety system telemetry	19	Experimental	—	—
81	svilupp/Spehulak.jl GenAI observability application in Julia	19	Experimental	3	Julia
82	bignacio/llama.up Provision your own LLMA backend on a public cloud provider	19	Experimental	3	HCL
83	RenaudGaudron/oeis-sequences-benchmark A Python toolkit and benchmark dataset for predicting the next term in OEIS...	18	Experimental	1	Python
84	RenaudGaudron/MMLU_benchmark An easy-to-use and standardised framework for evaluating Large Language...	18	Experimental	1	Python
85	ayush585/hallucination-detector Developed as part of IEM HackOsis 2.0 under Problem Statement HOGN02. Team...	18	Experimental	1	Python
86	vlimkv/ai-project-tracker Full-stack AI Project Manager with Self-Hosted LLM (llama.cpp). Generates...	18	Experimental	1	Python
87	traversaal-ai/DSBC-Data-Science-Task-Evaluation Benchmark and evaluate LLMs on data science code generation using the DSBC dataset.	18	Experimental	3	Jupyter Notebook
88	witchnya/easykubeai easy kubeai	18	Experimental	1	Python
89	glzbcrt/llm-tools-on-demand Use semantic queries to find relevant tools for LLM use.	17	Experimental	1	C#
90	samuli/rgltr Tool Governance for Pydantic AI Agents	17	Experimental	—	Python
91	devopscodegen/devopscodegen-common Common python modules for all devops code generators like pipeline code...	17	Experimental	1	Python
92	sharonccccc/AIFE_GEN-MLOps_Platform AI capability development platform using AutoML and AutoGluon	17	Experimental	7	Jupyter Notebook
93	sezer-muhammed/GenAIJury Framework for multi-agent LLM systems to evaluate, critique, and improve...	17	Experimental	1	Python
94	oliverweissl/SMOO A testing framework for ML systems	15	Experimental	—	Python
95	dileepkreddy5/secure-llm-gateway Production-grade AI security middleware with async micro-batching, prompt...	14	Experimental	1	Python
96	rupeshtiwari/pluralsight-reliability-slos-incident-management-gen-ai-systems Source code, demos, and supporting assets for a Pluralsight course on...	14	Experimental	1	Python
97	Dineshkumar0705/atlas-ai-observability Full-stack AI Trust & Observability Platform for LLM-based Systems (FastAPI...	14	Experimental	1	Python
98	tmam-dev/tmam tmam is an open-source observability platform that gives you deep, real-time...	14	Experimental	1	TypeScript
99	meyumer55/enterprise-foundational-model-scaler A high-level framework for fine-tuning and deploying foundational models...	14	Experimental	—	Python
100	kiquetal/course-zero-trust-fundamentals O'Reilly Live Course: Zero Trust Security Fundamentals — covering Zero Trust...	14	Experimental	—	—
101	Naresh1401/LLM-safety-guardrails Production-ready LLM safety layer: prompt injection detection, PII...	14	Experimental	—	Python
102	GauJosh/devops-genai Production-style GenAI platform lab for CI/CD failure analysis, including...	14	Experimental	—	Python
103	cathy841106/ai-hallucination-detect A tool for detecting hallucinations in domain-specific LLM outputs. It...	13	Experimental	—	Python
104	balavenkatesh3322/guardrails-demo LLM Security Project with Llama Guard	13	Experimental	10	Python
105	th3w1zard1/llm_fallbacks Aggregates, sorts, and organizes various GenAI LLM providers into...	13	Experimental	—	Python
106	sanika373/llm-data-quality-monitor Automated data quality monitoring using LLM (GPT-4o) to generate SQL checks...	13	Experimental	—	Python
107	alexei-led/cloud-inspector EXPERIMENT: Cloud Inspector identifies cloud resources based on user...	13	Experimental	—	Python
108	parthamehta123/cloudops-ai-monitor AI-powered CloudOps monitoring system — anomaly detection with PyTorch,...	13	Experimental	—	Python
109	nehamaheshh/LLM-Drift-Monitor Production-style LLM drift monitoring: semantic, structural, safety, and...	13	Experimental	—	Python
110	sachs7/guardrails_playground A Hugging Face challenge to break the hidden models into giving up sensitive...	11	Experimental	—	—
111	CodeWithPraveen/ps-genai-hallucinations Course demos for identifying, mitigating, and preventing hallucinations in...	11	Experimental	—	Python
112	adumrewal/llm-api-gateway Gateway to control LLM API/SDK calls. Supports access to OpenAI, Azure,...	11	Experimental	4	Python
113	Brandon7CC/MODELFORGE Evaluate hosted OpenAI GPT / Google PaLM2/Gemini or local Ollama models...	11	Experimental	—	Python
114	bolticio/automl-templates This repository contains a collection of Automated Machine Learning (AutoML)...	11	Experimental	—	Jupyter Notebook
115	AlexRaudvee/CODEGEN-X-Evaluating-AI-for-Code-Completion. Benchmarking of the Code Completion models	11	Experimental	—	Python
116	billebel/splunk-community-ai A secure, governable AI gateway for Splunk with operational guardrails. An...	11	Experimental	2	Python
117	lalitkpal/VerifyAI VerifyAI is a simple UI application to test GenAI outputs	11	Experimental	2	Python
118	akhilreddy0703/ASRInferenceEngine This is a FastAPI-based server that acts as an interface between your...	10	Experimental	2	Python
119	MilosKosRadGit/ClozeTaskEvaluation This project evaluates Llama 3.2 3B continued pre-training for Serbian...	10	Experimental	1	Python
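
The Tier column appears to follow from the Score column. The sketch below shows one mapping consistent with every row in the table above; the cutoffs are inferred from the listed rows, not taken from a published scoring rule. Only "above 70" for Verified is stated on this page. The Established cutoff of 50 is an assumption (the table shows 53 as Established and 49 as Emerging), and the Emerging cutoff of 30 matches the rows at 30 (Emerging) and 29 (Experimental). The function name `tier_for` is hypothetical.

```python
def tier_for(score):
    """Map a 0-100 score to a tier label using cutoffs inferred from the table."""
    if score > 70:       # stated on the page: "above 70 (Verified tier)"
        return "Verified"
    if score >= 50:      # assumed cutoff; the real boundary lies between 50 and 53
        return "Established"
    if score >= 30:      # observed: 30 is Emerging, 29 is Experimental
        return "Emerging"
    return "Experimental"


# A few (score, tier) pairs copied from rows in the table.
rows = [(71, "Verified"), (64, "Established"), (53, "Established"),
        (49, "Emerging"), (30, "Emerging"), (29, "Experimental"),
        (10, "Experimental")]

for score, expected in rows:
    assert tier_for(score) == expected
```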