LLM Evaluation Platforms Generative AI Tools

Tools for testing, evaluating, and monitoring LLM applications in production, including automated evaluation frameworks, A/B testing, observability, quality control, and performance tracking. Does NOT include general MLOps platforms, code generation tools, or domain-specific AI applications.

119 LLM evaluation platform tools are tracked. One scores above 70 (Verified tier). The highest-rated is openvinotoolkit/model_server at 71/100 with 836 stars. Two of the top 10 are actively maintained.

Get all 119 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=generative-ai&subcategory=llm-evaluation-platforms&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
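For scripted access, the curl request above can be sketched in Python using only the standard library. The helper names `build_url` and `fetch_projects` are hypothetical, the shape of the JSON response is not documented here so the parsed body is returned unchanged, and how an API key is supplied is not stated, so this sketch covers only the keyless tier.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="generative-ai",
              subcategory="llm-evaluation-platforms",
              limit=20):
    """Return the dataset URL with the same query parameters as the curl command."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(limit=20, timeout=30):
    """GET the endpoint and return the parsed JSON body as-is.

    Assumption: the endpoint answers a plain GET with a JSON document.
    No response fields are assumed here.
    """
    with urllib.request.urlopen(build_url(limit=limit), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_projects()` performs one request, which counts against the 100 requests/day keyless limit, so caching the result locally is likely worthwhile.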

# Tool Score Tier
1 openvinotoolkit/model_server

A scalable inference server for models optimized with OpenVINO™

71
Verified
2 madroidmaq/mlx-omni-server

MLX Omni Server is a local inference server powered by Apple's MLX...

64
Established
3 NVIDIA-NeMo/Guardrails

NeMo Guardrails is an open-source toolkit for easily adding programmable...

63
Established
4 generative-computing/mellea

Mellea is a library for writing generative programs.

58
Established
5 rhesis-ai/rhesis

Open-source platform & SDK for testing LLM and agentic apps. Define expected...

58
Established
6 taco-group/OpenEMMA

OpenEMMA, a permissively licensed open source "reproduction" of Waymo’s EMMA model.

57
Established
7 cncf/llm-starter-pack

🤖 Get started with LLMs on your kind cluster, today!

53
Established
8 modular/max-agentic-cookbook

MAX Agentic Cookbook

49
Emerging
9 cuckoo-network/cuckoo

Cuckoo is a Decentralized AI Model-Serving Platform, starting with...

48
Emerging
10 hichipli/vetting-python

A Python implementation of the VETTING (Verification and Evaluation Tool for...

47
Emerging
11 aws-samples/foundation-model-benchmarking-tool

Foundation model benchmarking tool. Run any model on any AWS platform and...

46
Emerging
12 AMDResearch/intelliperf

Automated bottleneck detection and solution orchestration

44
Emerging
13 clearml/clearml-fractional-gpu

ClearML Fractional GPU - Run multiple containers on the same GPU with driver...

44
Emerging
14 amazon-science/fmcore

Running Foundation Models at every scale, on every modality. Includes...

43
Emerging
15 sandner-art/ArtAgents

Framework for LLM based captioning and prompt engineering

42
Emerging
16 aimonlabs/aimon-python-sdk

This repo hosts the Python SDK and related examples for AIMon, which is a...

42
Emerging
17 Aaryanverma/trustifai

TrustifAI: A Comprehensive Framework for AI Trustworthiness

42
Emerging
18 jordanvolz/lolpop

A software engineering framework to jump start your machine learning projects

42
Emerging
19 vienneraphael/batchling

Save 50% off GenAI costs in two lines of code

40
Emerging
20 maximhq/maxim-cookbooks

Maxim is an end-to-end AI evaluation and observability platform that...

40
Emerging
21 yankeexe/ollama-manager

🦙 Manage Ollama models from your CLI!

38
Emerging
22 svilupp/Julia-LLM-Leaderboard

Provides a platform for the Julia community to compare AI models' abilities...

38
Emerging
23 kstathou/llm-stack

End-to-end tech stack for the LLM data flywheel

38
Emerging
24 soundstarrain/LLM-Filter-Probe

A precise reverse-analysis tool for LLM input-side censorship. Automatically locates NewAPI, OneAPI, and any API that applies dictionary-rule-based prompt filtering...

38
Emerging
25 autonomi-ai/nos

⚡️ A fast and flexible PyTorch inference server that runs locally, on any...

38
Emerging
26 Finoptimize/agentaflow-sro-community

Manage AI and Machine Learning workloads more efficiently with lower cost: ...

37
Emerging
27 amazon-science/concurry

Easy scaling for AI research and production workloads

37
Emerging
28 sMiNT0S/AIBugBench

From prompt to paste: evaluate AI / LLM output under a strict Python sandbox...

34
Emerging
29 retkowsky/foundry-local

Foundry Local is an on-device AI inference solution that you use to run AI...

33
Emerging
30 unit-mesh/devops-genius

DevOpsGenius aims to combine LLMs to reshape DevOps practices in software development. Treats the LLM as the team's junior...

33
Emerging
31 llm-platform-security/gpt-data-exposure

An In-Depth Investigation of Data Collection in LLM App Ecosystems

33
Emerging
32 rpjayaraman/LLMxVLSI

Generate, Simulate & Summarize Verilog Code with GenAI and Iverilog tool

32
Emerging
33 Generative-Engine-Marketing/GEM-Bench

First comprehensive benchmark for Generative Engine Marketing (GEM), an...

32
Emerging
34 LLMConsent/llmconsent-standards

LLMConsent is an open protocol that establishes standards for managing...

31
Emerging
35 hiamitabha/genai-bench

Code to benchmark APIs available from LLM vendors and demonstrate how they work

31
Emerging
36 djokester/groqeval

Use groq for evaluations

31
Emerging
37 fmind/mlops-digester

A tool equipping Pydantic AI agents with the ability to digest and summarize...

31
Emerging
38 iservicebus/lmaas

LMaaS (Language Model as a Service) abstracts away complexities and enables...

30
Emerging
39 nginH/llmforge

One API, every AI model, instant switching. Change from GPT-4 to Gemini to...

30
Emerging
40 SAP-samples/llm-round-trip-correctness

This repo provides code for evaluation of llm round-trip-correctness on text...

30
Emerging
41 verma-kunal/k8sGPT-tutorial

This repo is dedicated for the K8sGPT tutorial on Kubesimplify's YT channel.

29
Experimental
42 evalops/eval2otel

Library to convert AI evaluation results to OpenTelemetry GenAI semantic...

29
Experimental
43 danilop/llm-test-mate

A simple testing framework to evaluate and validate LLM-generated content...

28
Experimental
44 maharshijani05/CivicMind

CivicMind is an AI-powered civic policy simulator where intelligent agents...

28
Experimental
45 nyno-ai/nynoflow

Production grade framework for LLM application development

27
Experimental
46 robocorp/llmfoo

Code with the flow of a river, refactor with the grace of a breeze, and...

27
Experimental
47 demml/potatohead

🥔 Quality control for your potato heads (LLMs)

27
Experimental
48 Portkey-AI/helm-chart

Kubernetes Configs for Portkey Gateway deployment

27
Experimental
49 Yapakayala/cloudops-ai-monitor

🔍 Monitor cloud environments with AI-driven insights, anomaly detection, and...

26
Experimental
50 noct-ml/noesis

Noesis - A lightweight toolkit for inspecting transformer internals through...

26
Experimental
51 paralleliq/piqc-knowledge-base

Production-ready checklists and frameworks for deploying LLMs, GenAI models,...

25
Experimental
52 hipvlady/subzero

Project SubZero: Zero Trust AI Gateway (ZTAG)

24
Experimental
53 Tradunsky/3D-guardrails

3D content you can trust

24
Experimental
54 AdityaPatange1/okesa

Okesa: LLM-powered Natural Language Processing! 💬

24
Experimental
55 Ashik245-commits/LLM-Filter-Probe

🕵️♂️ Analyze and reverse engineer keyword filtering in large language models...

24
Experimental
56 sugihAF/DomainBench

LLM Benchmark and Comparison on Domain Specific Implementation

24
Experimental
57 radlab-dev-group/llm-router-plugins

A companion repository for llm-router containing a collection of...

24
Experimental
58 krish567366/automl_self_improvement

A next-gen toolkit for autonomous machine learning that automatically...

23
Experimental
59 ozanunal0/Prometheus-Gateway

An open-source, security-first LLM Gateway designed to provide a unified,...

22
Experimental
60 josephlash10-svg/Glass-Box

A Python-based framework for managing LLM drift and preventing model...

22
Experimental
61 last9/python-ai-sdk

OpenTelemetry extension for LLM observability - track conversations,...

22
Experimental
62 valohai/valohai-llm

Track and report LLM and GenAI evaluations to Valohai LLM

22
Experimental
63 leaxer-ai/leaxer

An engine for local AI inference, built on Elixir and the BEAM virtual machine.

22
Experimental
64 SangiSI/llm-model-selection-lab

Decision-centric evaluation lab for intelligent LLM model selection using...

22
Experimental
65 eneagizzarelli/SYNAPSE

SYNAPSE (SYNthetic AI Pot for Security Enhancement) and SYNAPSE-to-MITRE...

22
Experimental
66 Mrdodo446/ModelForge

Build and customize machine learning models efficiently with an open-source...

22
Experimental
67 mauryasameer/llm_eval

SR 11-7 & EU AI Act compliant LLM validation framework for financial...

22
Experimental
68 svilupp/Logfire.jl

Observability for Julia LLM applications. Know what your AI is doing.

21
Experimental
69 hari7261/indus-llm-gateway

Production-ready LLM gateway — unified OpenAI-compatible API for all...

21
Experimental
70 adityonugrohoid/ollama-runtime

Shared Ollama LLM runtime for the GenAI Portfolio Suite. GPU-accelerated...

21
Experimental
71 korkridake/GenAIOps-OSS

A unified handbook for building, deploying and understanding LLM agents and...

21
Experimental
72 mkhomutskyi/illama

Ollama-like LLM experience for Intel Arc GPUs (B50/A770/A750) using...

21
Experimental
73 ravikirankrishnaprasad/multi-agent-hallucination-detection-and-correction

Multi-agent framework for hallucination detection and correction in LLM...

21
Experimental
74 umbertocicciaa/devopsfix

Fix cicd pipeline using generative AI

21
Experimental
75 Lavaver/OpenVINO-GenAI-Toolkit

This repository provides a post-installation utility suite for OpenVINO,...

21
Experimental
76 budgetguard-ai/budgetguard-core

A FinOps control plane for AI APIs - Drop-in API gateway that enforces hard...

20
Experimental
77 Shyam-Sundar-Raju/Consensus

CONSENSUS — A learning-aware generative AI system using a multi-agent LLM...

20
Experimental
78 cwest/ai-tokentrace

ai-tokentrace is a Python library for GenAI cost observability. It helps...

20
Experimental
79 BabarAli93/GAIKube

[TCCN 24] GAIKube: Generative AI-based Proactive Kubernetes Container...

20
Experimental
80 infinitum-nihil/otel-genai-safety-semconv

Proposed OpenTelemetry semantic conventions for GenAI safety system telemetry

19
Experimental
81 svilupp/Spehulak.jl

GenAI observability application in Julia

19
Experimental
82 bignacio/llama.up

Provision your own LLMA backend on a public cloud provider

19
Experimental
83 RenaudGaudron/oeis-sequences-benchmark

A Python toolkit and benchmark dataset for predicting the next term in OEIS...

18
Experimental
84 RenaudGaudron/MMLU_benchmark

An easy-to-use and standardised framework for evaluating Large Language...

18
Experimental
85 ayush585/hallucination-detector

Developed as part of IEM HackOsis 2.0 under Problem Statement HOGN02. Team...

18
Experimental
86 vlimkv/ai-project-tracker

Full-stack AI Project Manager with Self-Hosted LLM (llama.cpp). Generates...

18
Experimental
87 traversaal-ai/DSBC-Data-Science-Task-Evaluation

Benchmark and evaluate LLMs on data science code generation using the DSBC dataset.

18
Experimental
88 witchnya/easykubeai

easy kubeai

18
Experimental
89 glzbcrt/llm-tools-on-demand

Use semantic queries to find relevant tools for LLM use.

17
Experimental
90 samuli/rgltr

Tool Governance for Pydantic AI Agents

17
Experimental
91 devopscodegen/devopscodegen-common

Common python modules for all devops code generators like pipeline code...

17
Experimental
92 sharonccccc/AIFE_GEN-MLOps_Platform

AI capability development platform using AutoML and AutoGluon

17
Experimental
93 sezer-muhammed/GenAIJury

Framework for multi-agent LLM systems to evaluate, critique, and improve...

17
Experimental
94 oliverweissl/SMOO

A testing framework for ML systems

15
Experimental
95 dileepkreddy5/secure-llm-gateway

Production-grade AI security middleware with async micro-batching, prompt...

14
Experimental
96 rupeshtiwari/pluralsight-reliability-slos-incident-management-gen-ai-systems

Source code, demos, and supporting assets for a Pluralsight course on...

14
Experimental
97 Dineshkumar0705/atlas-ai-observability

Full-stack AI Trust & Observability Platform for LLM-based Systems (FastAPI...

14
Experimental
98 tmam-dev/tmam

tmam is an open-source observability platform that gives you deep, real-time...

14
Experimental
99 meyumer55/enterprise-foundational-model-scaler

A high-level framework for fine-tuning and deploying foundational models...

14
Experimental
100 kiquetal/course-zero-trust-fundamentals

O'Reilly Live Course: Zero Trust Security Fundamentals — covering Zero Trust...

14
Experimental
101 Naresh1401/LLM-safety-guardrails

Production-ready LLM safety layer: prompt injection detection, PII...

14
Experimental
102 GauJosh/devops-genai

Production-style GenAI platform lab for CI/CD failure analysis, including...

14
Experimental
103 cathy841106/ai-hallucination-detect

A tool for detecting hallucinations in domain-specific LLM outputs. It...

13
Experimental
104 balavenkatesh3322/guardrails-demo

LLM Security Project with Llama Guard

13
Experimental
105 th3w1zard1/llm_fallbacks

Aggregates, sorts, and organizes various GenAI LLM providers into...

13
Experimental
106 sanika373/llm-data-quality-monitor

Automated data quality monitoring using LLM (GPT-4o) to generate SQL checks...

13
Experimental
107 alexei-led/cloud-inspector

EXPERIMENT: Cloud Inspector identifies cloud resources based on user...

13
Experimental
108 parthamehta123/cloudops-ai-monitor

AI-powered CloudOps monitoring system — anomaly detection with PyTorch,...

13
Experimental
109 nehamaheshh/LLM-Drift-Monitor

Production-style LLM drift monitoring: semantic, structural, safety, and...

13
Experimental
110 sachs7/guardrails_playground

A Hugging Face challenge to break the hidden models into giving up sensitive...

11
Experimental
111 CodeWithPraveen/ps-genai-hallucinations

Course demos for identifying, mitigating, and preventing hallucinations in...

11
Experimental
112 adumrewal/llm-api-gateway

Gateway to control LLM API/SDK calls. Supports access to OpenAI, Azure,...

11
Experimental
113 Brandon7CC/MODELFORGE

Evaluate hosted OpenAI GPT / Google PaLM2/Gemini or local Ollama models...

11
Experimental
114 bolticio/automl-templates

This repository contains a collection of Automated Machine Learning (AutoML)...

11
Experimental
115 AlexRaudvee/CODEGEN-X-Evaluating-AI-for-Code-Completion.

Benchmarking of the Code Completion models

11
Experimental
116 billebel/splunk-community-ai

A secure, governable AI gateway for Splunk with operational guardrails. An...

11
Experimental
117 lalitkpal/VerifyAI

VerifyAI is a simple UI application to test GenAI outputs

11
Experimental
118 akhilreddy0703/ASRInferenceEngine

This is a FastAPI-based server that acts as an interface between your...

10
Experimental
119 MilosKosRadGit/ClozeTaskEvaluation

This project evaluates Llama 3.2 3B continued pre-training for Serbian...

10
Experimental
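The relationship between score and tier in the listing above can be sketched as a small lookup. Only the "above 70" rule for the Verified tier is stated on this page. The cut-offs of 50 and 30 are assumptions inferred from the listed rows (53 is Established, 49 is Emerging, 30 is Emerging, 29 is Experimental), so the real boundaries may differ, and `tier_for` is a hypothetical helper name.

```python
def tier_for(score: int) -> str:
    """Map a 0-100 quality score to a tier label.

    Stated by the page: scores above 70 are Verified.
    Assumed from the listed rows: Established starts at 50,
    Emerging starts at 30, everything lower is Experimental.
    """
    if score > 70:
        return "Verified"
    if score >= 50:   # assumed boundary (listing shows 53 vs 49)
        return "Established"
    if score >= 30:   # assumed boundary (listing shows 30 vs 29)
        return "Emerging"
    return "Experimental"


# Spot checks against rows that appear in the listing.
for score in (71, 64, 49, 29):
    print(score, tier_for(score))
```

Under these assumptions a score of exactly 70 falls in Established, which follows the page's "above 70" wording but is not confirmed by any listed row.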