LLM Observability Platforms: Prompt Engineering Tools

Tools for monitoring, tracing, evaluating, and debugging LLM applications in production. Includes end-to-end observability, real-time metrics, automated evals, and prompt management dashboards. Does NOT include general application monitoring, synthetic data generation, or agent training frameworks.

There are 27 LLM observability platform tools tracked. Three score above 70 (the Verified tier). The highest-rated is langfuse/langfuse at 82/100 with 23,106 stars. Six of the top 10 are actively maintained.

Get all 27 projects as JSON:

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=prompt-engineering&subcategory=llm-observability-platforms&limit=27"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
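A minimal sketch of calling the endpoint from Python and filtering the results by score. The endpoint URL and query parameters come from the curl example above; the response field names (`projects`, `name`, `score`) are assumptions about the payload shape, not documented API fields, so adjust them to whatever the service actually returns.

```python
import json
import urllib.request

# Endpoint from the curl example above; no API key needed up to 100 requests/day.
API_URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=prompt-engineering&subcategory=llm-observability-platforms&limit=27"
)


def fetch_dataset(url: str = API_URL) -> dict:
    """Fetch the dataset as JSON."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def verified_tools(payload: dict, threshold: int = 70) -> list[str]:
    """Return names of tools scoring above the given threshold.

    Assumes a payload shaped like {"projects": [{"name": ..., "score": ...}]}
    -- this shape is a guess; inspect the real response and adapt the keys.
    """
    return [
        p["name"]
        for p in payload.get("projects", [])
        if p.get("score", 0) > threshold
    ]


# Example with a stubbed payload (the real response may differ):
sample = {
    "projects": [
        {"name": "langfuse/langfuse", "score": 82},
        {"name": "Helicone/helicone", "score": 68},
    ]
}
print(verified_tools(sample))  # ['langfuse/langfuse']
```

Using the stdlib `urllib` keeps the sketch dependency-free; swapping in `requests` or `httpx` is equally reasonable if they are already in your project.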

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | langfuse/langfuse | 🪢 Open source LLM engineering platform: LLM Observability, metrics, evals,... | 82 | Verified |
| 2 | Arize-ai/phoenix | AI Observability & Evaluation | 81 | Verified |
| 3 | Mirascope/mirascope | The LLM Anti-Framework | 74 | Verified |
| 4 | Agenta-AI/agenta | The open-source LLMOps platform: prompt playground, prompt management, LLM... | 69 | Established |
| 5 | Helicone/helicone | 🧊 Open source LLM observability platform. One line of code to monitor,... | 68 | Established |
| 6 | algorithmicsuperintelligence/optillm | Optimizing inference proxy for LLMs | 62 | Established |
| 7 | TensorOpsAI/LLMstudio | Framework to bring LLM applications to production | 61 | Established |
| 8 | Scale3-Labs/langtrace | Langtrace 🔍 is an open-source, OpenTelemetry-based end-to-end... | 51 | Established |
| 9 | langfuse/langfuse-java | 🪢 Auto-generated Java client for the Langfuse API | 49 | Emerging |
| 10 | AnchoringAI/anchoring-ai | An open-source no-code tool for teams to collaborate on building,... | 46 | Emerging |
| 11 | whylabs/langkit | 🔍 LangKit: An open-source toolkit for monitoring Large Language Models... | 43 | Emerging |
| 12 | TrentPierce/PolyCouncil | PolyCouncil is an open-source multi-model deliberation engine for LM Studio.... | 40 | Emerging |
| 13 | tenemos/langwatch | The open LLM Ops platform - Traces, Analytics, Evaluations, Datasets and... | 39 | Emerging |
| 14 | brokle-ai/brokle | The AI engineering platform for AI teams. Observability, evaluation, and... | 37 | Emerging |
| 15 | ksm26/Evaluating-AI-Agents | A hands-on course repository for Evaluating AI Agents, created with Arize... | 23 | Experimental |
| 16 | chirindaopensource/multi_agent_system_architecture_for_federal_funds_target_rate_prediction | End-to-end Python implementation of "FedSight AI" multi-agent system for... | 22 | Experimental |
| 17 | MagicTeaMC/dnsLM | dnsLM: Where AI meets DNS, because even domains deserve a little intelligence! | 19 | Experimental |
| 18 | rahatmoktadir03/llm-evaluation-platform | A full-stack web application for comparing and analyzing the performance of... | 17 | Experimental |
| 19 | promplate/trace | Integrates with @langfuse or LangSmith - plug-and-play observability for @promplate | 15 | Experimental |
| 20 | Uplay111/Loki-s-Insight- | A lightweight visual dashboard to inspect and edit OpenClaw AI agent memory... | 14 | Experimental |
| 21 | Tarunjit45/ModelPulse | ModelPulse helps maintain model reliability and performance by providing... | 14 | Experimental |
| 22 | VicRejkia/LLM-Sherpa | A Python GUI tool to package a codebase into a single, context-rich Markdown... | 13 | Experimental |
| 23 | alhemdrew/self-hosted-llm-infrastructure | Deployment of a self-hosted LLM infrastructure using Ollama and Open WebUI... | 13 | Experimental |
| 24 | marco-ruiz/llm-repo | Framework that translates LLM responses to structured data models | 13 | Experimental |
| 25 | vshwsh/prod-evals-cookbook | 🎯 Build effective AI evaluations through a hands-on tutorial, using a... | 13 | Experimental |
| 26 | airfold/airlang | ⚡ From zero to monitoring LLMs in 5 minutes ⚡ | 12 | Experimental |
| 27 | tooniez/llm-toolkit | 🛠️ A collection of prompts, tools and functions to provide researchers with... | 11 | Experimental |