All AI Evaluation Tools

216 tools ranked by quality score · Page 3 of 3

Showing 201–216 of 216
Rank Tool Score Tier
201 maxi4youuu/RePRo

🧠 Enhance raw prompts into optimized, powerful versions for AI tools like...

26
Experimental
202 Anarv2104/Inflion

Observability and influence tracing infrastructure for multi-agent AI systems.

26
Experimental
203 HiThink-Research/FinMTM

[ACL 2026] FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning...

25
Experimental
204 fourdollars/cella

A terminal UI and CLI for managing and monitoring LXD + Docker containers —...

25
Experimental
205 FelixBroesamle/s2mflow

Meta-generator: generating multicommodity flow instances from...

24
Experimental
206 iazaran/trace-replay

High-fidelity process tracking, deterministic replay, and AI-powered...

24
Experimental
207 Basaltlabs-app/Gauntlet

Community-driven behavioral reliability benchmark for LLMs. 88 probes across...

24
Experimental
208 SagarMaheshwary/reqlog

Fast CLI to search and trace logs across services or single files using...

24
Experimental
209 TomasVenkrbec/lazyline

Zero-config line-level Python profiler. No decorators, no code changes....

24
Experimental
210 0xMilord/better-logger

Execution flow debugger for modern apps. Turn scattered `console.log` calls...

24
Experimental
211 vikpant/strategic-coopetition

Coopetition-Gym: A research-grade mixed-motive multi-agent reinforcement...

23
Experimental
212 bajajku/VAC

Develop and evaluate a trauma-informed LLM-based chatbot that is...

22
Experimental
213 parsamivehchi/tps.sh

tps.sh — Tokens Per Second LLM Benchmark. 7 models, 147 tests, 21 prompts...

18
Experimental
214 Zxela/claude-monitor

Real-time dashboard for monitoring Claude Code sessions — live token usage,...

16
Experimental
215 pilhuhn/otel-oql

An experiment in creating an OpenTelemetry backend

16
Experimental
216 MarkIvor/officeiq

Research question: can the "office intelligence" of an LLM be measured? An attempt...

15
Experimental