All AI Evaluation Tools

216 tools ranked by quality score · Page 3 of 3

Showing 201–216 of 216

#	Tool	Score	Tier	Category	Stars	Language
201	maxi4youuu/RePRo 🧠 Enhance raw prompts into optimized, powerful versions for AI tools like...	26	Experimental	—	2	TypeScript
202	Anarv2104/Inflion Observability and influence tracing infrastructure for multi-agent AI systems.	26	Experimental	—	2	Python
203	HiThink-Research/FinMTM [ACL 2026] FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning...	25	Experimental	—	25	Python
204	fourdollars/cella A terminal UI and CLI for managing and monitoring LXD + Docker containers —...	25	Experimental	—	3	Go
205	FelixBroesamle/s2mflow Meta-generator: generating multicommodity flow instances from...	24	Experimental	—	2	Rust
206	iazaran/trace-replay High-fidelity process tracking, deterministic replay, and AI-powered...	24	Experimental	—	2	PHP
207	Basaltlabs-app/Gauntlet Community-driven behavioral reliability benchmark for LLMs. 88 probes across...	24	Experimental	—	2	Python
208	SagarMaheshwary/reqlog Fast CLI to search and trace logs across services or single files using...	24	Experimental	—	2	Go
209	TomasVenkrbec/lazyline Zero-config line-level Python profiler. No decorators, no code changes....	24	Experimental	—	2	Python
210	0xMilord/better-logger Execution flow debugger for modern apps. Turn scattered `console.log` calls...	24	Experimental	—	2	TypeScript
211	vikpant/strategic-coopetition Coopetition-Gym: A research-grade mixed-motive multi-agent reinforcement...	23	Experimental	—	2	Python
212	bajajku/VAC Develop and evaluate a trauma-informed LLM-based chatbot that is...	22	Experimental	—	2	Python
213	parsamivehchi/tps.sh tps.sh — Tokens Per Second LLM Benchmark. 7 models, 147 tests, 21 prompts...	18	Experimental	—	2	Python
214	Zxela/claude-monitor Real-time dashboard for monitoring Claude Code sessions — live token usage,...	16	Experimental	—	2	Go
215	pilhuhn/otel-oql An experiment in creating an OpenTelemetry backend	16	Experimental	—	2	Go
216	MarkIvor/officeiq Research question: can the "office intelligence" of an LLM be measured? An attempt...	15	Experimental	—	2	HTML
