LLM Safety & Robustness Evaluation Tools

Tools for assessing LLM trustworthiness, safety, robustness, and reliability through benchmarks, red-teaming, adversarial testing, and fault analysis. Does NOT include general performance benchmarks, domain-specific task evaluation, or code generation quality metrics.

There are 17 safety and robustness evaluation tools tracked; 2 score above 50 (the established tier). The highest-rated is microsoft/OpenRCA at 53/100, with 292 stars.

Get all 17 projects as JSON

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=safety-robustness-evaluation&limit=20"
```

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
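The same query can be issued from Python. This is a minimal sketch using only the standard library; the endpoint and query parameters come from the curl example above, but the shape of the JSON response (field names such as `name`, `score`, or `tier`) is an assumption, so inspect the actual payload before relying on it.

```python
# Sketch: build and (optionally) fetch the category query shown above.
# Only the URL and parameters are taken from this page; the response
# schema is NOT documented here and is assumed, not guaranteed.
import json
import urllib.parse
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Construct the quality-dataset query URL used in the curl example."""
    params = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE}?{params}"


url = build_url("llm-tools", "safety-robustness-evaluation")

# Anonymous access allows 100 requests/day; a free key raises that to 1,000.
# Uncomment to fetch for real (requires network access):
# with urllib.request.urlopen(url, timeout=10) as resp:
#     data = json.load(resp)
```

Keeping the network call commented out makes the snippet safe to run offline; the URL builder alone is enough to verify the query string matches the curl command.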

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | microsoft/OpenRCA | [ICLR'25] OpenRCA: Can Large Language Models Locate the Root Cause of... | 53 | Established |
| 2 | PacificAI/langtest | Deliver safe & effective language models | 53 | Established |
| 3 | Babelscape/ALERT | Official repository for the paper "ALERT: A Comprehensive Benchmark for... | 39 | Emerging |
| 4 | TrustGen/TrustEval-toolkit | [ICLR'26, NAACL'25 Demo] Toolkit & Benchmark for evaluating the... | 39 | Emerging |
| 5 | ChenWu98/agent-attack | [ICLR 2025] Dissecting adversarial robustness of multimodal language model agents | 36 | Emerging |
| 6 | Trust4AI/ASTRAL | Automated Safety Testing of Large Language Models | 34 | Emerging |
| 7 | ast-fortiss-tum/STELLAR | STELLAR: A Search-Based Testing Framework for Large Language Model... | 32 | Emerging |
| 8 | zy-ning/LinguaSafe | The official GitHub repo for the [LinguaSafe paper](https://arxiv.org/abs/2508.12733) | 32 | Emerging |
| 9 | rumaisa-azeem/llm-robots-discrimination-safety | Code and evaluation framework for assessing discrimination risks of LLMs in... | 27 | Experimental |
| 10 | thtskaran/context_window_research | 80,433-trial study of context-window sycophancy across 6 LLMs (4B–72B)... | 26 | Experimental |
| 11 | exalsius/rca-llm | An evaluation framework for root cause analysis in large-scale LLM inference systems | 25 | Experimental |
| 12 | echo-veil/ratchet-pilot | Pilot study data for The Ratchet Effect: Asymmetric Self-Description in... | 21 | Experimental |
| 13 | echo-veil/echoveil-methodology | Replication materials for The Permission Effect: How Non-Anthropomorphic... | 21 | Experimental |
| 14 | C-you-know/Action-Based-LLM-Testing-Harness | Ranking Large Language Models using the Principle of Least Action! Built... | 21 | Experimental |
| 15 | AndyChiangSH/BADGE | Code for our paper, "BADGE: BADminton report Generation and Evaluation with... | 21 | Experimental |
| 16 | burcgokden/lm-evaluation-harness-with-PLDR-LLM-kvg-cache | Fork of LM Evaluation Harness Suite for evaluating benchmarks in paper... | 20 | Experimental |
| 17 | JY0284/code_completion_as_human_action_prediction | This repository contains the core methods and models described in the paper... | 16 | Experimental |