LLM Safety & Robustness Evaluation Tools

Tools for assessing LLM trustworthiness, safety, robustness, and reliability through benchmarks, red-teaming, adversarial testing, and fault analysis. Does NOT include general performance benchmarks, domain-specific task evaluation, or code generation quality metrics.

There are 17 safety and robustness evaluation tools tracked; 2 score above 50 (the established tier). The highest-rated is microsoft/OpenRCA at 53/100, with 292 stars.

Get all 17 projects as JSON

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=safety-robustness-evaluation&limit=20"
```

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
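The same query can be issued from Python. This is a minimal sketch using only the standard library; the endpoint and query parameters come from the curl example above, but the shape of the JSON response (field names such as `name`, `score`, or `tier`) is an assumption, so inspect the actual payload before relying on it.

```python
# Sketch: build and (optionally) fetch the category query shown above.
# Only the URL and parameters are taken from this page; the response
# schema is NOT documented here and is assumed, not guaranteed.
import json
import urllib.parse
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Construct the quality-dataset query URL used in the curl example."""
    params = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE}?{params}"


url = build_url("llm-tools", "safety-robustness-evaluation")

# Anonymous access allows 100 requests/day; a free key raises that to 1,000.
# Uncomment to fetch for real (requires network access):
# with urllib.request.urlopen(url, timeout=10) as resp:
#     data = json.load(resp)
```

Keeping the network call commented out makes the snippet safe to run offline; the URL builder alone is enough to verify the query string matches the curl command.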

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | microsoft/OpenRCA | [ICLR'25] OpenRCA: Can Large Language Models Locate the Root Cause of... | 53 | Established |
| 2 | PacificAI/langtest | Deliver safe & effective language models | 53 | Established |
| 3 | Babelscape/ALERT | Official repository for the paper "ALERT: A Comprehensive Benchmark for... | 39 | Emerging |
| 4 | TrustGen/TrustEval-toolkit | [ICLR'26, NAACL'25 Demo] Toolkit & Benchmark for evaluating the... | 39 | Emerging |
| 5 | ChenWu98/agent-attack | [ICLR 2025] Dissecting adversarial robustness of multimodal language model agents | 36 | Emerging |
| 6 | Trust4AI/ASTRAL | Automated Safety Testing of Large Language Models | 34 | Emerging |
| 7 | ast-fortiss-tum/STELLAR | STELLAR: A Search-Based Testing Framework for Large Language Model... | 32 | Emerging |
| 8 | zy-ning/LinguaSafe | The official GitHub repo for the [LinguaSafe paper](https://arxiv.org/abs/2508.12733) | 32 | Emerging |
| 9 | rumaisa-azeem/llm-robots-discrimination-safety | Code and evaluation framework for assessing discrimination risks of LLMs in... | 27 | Experimental |
| 10 | thtskaran/context_window_research | 80,433-trial study of context-window sycophancy across 6 LLMs (4B–72B)... | 26 | Experimental |
| 11 | exalsius/rca-llm | An evaluation framework for root cause analysis in large-scale LLM inference systems | 25 | Experimental |
| 12 | echo-veil/ratchet-pilot | Pilot study data for The Ratchet Effect: Asymmetric Self-Description in... | 21 | Experimental |
| 13 | echo-veil/echoveil-methodology | Replication materials for The Permission Effect: How Non-Anthropomorphic... | 21 | Experimental |
| 14 | C-you-know/Action-Based-LLM-Testing-Harness | Ranking Large Language Models using the Principle of Least Action! Built... | 21 | Experimental |
| 15 | AndyChiangSH/BADGE | Code for our paper, "BADGE: BADminton report Generation and Evaluation with... | 21 | Experimental |
| 16 | burcgokden/lm-evaluation-harness-with-PLDR-LLM-kvg-cache | Fork of LM Evaluation Harness Suite for evaluating benchmarks in paper... | 20 | Experimental |
| 17 | JY0284/code_completion_as_human_action_prediction | This repository contains the core methods and models described in the paper... | 16 | Experimental |