RobustNLP/DeRTa
A novel approach to improving the safety of large language models, enabling them to transition effectively from an unsafe to a safe state.
This project makes large language models (LLMs) safer by training them to refuse harmful requests more effectively. It takes an existing LLM and specialized training data, and outputs a refined LLM that is better at identifying and declining unsafe prompts. It is designed for AI safety researchers and developers responsible for ensuring their LLM applications are secure and reliable.
No commits in the last 6 months.
Use this if you need to improve the safety of a Large Language Model, making it more robust at refusing to generate harmful or inappropriate content.
Not ideal if your primary goal is to enhance the model's performance on general tasks rather than its safety refusal capabilities.
Stars
72
Forks
2
Language
Python
License
MIT
Category
Last pushed
May 22, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/RobustNLP/DeRTa"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
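The curl command above can also be called programmatically. Below is a minimal Python sketch that builds the same endpoint URL and fetches it; note that the response schema and the authorization header name are assumptions not documented on this page, so check the API's own docs before relying on them.

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def repo_quality_url(owner: str, repo: str) -> str:
    """Build the quality-data endpoint URL for a given GitHub repo."""
    return f"{API_BASE}/{owner}/{repo}"


def fetch_repo_quality(owner: str, repo: str, api_key: str = "") -> dict:
    """Fetch quality data for a repo.

    Without a key the API allows 100 requests/day; a free key raises
    that to 1,000/day (per the note above).
    """
    req = urllib.request.Request(repo_quality_url(owner, repo))
    if api_key:
        # Header name is an assumption; the page does not specify
        # how the key should be sent.
        req.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example: the URL for this project resolves to the endpoint shown above.
url = repo_quality_url("RobustNLP", "DeRTa")
```

Calling `fetch_repo_quality("RobustNLP", "DeRTa")` would then return the parsed JSON payload for this listing.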
Higher-rated alternatives
wuyoscar/ISC-Bench
Internal Safety Collapse: Turning LLMs into a "Jailbroken State" Without "a Jailbreak Attack".
yueliu1999/Awesome-Jailbreak-on-LLMs
Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, exciting jailbreak methods...
yiksiu-chan/SpeakEasy
[ICML 2025] Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions
xirui-li/DrAttack
Official implementation of paper: DrAttack: Prompt Decomposition and Reconstruction Makes...
tmlr-group/DeepInception
[arXiv:2311.03191] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker"