RobustNLP/DeRTa
A novel approach to improving the safety of large language models, enabling them to transition effectively from an unsafe to a safe state.
This project makes large language models (LLMs) safer by training them to refuse harmful requests more effectively. It takes an existing LLM and specialized training data, and outputs a refined LLM that is better at identifying and declining unsafe prompts. It is designed for AI safety researchers and developers responsible for ensuring their LLM applications are secure and reliable.
No commits in the last 6 months.
Use this if you need to improve the safety of a Large Language Model, making it more robust at refusing to generate harmful or inappropriate content.
Not ideal if your primary goal is to enhance the model's performance on general tasks rather than its safety refusal capabilities.
Stars
72
Forks
2
Language
Python
License
MIT
Category
Last pushed
May 22, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/RobustNLP/DeRTa"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
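The curl command above can also be called programmatically. Below is a minimal Python sketch that builds the same endpoint URL and fetches it; note that the response schema and the authorization header name are assumptions not documented on this page, so check the API's own docs before relying on them.

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def repo_quality_url(owner: str, repo: str) -> str:
    """Build the quality-data endpoint URL for a given GitHub repo."""
    return f"{API_BASE}/{owner}/{repo}"


def fetch_repo_quality(owner: str, repo: str, api_key: str = "") -> dict:
    """Fetch quality data for a repo.

    Without a key the API allows 100 requests/day; a free key raises
    that to 1,000/day (per the note above).
    """
    req = urllib.request.Request(repo_quality_url(owner, repo))
    if api_key:
        # Header name is an assumption; the page does not specify
        # how the key should be sent.
        req.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example: the URL for this project resolves to the endpoint shown above.
url = repo_quality_url("RobustNLP", "DeRTa")
```

Calling `fetch_repo_quality("RobustNLP", "DeRTa")` would then return the parsed JSON payload for this listing.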
Higher-rated alternatives
wuyoscar/ISC-Bench
Internal Safety Collapse: Turning LLMs into a "Jailbroken State" Without "a Jailbreak Attack".
yueliu1999/Awesome-Jailbreak-on-LLMs
Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, exciting jailbreak methods...
yiksiu-chan/SpeakEasy
[ICML 2025] Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions
xirui-li/DrAttack
Official implementation of paper: DrAttack: Prompt Decomposition and Reconstruction Makes...
tmlr-group/DeepInception
[arXiv:2311.03191] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker"