wanglne/DELMAN
[ACL 2025 Findings] DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing
This project helps AI safety engineers and developers protect large language models (LLMs) from jailbreaking attacks. It takes an existing LLM and applies targeted model edits as a dynamic defense, reducing susceptibility to malicious prompts while preserving performance on benign tasks. The output is a more robust, secure LLM.
No commits in the last 6 months.
Use this if you are a machine learning engineer or researcher responsible for the safety and security of LLMs and need to mitigate jailbreaking attempts without compromising model performance on legitimate queries.
Not ideal if you are looking for a plug-and-play solution for end-users or do not have experience with LLM deployment and model editing techniques.
Stars: 9
Forks: —
Language: Python
License: MIT
Category: —
Last pushed: May 27, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/wanglne/DELMAN"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
UCSB-NLP-Chang/SemanticSmooth
Implementation of paper 'Defending Large Language Models against Jailbreak Attacks via Semantic...
sigeisler/reinforce-attacks-llms
REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and...
DAMO-NLP-SG/multilingual-safety-for-LLMs
[ICLR 2024] Data for "Multilingual Jailbreak Challenges in Large Language Models"
yueliu1999/FlipAttack
[ICML 2025] An official source code for paper "FlipAttack: Jailbreak LLMs via Flipping".
vicgalle/merging-self-critique-jailbreaks
"Merging Improves Self-Critique Against Jailbreak Attacks", code and models