wanglne/DELMAN
[ACL 2025 Findings] DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing
This project helps AI safety engineers and developers protect large language models (LLMs) from jailbreaking attacks. It takes an existing LLM and applies targeted model edits as a dynamic defense, reducing susceptibility to malicious prompts while preserving performance on benign tasks. The output is a more robust, secure LLM.
No commits in the last 6 months.
Use this if you are a machine learning engineer or researcher responsible for the safety and security of LLMs and need to mitigate jailbreaking attempts without compromising model performance on legitimate queries.
Not ideal if you are looking for a plug-and-play solution for end-users or do not have experience with LLM deployment and model editing techniques.
Stars: 9
Forks: —
Language: Python
License: MIT
Category: —
Last pushed: May 27, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/wanglne/DELMAN"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
UCSB-NLP-Chang/SemanticSmooth
Implementation of paper 'Defending Large Language Models against Jailbreak Attacks via Semantic...
sigeisler/reinforce-attacks-llms
REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and...
DAMO-NLP-SG/multilingual-safety-for-LLMs
[ICLR 2024] Data for "Multilingual Jailbreak Challenges in Large Language Models"
yueliu1999/FlipAttack
[ICML 2025] An official source code for paper "FlipAttack: Jailbreak LLMs via Flipping".
vicgalle/merging-self-critique-jailbreaks
"Merging Improves Self-Critique Against Jailbreak Attacks", code and models