UCSB-NLP-Chang/SemanticSmooth
Implementation of paper 'Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing'
This project helps protect large language models (LLMs) from 'jailbreak' attacks, in which users try to bypass safety measures. It takes an LLM and a potentially malicious prompt as input, processes the prompt to neutralize adversarial content, and outputs a modified prompt that the LLM can safely respond to. It is aimed at AI safety researchers and developers deploying LLMs who need to harden their models against misuse.
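The core idea described above can be sketched as randomized smoothing over semantic perturbations: query the model on several semantically perturbed copies of the prompt and aggregate the answers. This is a minimal illustrative sketch only; the transform names, the majority-vote rule, and the `llm` callable are assumptions, not the repository's actual API.

```python
import random
from collections import Counter

def perturb(prompt: str, rng: random.Random) -> str:
    """Apply one randomly chosen semantic-preserving transformation.
    These trivial string transforms are stand-ins for the paper's
    LLM-based operations (e.g. paraphrase, summarize)."""
    transforms = [
        lambda p: p.lower(),            # stand-in for "paraphrase"
        lambda p: " ".join(p.split()),  # stand-in for "summarize"
        lambda p: p.strip(),            # stand-in for another rewrite
    ]
    return rng.choice(transforms)(prompt)

def smoothed_answer(llm, prompt: str, n: int = 5, seed: int = 0) -> str:
    """Query the model on n perturbed copies and majority-vote the outputs,
    so an adversarial suffix that only works on the exact original string
    is unlikely to survive most perturbations."""
    rng = random.Random(seed)
    outputs = [llm(perturb(prompt, rng)) for _ in range(n)]
    return Counter(outputs).most_common(1)[0][0]
```

For example, `smoothed_answer(my_model, user_prompt)` returns the most common response across five perturbed queries.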
No commits in the last 6 months.
Use this if you are responsible for the security and ethical deployment of large language models and need a method to defend against adversarial prompts.
Not ideal if you are looking for a general-purpose content filter or a solution for managing data privacy within your LLM.
Stars: 23
Forks: 5
Language: Python
License: MIT
Category:
Last pushed: Jun 09, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/UCSB-NLP-Chang/SemanticSmooth"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
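The same endpoint shown in the curl command can be queried from Python with the standard library. The response schema is an assumption here; inspect the returned JSON to see which fields are actually provided.

```python
import json
import urllib.request

# Endpoint from the docs above; owner/repo are path segments.
BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"

def quality_url(owner: str, repo: str) -> str:
    """Build the API URL for a given GitHub owner/repo pair."""
    return f"{BASE}/{owner}/{repo}"

url = quality_url("UCSB-NLP-Chang", "SemanticSmooth")
print(url)

# Uncomment to fetch (no key needed up to 100 requests/day):
# with urllib.request.urlopen(url) as resp:
#     data = json.load(resp)
#     print(json.dumps(data, indent=2))
```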
Related models
sigeisler/reinforce-attacks-llms
REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and...
DAMO-NLP-SG/multilingual-safety-for-LLMs
[ICLR 2024] Data for "Multilingual Jailbreak Challenges in Large Language Models"
yueliu1999/FlipAttack
[ICML 2025] An official source code for paper "FlipAttack: Jailbreak LLMs via Flipping".
vicgalle/merging-self-critique-jailbreaks
"Merging Improves Self-Critique Against Jailbreak Attacks", code and models
wanglne/DELMAN
[ACL 2025 Findings] DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing