UCSB-NLP-Chang/SemanticSmooth
Implementation of paper 'Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing'
This project helps protect large language models (LLMs) from 'jailbreak' attacks, in which users try to bypass safety measures. It takes an LLM and a potentially malicious prompt as input, processes the prompt to neutralize adversarial content, and outputs a modified prompt that the LLM can safely respond to. It is aimed at AI safety researchers and developers deploying LLMs who need to harden their models against misuse.
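The core idea described above can be sketched as randomized smoothing over semantic perturbations: query the model on several semantically perturbed copies of the prompt and aggregate the answers. This is a minimal illustrative sketch only; the transform names, the majority-vote rule, and the `llm` callable are assumptions, not the repository's actual API.

```python
import random
from collections import Counter

def perturb(prompt: str, rng: random.Random) -> str:
    """Apply one randomly chosen semantic-preserving transformation.
    These trivial string transforms are stand-ins for the paper's
    LLM-based operations (e.g. paraphrase, summarize)."""
    transforms = [
        lambda p: p.lower(),            # stand-in for "paraphrase"
        lambda p: " ".join(p.split()),  # stand-in for "summarize"
        lambda p: p.strip(),            # stand-in for another rewrite
    ]
    return rng.choice(transforms)(prompt)

def smoothed_answer(llm, prompt: str, n: int = 5, seed: int = 0) -> str:
    """Query the model on n perturbed copies and majority-vote the outputs,
    so an adversarial suffix that only works on the exact original string
    is unlikely to survive most perturbations."""
    rng = random.Random(seed)
    outputs = [llm(perturb(prompt, rng)) for _ in range(n)]
    return Counter(outputs).most_common(1)[0][0]
```

For example, `smoothed_answer(my_model, user_prompt)` returns the most common response across five perturbed queries.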
No commits in the last 6 months.
Use this if you are responsible for the security and ethical deployment of large language models and need a method to defend against adversarial prompts.
Not ideal if you are looking for a general-purpose content filter or a solution for managing data privacy within your LLM.
Stars: 23
Forks: 5
Language: Python
License: MIT
Category:
Last pushed: Jun 09, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/UCSB-NLP-Chang/SemanticSmooth"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
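The same endpoint shown in the curl command can be queried from Python with the standard library. The response schema is an assumption here; inspect the returned JSON to see which fields are actually provided.

```python
import json
import urllib.request

# Endpoint from the docs above; owner/repo are path segments.
BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"

def quality_url(owner: str, repo: str) -> str:
    """Build the API URL for a given GitHub owner/repo pair."""
    return f"{BASE}/{owner}/{repo}"

url = quality_url("UCSB-NLP-Chang", "SemanticSmooth")
print(url)

# Uncomment to fetch (no key needed up to 100 requests/day):
# with urllib.request.urlopen(url) as resp:
#     data = json.load(resp)
#     print(json.dumps(data, indent=2))
```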
Related models
sigeisler/reinforce-attacks-llms
REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and...
DAMO-NLP-SG/multilingual-safety-for-LLMs
[ICLR 2024] Data for "Multilingual Jailbreak Challenges in Large Language Models"
yueliu1999/FlipAttack
[ICML 2025] An official source code for paper "FlipAttack: Jailbreak LLMs via Flipping".
vicgalle/merging-self-critique-jailbreaks
"Merging Improves Self-Critique Against Jailbreak Attacks", code and models
wanglne/DELMAN
[ACL 2025 Findings] DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing