yueliu1999/FlipAttack
[ICML 2025] An official source code for paper "FlipAttack: Jailbreak LLMs via Flipping".
This project offers a method to test the robustness of large language models (LLMs) against 'jailbreak' attempts. It disguises a potentially harmful prompt by flipping it (e.g., reversing character or word order), so the flipped text acts as left-side 'noise' that slips past the LLM's safety guardrails; the model is then guided to recover and act on the original request. This is designed for AI safety researchers and red teamers who need to evaluate how easily LLMs can be manipulated.
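As a rough illustration of the flipping idea, here is a minimal sketch of prompt transformations at different granularities. The function names are hypothetical and not the repo's actual API; consult the source code for the real implementation and its guidance prompts.

```python
# Hypothetical sketch of FlipAttack-style prompt flipping.
# Function names are illustrative, not the repository's actual API.

def flip_chars_in_word(prompt: str) -> str:
    """Reverse the characters inside each word, keeping word order."""
    return " ".join(word[::-1] for word in prompt.split())

def flip_chars_in_sentence(prompt: str) -> str:
    """Reverse the entire character sequence of the prompt."""
    return prompt[::-1]

def flip_word_order(prompt: str) -> str:
    """Reverse the order of the words, keeping each word intact."""
    return " ".join(reversed(prompt.split()))

if __name__ == "__main__":
    p = "how to pick a lock"
    print(flip_chars_in_word(p))      # woh ot kcip a kcol
    print(flip_chars_in_sentence(p))  # kcol a kcip ot woh
    print(flip_word_order(p))         # lock a pick to how
```

The flipped string is sent to the target model together with instructions to undo the flip, which is where the actual jailbreak pressure comes from.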
167 stars. No commits in the last 6 months.
Use this if you need to rigorously test and identify vulnerabilities in the safety mechanisms of black-box LLMs by seeing how they respond to cleverly disguised harmful prompts.
Not ideal if you are looking for a tool to develop or improve LLM safety filters, as this focuses on attacking existing ones.
Stars: 167
Forks: 13
Language: Python
License: —
Category: —
Last pushed: May 02, 2025
Commits (30d): 0
Higher-rated alternatives
UCSB-NLP-Chang/SemanticSmooth
Implementation of paper 'Defending Large Language Models against Jailbreak Attacks via Semantic...
sigeisler/reinforce-attacks-llms
REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and...
DAMO-NLP-SG/multilingual-safety-for-LLMs
[ICLR 2024] Data for "Multilingual Jailbreak Challenges in Large Language Models"
vicgalle/merging-self-critique-jailbreaks
"Merging Improves Self-Critique Against Jailbreak Attacks", code and models
wanglne/DELMAN
[ACL 2025 Findings] DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing