yueliu1999/FlipAttack

[ICML 2025] Official source code for the paper "FlipAttack: Jailbreak LLMs via Flipping".

Score: 32 / 100 (Emerging)

This project offers a method for testing the robustness of large language models (LLMs) against jailbreak attempts. It takes a potentially harmful prompt and disguises it with 'noise' added to the beginning, constructed by flipping the prompt itself, which tricks the LLM into generating undesirable content that its safety guardrails would normally block. It is intended for AI safety researchers and red teamers who need to evaluate how easily LLMs can be manipulated.
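As a rough illustration of the idea only (not the repository's actual code, which implements several flipping modes and pairs the flipped text with a prompt template instructing the LLM to recover and follow it), the disguise step can be sketched in a few lines of Python; the function names below are hypothetical:

# Minimal sketch of the flipping idea; the real attack adds guidance prompts
# on top of this transformation.
def flip_chars_in_sentence(prompt: str) -> str:
    """Reverse the character order of the whole prompt (one possible flipping mode)."""
    return prompt[::-1]

def flip_word_order(prompt: str) -> str:
    """Reverse only the word order, keeping each word intact (another possible mode)."""
    return " ".join(reversed(prompt.split()))

print(flip_chars_in_sentence("an example prompt"))  # -> "tpmorp elpmaxe na"
print(flip_word_order("an example prompt"))         # -> "prompt example an"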

167 stars. No commits in the last 6 months.

Use this if you need to rigorously test and identify vulnerabilities in the safety mechanisms of black-box LLMs by seeing how they respond to cleverly disguised harmful prompts.

Not ideal if you are looking for a tool to develop or improve LLM safety filters, as this focuses on attacking existing ones.

AI Safety Testing · LLM Red Teaming · Generative AI Security · Prompt Engineering · Model Vulnerability Assessment
No License · Stale (6 months) · No Package · No Dependents
Maintenance 2 / 25
Adoption 10 / 25
Maturity 8 / 25
Community 12 / 25
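Taken together, the four 25-point subscores appear to sum to the overall score: 2 + 10 + 8 + 12 = 32 out of 100.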


Stars: 167
Forks: 13
Language: Python
License: None
Last pushed: May 02, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/yueliu1999/FlipAttack"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
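The same data can be fetched from Python using only the standard library; this sketch assumes the endpoint returns JSON, which the page does not spell out:

import json
import urllib.request

# Minimal client for the quality endpoint shown above (illustrative only).
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/yueliu1999/FlipAttack"

with urllib.request.urlopen(URL) as resp:  # no API key needed up to 100 requests/day
    data = json.load(resp)

print(json.dumps(data, indent=2))  # pretty-print whatever fields the API returns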