Django-Jiang/BadChain

[ICLR24] Official Repo of BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

36
/ 100
Emerging

This project helps AI safety researchers and red teamers evaluate the robustness of large language models (LLMs) against subtle attacks. It takes typical LLM prompts and intentionally crafted 'backdoor' demonstration examples, then shows how the LLM's reasoning process and final answers can be subtly manipulated when specific trigger phrases are present in a query. The ideal users are researchers focused on AI security, safety, and adversarial machine learning.

No commits in the last 6 months.

Use this if you need to understand and demonstrate how malicious actors could subtly implant 'backdoor' behaviors into large language models without access to their training data or internal parameters.

Not ideal if you are looking for a tool to improve the general performance or alignment of your LLM, or if you need to perform traditional fine-tuning or prompt engineering.

AI safety LLM security adversarial AI red teaming model interpretability
Stale 6m No Package No Dependents
Maintenance 0 / 25
Adoption 8 / 25
Maturity 16 / 25
Community 12 / 25

How are scores calculated?

Stars

49

Forks

6

Language

Python

License

MIT

Last pushed

Jul 24, 2024

Commits (30d)

0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/Django-Jiang/BadChain"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.