SecHack365-Fans/prompt2slip
This library tests the ethics of language models using natural adversarial texts.
This library helps evaluate the ethical boundaries and potential risks of large language models by finding prompts that cause them to generate specific, often undesirable, target words. Given a language model and a list of target words, it produces adversarial text that forces the model to include those words in its output. This is useful for AI safety researchers, ethics auditors, and machine learning engineers responsible for deploying language AI responsibly.
No commits in the last 6 months.
Use this if you need to systematically test how easily a language model can be manipulated into producing specific, potentially harmful, or off-topic words or phrases through adversarial prompting.
Not ideal if you are looking for a general-purpose tool to improve language model performance or fine-tune models for specific tasks.
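To illustrate the general idea described above, here is a toy sketch of adversarial prompt search. This is NOT prompt2slip's actual API or algorithm; the stand-in model, the scoring function, and the random-search loop are all assumptions made only to show the shape of "find a prompt that makes the model emit a target word".

```python
# Toy sketch of adversarial prompt search (not prompt2slip's real API).
# A stand-in "model" and a random-search loop illustrate the general idea:
# mutate a prompt and keep variants that make target words more likely.
import random


def toy_model(prompt: str) -> str:
    """Stand-in for a language model: simply echoes the prompt back."""
    return prompt


def score(output: str, targets: list[str]) -> int:
    """Count how many target words appear in the model output."""
    words = output.lower().split()
    return sum(t in words for t in targets)


def random_search(prompt: str, targets: list[str], vocab: list[str],
                  steps: int = 200, seed: int = 0) -> str:
    """Randomly substitute prompt words, keeping changes that raise the score."""
    rng = random.Random(seed)
    best = prompt.split()
    best_score = score(toy_model(" ".join(best)), targets)
    for _ in range(steps):
        cand = best[:]
        cand[rng.randrange(len(cand))] = rng.choice(vocab)
        s = score(toy_model(" ".join(cand)), targets)
        if s > best_score:
            best, best_score = cand, s
    return " ".join(best)


adv = random_search("please summarize this text", ["slip"],
                    vocab=["slip", "the", "a", "model"])
print(adv)
```

A real attack would replace `toy_model` with an actual language model and `score` with a likelihood of the target words under that model; the search loop is the only part that carries over conceptually.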
Stars: 9
Forks: 2
Language: Python
License: MIT
Category:
Last pushed: Dec 04, 2021
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/SecHack365-Fans/prompt2slip"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
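The same endpoint can be queried from Python; a minimal sketch, assuming only the URL pattern shown in the curl command above (the response schema is not documented here, so the result is treated as opaque JSON):

```python
# Query the quality endpoint shown above from Python.
# Only the URL pattern comes from this page; the response fields are unknown,
# so the payload is returned as an opaque parsed-JSON object.
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality/nlp"


def quality_url(owner: str, repo: str) -> str:
    """Build the per-repository quality endpoint URL."""
    return f"{BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality record (100 requests/day without a key)."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)


print(quality_url("SecHack365-Fans", "prompt2slip"))
```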
Higher-rated alternatives
thunlp/OpenAttack
An Open-Source Package for Textual Adversarial Attack.
thunlp/TAADpapers
Must-read Papers on Textual Adversarial Attack and Defense
jind11/TextFooler
A Model for Natural Language Attack on Text Classification and Inference
thunlp/OpenBackdoor
An open-source toolkit for textual backdoor attack and defense (NeurIPS 2022 D&B, Spotlight)
thunlp/SememePSO-Attack
Code and data of the ACL 2020 paper "Word-level Textual Adversarial Attacking as Combinatorial...