Hanpx20/SafeSwitch

Official code repository for the paper "Internal Activation as the Polar Star for Steering Unsafe LLM Behavior"

Score: 30 / 100 (Emerging)

When deploying large language models (LLMs) for public use, it is crucial to prevent them from generating harmful or inappropriate content. SafeSwitch helps AI safety engineers keep LLMs helpful without over-censoring: it provides tools to train a 'safety prober' that reads internal activations to flag potentially unsafe outputs before they are fully generated, and a 'refusal head' that steers the model toward a safe refusal when the prober fires. This preserves the model's useful capabilities while blocking undesirable content.

Use this if you need to deploy an LLM that maintains high utility while dynamically and selectively preventing the generation of unsafe or harmful responses.

Not ideal if you need a simple, off-the-shelf content moderation solution for general text without needing to modify the internal workings of an LLM.
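
A minimal sketch of the prober-then-steer idea described above: a linear probe scores an internal activation for risk, and generation is routed toward a refusal when the score crosses a threshold. The names (SafetyProber, REFUSAL_THRESHOLD, should_refuse), the hidden size, and the threshold value are illustrative assumptions, not the repository's actual API.

import torch
import torch.nn as nn

class SafetyProber(nn.Module):
    """Linear probe mapping an internal activation to an unsafe-probability."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 1)

    def forward(self, activation: torch.Tensor) -> torch.Tensor:
        # activation: (batch, hidden_size) hidden state from an intermediate layer
        return torch.sigmoid(self.classifier(activation)).squeeze(-1)

REFUSAL_THRESHOLD = 0.5  # assumed decision threshold, tuned in practice

def should_refuse(prober: SafetyProber, activation: torch.Tensor) -> torch.Tensor:
    """Return a boolean mask: True where the probe flags the input as unsafe."""
    with torch.no_grad():
        return prober(activation) > REFUSAL_THRESHOLD

if __name__ == "__main__":
    hidden_size = 4096                       # assumed model width
    prober = SafetyProber(hidden_size)
    fake_activation = torch.randn(2, hidden_size)  # stand-in for real hidden states
    # Where True, generation would be steered through the refusal head instead.
    print(should_refuse(prober, fake_activation))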

Tags: LLM deployment, AI safety, content moderation, responsible AI, generative AI
No License · No Package · No Dependents
Maintenance: 6 / 25
Adoption: 5 / 25
Maturity: 8 / 25
Community: 11 / 25


Stars: 13
Forks: 2
Language: Jupyter Notebook
License: none
Last pushed: Nov 06, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/Hanpx20/SafeSwitch"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
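
The same endpoint can also be queried from Python. The sketch below assumes only that the endpoint returns JSON; the exact response schema is not documented here.

import requests

url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/Hanpx20/SafeSwitch"
resp = requests.get(url, timeout=10)   # no API key needed for the free tier
resp.raise_for_status()                # surface HTTP errors (e.g., rate limiting)
print(resp.json())                     # quality metrics as returned by the endpoint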