Hanpx20/SafeSwitch
Official code repository for the paper "Internal Activation as the Polar Star for Steering Unsafe LLM Behavior"
When deploying large language models (LLMs) for public use, it's crucial to prevent them from generating harmful or inappropriate content. SafeSwitch helps AI safety engineers keep LLMs helpful without over-censoring: it provides tools to train a 'safety prober' that reads the model's internal activations to flag potentially unsafe outputs before they are fully generated, and a 'refusal head' that steers the model toward a safe refusal when the prober fires. This preserves the LLM's useful capabilities while preventing the creation of undesirable content.
Use this if you need to deploy an LLM that maintains high utility while dynamically and selectively preventing the generation of unsafe or harmful responses.
Not ideal if you need a simple, off-the-shelf content moderation solution for general text without needing to modify the internal workings of an LLM.
Stars
13
Forks
2
Language
Jupyter Notebook
License
—
Category
Last pushed
Nov 06, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/Hanpx20/SafeSwitch"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
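The curl example above can also be called from Python. This is a minimal sketch assuming only what the curl command shows: a GET endpoint at `/api/v1/quality/llm-tools/{owner}/{repo}` that returns JSON. The response schema and any API-key header are not documented here, so the sketch makes an unauthenticated request and returns the parsed body as-is.

```python
import json
import urllib.request

# Base path taken from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def build_url(owner: str, repo: str) -> str:
    # Mirrors the owner/repo path segments used in the curl example.
    return f"{BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    # Unauthenticated GET; the free tier allows 100 requests/day.
    # The JSON schema is not specified here, so we return it untouched.
    with urllib.request.urlopen(build_url(owner, repo)) as resp:
        return json.load(resp)
```

For example, `fetch_quality("Hanpx20", "SafeSwitch")` fetches the same data as the curl command.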
Higher-rated alternatives
cvs-health/langfair
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
BetterForAll/HonestyMeter
HonestyMeter: An NLP-powered framework for evaluating objectivity and bias in media content,...
bws82/biasclear
Structural bias detection and correction engine built on Persistent Influence Theory (PIT)
KID-22/LLM-IR-Bias-Fairness-Survey
This is the repo for the survey of Bias and Fairness in IR with LLMs.
faiyazabdullah/TranslationTangles
Uncovering Performance Gaps and Bias Patterns in LLM-Based Translations Across Language Families...