Hanpx20/SafeSwitch
Official code repository for the paper "Internal Activation as the Polar Star for Steering Unsafe LLM Behavior"
When deploying large language models (LLMs) for public use, it's crucial to prevent them from generating harmful or inappropriate content. SafeSwitch helps AI safety engineers keep LLMs helpful without over-censoring: it provides tools to train a 'safety prober' that reads the model's internal activations to flag potentially unsafe outputs before they are fully generated, and a 'refusal head' that steers the model toward a safe refusal when the prober fires. This preserves the LLM's useful capabilities while preventing the creation of undesirable content.
Use this if you need to deploy an LLM that maintains high utility while dynamically and selectively preventing the generation of unsafe or harmful responses.
Not ideal if you need a simple, off-the-shelf content moderation solution for general text without needing to modify the internal workings of an LLM.
Stars
13
Forks
2
Language
Jupyter Notebook
License
—
Category
Last pushed
Nov 06, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/Hanpx20/SafeSwitch"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
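The curl example above can also be called from Python. This is a minimal sketch assuming only what the curl command shows: a GET endpoint at `/api/v1/quality/llm-tools/{owner}/{repo}` that returns JSON. The response schema and any API-key header are not documented here, so the sketch makes an unauthenticated request and returns the parsed body as-is.

```python
import json
import urllib.request

# Base path taken from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def build_url(owner: str, repo: str) -> str:
    # Mirrors the owner/repo path segments used in the curl example.
    return f"{BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    # Unauthenticated GET; the free tier allows 100 requests/day.
    # The JSON schema is not specified here, so we return it untouched.
    with urllib.request.urlopen(build_url(owner, repo)) as resp:
        return json.load(resp)
```

For example, `fetch_quality("Hanpx20", "SafeSwitch")` fetches the same data as the curl command.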
Higher-rated alternatives
cvs-health/langfair
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
BetterForAll/HonestyMeter
HonestyMeter: An NLP-powered framework for evaluating objectivity and bias in media content,...
bws82/biasclear
Structural bias detection and correction engine built on Persistent Influence Theory (PIT)
KID-22/LLM-IR-Bias-Fairness-Survey
This is the repo for the survey of Bias and Fairness in IR with LLMs.
faiyazabdullah/TranslationTangles
Uncovering Performance Gaps and Bias Patterns in LLM-Based Translations Across Language Families...