zjunlp/AutoSteer
[EMNLP 2025] AutoSteer: Automating Steering for Safe Multimodal Large Language Models
AutoSteer is a framework for steering multimodal large language models (MLLMs) toward safe and appropriate outputs. Given an existing MLLM and training datasets containing potentially harmful images (e.g. NSFW or violent content), it trains and integrates a 'steer matrix' and a 'prober' that reduce harmful responses at inference time. The tool targets AI safety researchers and MLLM developers working on mitigating bias and toxic output.
No commits in the last 6 months.
Use this if you are developing or deploying multimodal large language models and need to reduce the generation of harmful content during inference.
Not ideal if you are looking for a pre-packaged, zero-configuration solution for general-purpose MLLM safety without needing to interact with model internals or dataset preparation.
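To make the prober/steer-matrix idea above concrete, here is a minimal inference-time sketch. This is not AutoSteer's actual API; the function names, the logistic prober, and the toy steer matrix are all illustrative assumptions about how such a component could gate a steering intervention on a hidden activation.

```python
import numpy as np

def prober_score(hidden, w, b):
    """Hypothetical safety prober: logistic score on a hidden state (1.0 = unsafe)."""
    return 1.0 / (1.0 + np.exp(-(hidden @ w + b)))

def steer_step(hidden, w, b, steer_matrix, threshold=0.5):
    """If the prober flags the activation as unsafe, apply the steer matrix;
    otherwise pass the activation through unchanged."""
    if prober_score(hidden, w, b) > threshold:
        return hidden @ steer_matrix
    return hidden

# Toy demonstration with random weights (purely illustrative).
rng = np.random.default_rng(0)
d = 8
hidden = rng.normal(size=d)
w = rng.normal(size=d)
steer = 0.5 * np.eye(d)  # toy steer matrix that shrinks the activation
out = steer_step(hidden, w, b=0.0, steer_matrix=steer)
```

In the real framework, the steer matrix and prober weights are learned from the harmful-content datasets; this sketch only shows where such a gate would sit in the forward pass.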
Stars: 13
Forks: 3
Language: Python
License: MIT
Last pushed: Aug 21, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/zjunlp/AutoSteer"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
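The same endpoint can be queried from Python with only the standard library. A minimal sketch, assuming the endpoint returns JSON (the response schema is not documented here, so the result is handled as an opaque dict, and network failures return None):

```python
import json
import urllib.request

# Endpoint shown in the listing; anonymous access is limited to 100 requests/day.
URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/zjunlp/AutoSteer"

def fetch_repo_quality(url=URL, timeout=10):
    """Fetch the JSON quality record for the repo; returns None on network failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except OSError:
        return None

record = fetch_repo_quality()
```

With a free API key, the daily limit rises to 1,000 requests; how the key is passed (header vs. query parameter) is not specified on this page.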
Higher-rated alternatives
zjunlp/KnowledgeEditingPapers
Must-read Papers on Knowledge Editing for Large Language Models.
zjunlp/CaKE
[EMNLP 2025] Circuit-Aware Editing Enables Generalizable Knowledge Learners
zjunlp/unlearn
[ACL 2025] Knowledge Unlearning for Large Language Models
OFA-Sys/Ditto
A self-alignment method for role-play. Benchmark for role-play. Resources for "Large Language...
VinAIResearch/HPR
Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with...