declare-lab/resta
Restore safety in fine-tuned language models through task arithmetic
This project helps machine learning engineers and researchers make their fine-tuned language models safer and less likely to generate harmful content. Given a language model that has been fine-tuned for a specific task but may have lost some of its safety alignment, it adds a "safety vector" to produce a new version that preserves task performance while significantly reducing harmful outputs. It is aimed at professionals building or deploying custom Large Language Models.
No commits in the last 6 months.
Use this if you have fine-tuned a large language model for a specific task and are concerned that it now generates unsafe, biased, or harmful responses.
Not ideal if you are looking for a pre-built, ready-to-use safe language model rather than a method to improve the safety of your own custom models.
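The core idea behind the safety vector is task arithmetic on model weights: subtract an unaligned model's weights from a safety-aligned model's weights to get a "safety vector," then add it (optionally scaled) to the fine-tuned model. The sketch below illustrates that arithmetic on plain parameter dictionaries; the function name, the `scale` parameter, and the dict-of-parameters representation are illustrative assumptions, not the repo's actual API.

```python
def add_safety_vector(finetuned, aligned, unaligned, scale=1.0):
    """Illustrative sketch of the task-arithmetic idea: restore safety by
    adding a safety vector (aligned - unaligned) to fine-tuned weights.

    All three arguments map parameter names to numeric weights (floats here;
    the same element-wise arithmetic applies to real weight tensors).
    This is a hypothetical helper, not the repository's actual interface.
    """
    return {
        name: w + scale * (aligned[name] - unaligned[name])
        for name, w in finetuned.items()
    }


# Toy example with single scalar "weights" per parameter name:
finetuned = {"layer.weight": 1.0}
aligned = {"layer.weight": 0.5}    # safety-aligned base model
unaligned = {"layer.weight": 0.2}  # same model with alignment removed
restored = add_safety_vector(finetuned, aligned, unaligned)
# restored["layer.weight"] == 1.0 + (0.5 - 0.2) == 1.3
```

With real models the same element-wise operation would run over each tensor in the models' state dicts, and `scale` would control how strongly safety is restored relative to task performance.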
Stars
32
Forks
1
Language
Python
License
—
Category
Last pushed
Mar 28, 2024
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/declare-lab/resta"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
steering-vectors/steering-vectors
Steering vectors for transformer language models in Pytorch / Huggingface
jianghoucheng/AlphaEdit
AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models, ICLR 2025 (Outstanding Paper)
kmeng01/memit
Mass-editing thousands of facts into a transformer memory (ICLR 2023)
boyiwei/alignment-attribution-code
[ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
jianghoucheng/AnyEdit
AnyEdit: Edit Any Knowledge Encoded in Language Models, ICML 2025