declare-lab/resta
Restore safety in fine-tuned language models through task arithmetic
This project helps machine learning engineers and researchers make their fine-tuned language models safer and less likely to generate harmful content. Given a language model that has been fine-tuned for a specific task but may have lost some of its safety alignment, it adds a "safety vector" to produce a new version that preserves task performance while significantly reducing harmful outputs. It is aimed at professionals building or deploying custom Large Language Models.
No commits in the last 6 months.
Use this if you have fine-tuned a large language model for a specific task and are concerned that it now generates unsafe, biased, or harmful responses.
Not ideal if you are looking for a pre-built, ready-to-use safe language model rather than a method to improve the safety of your own custom models.
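The core idea behind the safety vector is task arithmetic on model weights: subtract an unaligned model's weights from a safety-aligned model's weights to get a "safety vector," then add it (optionally scaled) to the fine-tuned model. The sketch below illustrates that arithmetic on plain parameter dictionaries; the function name, the `scale` parameter, and the dict-of-parameters representation are illustrative assumptions, not the repo's actual API.

```python
def add_safety_vector(finetuned, aligned, unaligned, scale=1.0):
    """Illustrative sketch of the task-arithmetic idea: restore safety by
    adding a safety vector (aligned - unaligned) to fine-tuned weights.

    All three arguments map parameter names to numeric weights (floats here;
    the same element-wise arithmetic applies to real weight tensors).
    This is a hypothetical helper, not the repository's actual interface.
    """
    return {
        name: w + scale * (aligned[name] - unaligned[name])
        for name, w in finetuned.items()
    }


# Toy example with single scalar "weights" per parameter name:
finetuned = {"layer.weight": 1.0}
aligned = {"layer.weight": 0.5}    # safety-aligned base model
unaligned = {"layer.weight": 0.2}  # same model with alignment removed
restored = add_safety_vector(finetuned, aligned, unaligned)
# restored["layer.weight"] == 1.0 + (0.5 - 0.2) == 1.3
```

With real models the same element-wise operation would run over each tensor in the models' state dicts, and `scale` would control how strongly safety is restored relative to task performance.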
Stars
32
Forks
1
Language
Python
License
—
Category
Last pushed
Mar 28, 2024
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/declare-lab/resta"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
steering-vectors/steering-vectors
Steering vectors for transformer language models in Pytorch / Huggingface
jianghoucheng/AlphaEdit
AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models, ICLR 2025 (Outstanding Paper)
kmeng01/memit
Mass-editing thousands of facts into a transformer memory (ICLR 2023)
boyiwei/alignment-attribution-code
[ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
jianghoucheng/AnyEdit
AnyEdit: Edit Any Knowledge Encoded in Language Models, ICML 2025