git-disl/Lisa
This is the official code for the paper "Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning" (NeurIPS 2024).
This project helps machine learning engineers and researchers make large language models (LLMs) safer by defending against harmful fine-tuning. It takes a pre-trained LLM and a dataset for fine-tuning, then processes them to produce a fine-tuned LLM that is more resistant to generating unsafe content, even if trained on malicious data. The primary users are those responsible for deploying and maintaining safe AI systems.
No commits in the last 6 months.
Use this if you are a machine learning engineer concerned about your large language models being manipulated by harmful fine-tuning data.
Not ideal if you are looking for a defense against harmful content at the pre-training or post-fine-tuning stages, as this tool specifically targets the fine-tuning process.
Stars: 26
Forks: —
Language: Python
License: Apache-2.0
Category:
Last pushed: Sep 10, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/git-disl/Lisa"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
steering-vectors/steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
jianghoucheng/AlphaEdit
AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models, ICLR 2025 (Outstanding Paper)
kmeng01/memit
Mass-editing thousands of facts into a transformer memory (ICLR 2023)
boyiwei/alignment-attribution-code
[ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
jianghoucheng/AnyEdit
AnyEdit: Edit Any Knowledge Encoded in Language Models, ICML 2025