git-disl/Lisa
This is the official code for the paper "Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning" (NeurIPS 2024).
This project helps machine learning engineers and researchers make large language models (LLMs) safer by defending against harmful fine-tuning. It takes a pre-trained LLM and a dataset for fine-tuning, then processes them to produce a fine-tuned LLM that is more resistant to generating unsafe content, even if trained on malicious data. The primary users are those responsible for deploying and maintaining safe AI systems.
No commits in the last 6 months.
Use this if you are a machine learning engineer concerned about your large language models being manipulated by harmful fine-tuning data.
Not ideal if you are looking for a defense against harmful content at the pre-training or post-fine-tuning stages, as this tool specifically targets the fine-tuning process.
Stars: 26
Forks: —
Language: Python
License: Apache-2.0
Category:
Last pushed: Sep 10, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/git-disl/Lisa"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
steering-vectors/steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
jianghoucheng/AlphaEdit
AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models, ICLR 2025 (Outstanding Paper)
kmeng01/memit
Mass-editing thousands of facts into a transformer memory (ICLR 2023)
boyiwei/alignment-attribution-code
[ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
jianghoucheng/AnyEdit
AnyEdit: Edit Any Knowledge Encoded in Language Models, ICML 2025