git-disl/Vaccine
This is the official code for the paper "Vaccine: Perturbation-aware Alignment for Large Language Models" (NeurIPS 2024).
This project offers a way to 'immunize' large language models (LLMs) against being fine-tuned with harmful or undesirable content. It takes an existing LLM, applies a special alignment process, and produces a more robust LLM that resists learning harmful behaviors during subsequent fine-tuning. This is for researchers or engineers who are building and deploying LLMs and want to ensure their models remain safe and ethical.
Use this if you are developing or deploying LLMs and are concerned about them being fine-tuned on malicious or harmful datasets.
Not ideal if you are a general user of an LLM who wants to prevent it from generating harmful content; this tool is for those who train and align the models.
Stars: 49
Forks: 5
Language: Shell
License: Apache-2.0
Last pushed: Jan 15, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/git-disl/Vaccine"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
THU-BPM/MarkLLM
MarkLLM: An Open-Source Toolkit for LLM Watermarking (EMNLP 2024 System Demonstration)
zjunlp/Deco
[ICLR 2025] MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation
HillZhang1999/ICD
Code & Data for our Paper "Alleviating Hallucinations of Large Language Models through Induced...
voidism/DoLa
Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality...
kaist-cvml/I-HallA-v1.0
[AAAI 2025] Official Implementation of I-HallA v1.0