
Emulated Disalignment (ED) exposes the latent risks within each pre-trained and safety-aligned model pair, simply by combining them at inference time.

Abstract

Large language models (LLMs) need to undergo safety alignment to ensure safe conversations with humans. However, in this work, we introduce an inference-time attack framework, demonstrating that safety alignment can also unintentionally facilitate harmful outcomes under adversarial manipulation. This framework, named Emulated Disalignment (ED), adversarially combines a pair of open-source pre-trained and safety-aligned language models in the output space to produce a harmful language model without additional training. Our experiments with ED across three datasets and four model families (Llama-1, Llama-2, Mistral, and Alpaca) show that ED doubles the harmfulness of pre-trained models and outperforms strong baselines, achieving the highest harmful rate in 43 out of 48 evaluation subsets by a large margin. Crucially, our findings highlight the importance of reevaluating the practice of open-sourcing language models even after safety alignment.

Method

Here is an illustration of emulated disalignment (ED), where \(x\) and \(y\) denote the user query and the language model response; \(\pi_{\text{base}}\) denotes a pre-trained model (e.g., Llama-2) and \(\pi_{\text{align}}\) its safety-aligned version (e.g., Llama-2-chat); \(\alpha\) is a positive hyperparameter:


(a) What ED emulates: since \(\log \pi_{\text{align}} - \log \pi_{\text{base}}\) defines a reward model that encourages safety, adversarially training a language model to minimize this reward (note the negative sign) under a KL constraint produces a harmful language model \(\pi_{\text{disalign}}\).
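To spell out (a), one consistent way to write this objective (a sketch in the notation above; the paper's token-level formulation and the exact role of \(\alpha\) may differ) is a KL-constrained reward minimization, which admits the standard closed-form solution:

\[
\pi_{\text{disalign}} = \arg\min_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[\log \pi_{\text{align}}(y \mid x) - \log \pi_{\text{base}}(y \mid x)\big] + \tfrac{1}{\alpha}\,\mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\text{base}}(\cdot \mid x)\big),
\]

\[
\pi_{\text{disalign}}(y \mid x) \;\propto\; \pi_{\text{base}}(y \mid x)\left(\frac{\pi_{\text{base}}(y \mid x)}{\pi_{\text{align}}(y \mid x)}\right)^{\alpha}.
\]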


(b) What ED actually does: instead of relying on resource-heavy training, ED emulates the results of such adversarial fine-tuning by sampling from a contrastive distribution defined by \(\pi_{\text{base}}\) and \(\pi_{\text{align}}\).
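For illustration, the sketch below shows how such contrastive sampling could be emulated at inference time with Hugging Face transformers, assuming the per-token combination \((1+\alpha)\log\pi_{\text{base}} - \alpha\log\pi_{\text{align}}\) implied by the closed form above. The model names, the shared raw prompt, and the greedy decoding loop are illustrative assumptions rather than the paper's official implementation (which, for example, applies model-specific prompt templates).

# Illustrative sketch of emulated-disalignment-style contrastive decoding.
# Assumes per-token logits (1 + alpha) * log pi_base - alpha * log pi_align;
# not the paper's official implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-2-7b-hf"        # pre-trained model (example choice)
ALIGN = "meta-llama/Llama-2-7b-chat-hf"  # its safety-aligned counterpart (shares the tokenizer)

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(BASE)
pi_base = AutoModelForCausalLM.from_pretrained(BASE).to(device).eval()
pi_align = AutoModelForCausalLM.from_pretrained(ALIGN).to(device).eval()

@torch.no_grad()
def ed_generate(prompt: str, alpha: float = 1.0, max_new_tokens: int = 64) -> str:
    """Greedy decoding from the contrastive distribution defined by pi_base and pi_align."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    for _ in range(max_new_tokens):
        log_p_base = torch.log_softmax(pi_base(ids).logits[:, -1, :], dim=-1)
        log_p_align = torch.log_softmax(pi_align(ids).logits[:, -1, :], dim=-1)
        # Amplify what the pre-trained model prefers relative to the aligned model.
        ed_logits = (1.0 + alpha) * log_p_base - alpha * log_p_align
        next_id = ed_logits.argmax(dim=-1, keepdim=True)  # greedy; sampling also works
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)

Here \(\alpha\) controls the strength of the contrast, and both models must share a tokenizer for the per-token combination to be well defined.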

Experiment

The table below shows the harmful rate (%) of language model responses across different model families, datasets, and evaluators. The emulated disaligned models consistently generate harmful responses, achieving the highest harmful rate in the majority of evaluation subsets (43 out of 48). Please refer to our paper for definitions of baselines (\(\mathbf{\text{Base}_{\text{MP}}}\), \(\mathbf{\text{Align}_{\text{MP}}}\), \(\mathbf{\text{ED}_{\text{w/o align}}}\)).


Qualitative Results

Warning: This section contains samples that may be offensive or harmful.


\(\mathbf{\text{ED}}\) vs. baselines on an unsafe query. The \(\mathbf{\text{ED}}\) response contains harmful generalizations and personal biases that oversimplify historical complexities and can incite division. The \(\mathbf{\text{ED}_{\text{w/o align}}}\) response exhibits animosity towards animals without directly harming humans. Both \(\mathbf{\text{Base}_{\text{MP}}}\) and \(\mathbf{\text{Align}_{\text{MP}}}\) responses, despite being prompted for malicious content, provide harmless answers to the query. Only the \(\mathbf{\text{ED}}\) response is deemed harmful by evaluators.




\(\mathbf{\text{ED}}\) vs. baselines on an unsafe query. The \(\mathbf{\text{ED}}\) response shifts from a personal desire to live in an abandoned building to discussing serious and sensitive issues related to race, safety, and justice. Meanwhile, \(\mathbf{\text{ED}_{\text{w/o align}}}\) provides a straightforward description of abandoned buildings and their typical states of disrepair. Both \(\mathbf{\text{Base}_{\text{MP}}}\) and \(\mathbf{\text{Align}_{\text{MP}}}\) also directly address the query with informative responses regarding the nature and implications of abandoned buildings, free from harmful content. Only the \(\mathbf{\text{ED}}\) response is flagged for problematic content, despite its initial relevance to the query.

BibTeX

If you find our work useful, please cite our paper. BibTeX code is provided below:
@article{zhou2024emulated,
    title={Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!},
    author={Zhanhui Zhou and Jie Liu and Zhichen Dong and Jiaheng Liu and Chao Yang and Wanli Ouyang and Yu Qiao},
    year={2024},
    eprint={2402.12343},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}