Last updated April 24, 2024
In AI News & Update

OpenAI Introduces Instruction Hierarchy to Protect LLMs from Jailbreaks and Prompt Injections

OpenAI proposes that when multiple instructions are presented to the model, lower-privileged instructions should only be followed if they are aligned with higher-privileged ones.

Share

Published on April 23, 2024

by Mohit Pandey

Listen to this story

In response to increasing vulnerabilities of LLMs to prompt injections, jailbreaks, and other attacks, OpenAI has proposed an instruction hierarchy. This hierarchy aims to address the primary vulnerability underlying these attacks, where LLMs often treat all instructions with the same priority, regardless of the source.

According to OpenAI’s paper, the lack of a clear instruction hierarchy in modern LLMs leaves them vulnerable to various attacks. To mitigate this, OpenAI proposes an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. This hierarchy would enable LLMs to defer to higher-privileged instructions in case of conflicts.

OpenAI proposes that when multiple instructions are presented to the model, lower-privileged instructions should only be followed if they are aligned with higher-privileged ones.

Aligned instructions have the same constraints, rules, or goals as higher-level instructions and should be followed by the LLM.

However, misaligned instructions, which directly oppose the original instruction or are orthogonal to it, should be ignored by the model.

To implement the instruction hierarchy, OpenAI proposes two approaches:

Context Synthesis: For aligned instructions, examples are generated using a method called context synthesis. Instructions are decomposed into smaller pieces and placed at different levels of the hierarchy. Models are then trained to predict the original ground-truth response.
Context Ignorance: For misaligned instructions, models are trained to predict the same answer they would have generated if they never saw the lower-level instructions.

OpenAI fine-tuned GPT-3.5 Turbo using supervised fine-tuning and reinforcement learning from human feedback on the proposed instruction hierarchy. The evaluation showed that the instruction hierarchy improves safety results on all main evaluations, increasing robustness by up to 63%. The model also exhibited generalisation to evaluation criteria excluded from training, increasing robustness by up to 34%.

OpenAI plans to scale up data collection efforts to further improve model performance and refine its refusal decision boundary. Future work will focus on refining how models handle conflicting instructions, exploring multimodal instruction hierarchy data, implementing model architecture changes, and conducting more explicit adversarial training to enhance model robustness.

Access all our open Survey & Awards Nomination forms in one place