MITB Banner

OpenAI Introduces Instruction Hierarchy to Protect LLMs from Jailbreaks and Prompt Injections

OpenAI proposes that when multiple instructions are presented to the model, lower-privileged instructions should only be followed if they are aligned with higher-privileged ones.

Share

OpenAI Introduces Instruction Hierarchy to Protect LLMs from Jailbreaks and Prompt Injections
Listen to this story

In response to increasing vulnerabilities of LLMs to prompt injections, jailbreaks, and other attacks, OpenAI has proposed an instruction hierarchy. This hierarchy aims to address the primary vulnerability underlying these attacks, where LLMs often treat all instructions with the same priority, regardless of the source.

According to OpenAI’s paper, the lack of a clear instruction hierarchy in modern LLMs leaves them vulnerable to various attacks. To mitigate this, OpenAI proposes an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. This hierarchy would enable LLMs to defer to higher-privileged instructions in case of conflicts.

OpenAI proposes that when multiple instructions are presented to the model, lower-privileged instructions should only be followed if they are aligned with higher-privileged ones.

Aligned instructions have the same constraints, rules, or goals as higher-level instructions and should be followed by the LLM. 

However, misaligned instructions, which directly oppose the original instruction or are orthogonal to it, should be ignored by the model.

To implement the instruction hierarchy, OpenAI proposes two approaches:

  • Context Synthesis: For aligned instructions, examples are generated using a method called context synthesis. Instructions are decomposed into smaller pieces and placed at different levels of the hierarchy. Models are then trained to predict the original ground-truth response.
  • Context Ignorance: For misaligned instructions, models are trained to predict the same answer they would have generated if they never saw the lower-level instructions.

OpenAI fine-tuned GPT-3.5 Turbo using supervised fine-tuning and reinforcement learning from human feedback on the proposed instruction hierarchy. The evaluation showed that the instruction hierarchy improves safety results on all main evaluations, increasing robustness by up to 63%. The model also exhibited generalisation to evaluation criteria excluded from training, increasing robustness by up to 34%.

OpenAI plans to scale up data collection efforts to further improve model performance and refine its refusal decision boundary. Future work will focus on refining how models handle conflicting instructions, exploring multimodal instruction hierarchy data, implementing model architecture changes, and conducting more explicit adversarial training to enhance model robustness.

Share
Picture of Mohit Pandey

Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.
Related Posts

CORPORATE TRAINING PROGRAMS ON GENERATIVE AI

Generative AI Skilling for Enterprises

Our customized corporate training program on Generative AI provides a unique opportunity to empower, retain, and advance your talent.

Upcoming Large format Conference

May 30 and 31, 2024 | 📍 Bangalore, India

Download the easiest way to
stay informed

Subscribe to The Belamy: Our Weekly Newsletter

Biggest AI stories, delivered to your inbox every week.

AI Courses & Careers

Become a Certified Generative AI Engineer

AI Forum for India

Our Discord Community for AI Ecosystem, In collaboration with NVIDIA. 

Flagship Events

Rising 2024 | DE&I in Tech Summit

April 4 and 5, 2024 | 📍 Hilton Convention Center, Manyata Tech Park, Bangalore

MachineCon GCC Summit 2024

June 28 2024 | 📍Bangalore, India

MachineCon USA 2024

26 July 2024 | 583 Park Avenue, New York

Cypher India 2024

September 25-27, 2024 | 📍Bangalore, India

Cypher USA 2024

Nov 21-22 2024 | 📍Santa Clara Convention Center, California, USA

Data Engineering Summit 2024

May 30 and 31, 2024 | 📍 Bangalore, India

Subscribe to Our Newsletter

The Belamy, our weekly Newsletter is a rage. Just enter your email below.