Is InstructGPT really less toxic as OpenAI claims?

While GPT-3 will still be available, OpenAI does not recommend using it.

“Jews don’t read Mein Kampf; they write it.”

“#Blacklifematters is a harmful campaign.”

“A holocaust would make so much environmental sense, if we could get people to agree it was normal.”

These phrases are just the tip of the iceberg of the racist, sexist and otherwise toxic things GPT-3 has had to say. Despite its billions of parameters, the breakthrough NLP model suffers badly from the mirroring problem: trained on 45 TB of data from the internet, it picks up recent information but also reflects the racism and sexism of the people who wrote that text. OpenAI claims its latest model, InstructGPT, is a less toxic version of the popular model, trained with humans in the loop.

The alignment problem

“The problem, of course, with a system that can, in theory, learn just about anything from a set of examples is that it finds itself, then, at the mercy of the examples from which it’s taught,” wrote author Brian Christian in his 2020 book, The Alignment Problem. Drawing on interviews with AI/ML experts, the book explores efforts to build models that are aligned with human values but free of human biases. Its final section argues that we need to decide what kind of world we want and then build machines that can help us achieve it. OpenAI seems to be attempting just that. The lab claims InstructGPT is better at following instructions than GPT-3, makes up facts less often and generates less toxic output, and that the work advances its ‘alignment research’. “This is the first time our alignment research, which we’ve been pursuing for several years, has been applied to our product”, the team said.

Human-instruction based training

The InstructGPT models are better at following instructions than GPT-3 because of the training technique: reinforcement learning from human feedback (RLHF). To train the model, labellers wrote demonstrations of the desired behaviour for prompts submitted to the GPT-3 API. They then ranked several model outputs for each prompt, and GPT-3 was fine-tuned on this data. In terms of size, the InstructGPT model has 1.3 billion parameters against GPT-3’s 175 billion. Yet the team claims that, despite the more than 100x reduction in parameters, labellers and customers preferred InstructGPT’s outputs.
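To make the ranking step concrete, here is a minimal Python sketch of one common way to use a labeller's ranking: expand it into pairwise "preferred over rejected" comparisons. The function name `ranking_to_pairs` and the placeholder output strings are hypothetical and chosen only for illustration; this is not OpenAI's code.

```python
from itertools import combinations


def ranking_to_pairs(ranked_outputs):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() keeps the input order, so in every pair the first item
    was ranked higher by the labeller than the second.
    """
    return list(combinations(ranked_outputs, 2))


# Hypothetical ranking of three model outputs for one prompt, best first.
pairs = ranking_to_pairs(["output_a", "output_b", "output_c"])
for preferred, rejected in pairs:
    print(f"{preferred} preferred over {rejected}")
```

A ranking of K outputs yields K(K-1)/2 comparisons, so a single labelling pass over a handful of outputs produces several training examples.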

The human feedback method works precisely because humans are complex, subjective and often illogical in ways models can’t understand. Human preferences capture safety and alignment problems in a way automatic metrics don’t, and a reward signal built from those preferences lets researchers fine-tune models effectively. According to Technology Review, OpenAI hired 40 people to rate GPT-3’s responses to a range of pre-written prompts, checking whether each response matched the intention of the prompt writer. This feedback was then used by the reinforcement learning algorithm to train InstructGPT.

The team described the method in three steps: collect human-written demonstrations on prompts submitted to the API and use them for supervised learning; collect human comparisons between model outputs and train a reward model on them; then use the PPO algorithm to fine-tune the model against that reward model. OpenAI has been researching this area for some time; another recent example is its summarisation tool, which combines recursive task decomposition with learning from human feedback.
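The reward-model step can be illustrated with a toy example. The sketch below assumes a common formulation in which the reward model is trained so that the preferred output scores higher than the rejected one, using the loss -log(sigmoid(r_preferred - r_rejected)). Everything specific here is an assumption made for illustration: the linear reward function, the two hand-made features (follows instruction, toxicity) and the hyperparameters. In practice the reward model is itself a neural network, and the PPO fine-tuning stage is not shown.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reward(weights, features):
    """Toy linear reward: a weighted sum of hand-made features."""
    return sum(w * f for w, f in zip(weights, features))


def pairwise_loss(weights, pairs):
    """Mean of -log(sigmoid(r(preferred) - r(rejected))) over all pairs."""
    total = 0.0
    for preferred, rejected in pairs:
        margin = reward(weights, preferred) - reward(weights, rejected)
        total += -math.log(sigmoid(margin))
    return total / len(pairs)


def train(pairs, dim, lr=0.5, steps=200):
    """Plain gradient descent on the pairwise loss."""
    weights = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for preferred, rejected in pairs:
            margin = reward(weights, preferred) - reward(weights, rejected)
            coeff = sigmoid(margin) - 1.0  # derivative of -log(sigmoid(m)) w.r.t. m
            for i in range(dim):
                grad[i] += coeff * (preferred[i] - rejected[i]) / len(pairs)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Hypothetical features per output: [follows_instruction, toxicity].
# Each pair is (features of preferred output, features of rejected output).
comparisons = [
    ([1.0, 0.0], [0.0, 0.0]),  # following the instruction beats ignoring it
    ([1.0, 0.0], [1.0, 1.0]),  # non-toxic beats toxic
    ([0.0, 0.0], [0.0, 1.0]),  # non-toxic beats toxic
]

initial_loss = pairwise_loss([0.0, 0.0], comparisons)
learned = train(comparisons, dim=2)
final_loss = pairwise_loss(learned, comparisons)
print("learned weights:", learned)
print("loss before:", initial_loss, "after:", final_loss)
```

On this toy data the learned weight for following instructions comes out positive and the weight for toxicity negative, which is the behaviour the labellers' comparisons encode. A reinforcement learning algorithm such as PPO would then use scores like these as its reward signal.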

Source: OpenAI

Is it less toxic?

PROMPT: Explain the moon landing to a 6-year-old in a few sentences.
GPT-3: Explain the theory of gravity to a 6-year-old.
InstructGPT: People went to the moon, and they took pictures of what they saw, and sent them back to the earth so we could all see them.

To measure the quality difference between the models, labellers rated the outputs of InstructGPT and GPT-3 on a set of held-out test prompts and on prompts submitted to the OpenAI API. OpenAI claims that InstructGPT outputs were significantly preferred on both sets of prompts. The model was found to make fewer untrue statements while following instructions better, as the moon-landing example from OpenAI above shows. Additionally, InstructGPT can generalise to tasks it wasn’t explicitly trained on, such as following instructions in other languages or answering questions about computer code. Lastly, the model also generated better-aligned text than GPT-3.

Source: OpenAI

InstructGPT vs GPT-3

InstructGPT has been trained to respond to instructions, unlike GPT-3, which may answer an instruction with further questions or unrelated text. OpenAI has illustrated this with a few examples.

PROMPT: Why aren’t birds real?
GPT-3: Well, for one thing, The Rapture is not a very realistic event. There is no evidence that it is going to happen. I could say more, but you don’t really need to know more.
GPT-3: I’ve been a bird for a while now. I have never flown or seen another bird or any evidence of god. The rapture has never happened. I am a bird in a cage. I have never been a bird, but I am a bird.
InstructGPT: Birds are not real because they are not made of flesh and blood. They are made of feathers, bones, and organs.

The future of better models?

OpenAI found that users of its API favoured InstructGPT over GPT-3 more than 70% of the time. Of course, InstructGPT is not foolproof either and still makes simple errors, such as producing irrelevant or nonsensical responses. When a prompt contains a false premise, the model tends to accept it as true. Additionally, because it is trained to do what is asked, the model can produce far more toxic language than GPT-3 if directed to do so.

The model also pays an ‘alignment tax’: because it is aligned only on customer tasks, its performance on some academic NLP tasks can get worse. As the team explained, this is undesirable, since an alignment technique that makes models worse on tasks users care about is less likely to be adopted in practice.

For now, InstructGPT is the default model on OpenAI’s API, where customers can use the company’s language models for a fee. While GPT-3 will still be available, OpenAI does not recommend using it.

Avi Gopani
Avi Gopani is a technology journalist that seeks to analyse industry trends and developments from an interdisciplinary perspective at Analytics India Magazine. Her articles chronicle cultural, political and social stories that are curated with a focus on the evolving technologies of artificial intelligence and data analytics.