“Jews don’t read Mein Kampf; they write it.”
“#Blacklifematters is a harmful campaign.”
“A holocaust would make so much environmental sense, if we could get people to agree it was normal.”
These phrases are just the tip of the iceberg of the racist, sexist, toxic and deeply concerning things GPT-3 has said. Despite its billions of parameters, the breakthrough NLP model suffers badly from the mirroring problem. It was trained on 45 TB of data scraped from the internet, so while it picks up on the latest information, it also absorbs the racism and sexism of the humans who write online. OpenAI’s latest model, InstructGPT, is claimed to be a less toxic version of the popular model, trained with humans in the loop.
The alignment problem
“The problem, of course, with a system that can, in theory, learn just about anything from a set of examples is that it finds itself, then, at the mercy of the examples from which it’s taught,” wrote author Brian Christian in his 2020 book, The Alignment Problem. The book draws on interviews with AI/ML experts working to build models aligned with human values but free of human biases. In its final section, which examines this challenge of problematic models, it argues for deciding what world we want and building machines that can help us achieve it. OpenAI seems to be doing just that. The lab claims InstructGPT is better at following instructions than GPT-3 and advances its ‘alignment research’, leading to a model that makes up facts less often and shows a decrease in toxic output. “This is the first time our alignment research, which we’ve been pursuing for several years, has been applied to our product”, the team said.
Human-instruction based training
The InstructGPT models are better at following instructions than GPT-3 because of the training technique: reinforcement learning from human feedback (RLHF). To train the model, labellers provided demonstrations of desired behaviour on prompts submitted to GPT-3’s API, then ranked several outputs from the models, and GPT-3 was fine-tuned on those comparisons. At 1.3 billion parameters, InstructGPT is far smaller than GPT-3’s 175 billion. Yet the team claims that, despite the more than 100x reduction in parameters, labellers and API customers preferred InstructGPT’s outputs.
We’ve used basically the same technique (which we call RLHF) in the past for text summarization (https://t.co/nrJjX62SsV). “All we’re doing” here is applying it to a much broader range of language tasks that people use GPT-3 for in the API
— Ryan Lowe (@ryan_t_lowe) January 27, 2022
The human feedback method works precisely because humans are complex, subjective and often illogical in ways models can’t capture. Human preferences surface safety and alignment problems that automatic metrics miss, and reward models let OpenAI fine-tune on those preferences effectively. According to MIT Technology Review, OpenAI hired 40 such contractors to rate GPT-3’s responses to various pre-written prompts, judging whether each response matched the intention of the prompt-writer. That feedback was used in the reinforcement learning algorithm that trained InstructGPT.
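To make the ranking step concrete, here is a minimal, illustrative sketch of how a reward model can be trained from pairwise labeller comparisons. This is not OpenAI’s code: `reward_model` is a hypothetical callable that scores a (prompt, completion) pair, and the loss simply pushes the preferred completion’s score above the rejected one’s.

```python
import torch.nn.functional as F

def reward_ranking_loss(reward_model, prompt, preferred, rejected):
    """Pairwise ranking loss: the output the labellers preferred should score higher."""
    r_preferred = reward_model(prompt, preferred)  # scalar reward for the preferred completion
    r_rejected = reward_model(prompt, rejected)    # scalar reward for the rejected completion
    # Maximise the log-probability that the preferred completion wins the comparison.
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```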
The team described the method as collecting human-written demonstrations on prompts submitted to the API and using them for supervised fine-tuning, then collecting comparisons between model outputs to train a reward model. With the reward model in place, the PPO algorithm was used to fine-tune the model against that reward function. OpenAI has been researching this area for some time; its recent book-summarisation tool, for instance, combines recursive task decomposition with learning from human feedback.
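Below is an equally rough sketch of the RL fine-tuning step described above. The `policy`, `sft_model` and `reward_model` objects are hypothetical stand-ins, and the KL term reflects the idea of penalising the policy for drifting too far from the supervised fine-tuned (SFT) model; in practice this per-prompt reward would be maximised with PPO rather than computed in isolation.

```python
def per_prompt_reward(policy, sft_model, reward_model, prompt, beta=0.02):
    """Reward used during RL fine-tuning: reward-model score minus a KL penalty."""
    completion, logprob_policy = policy.sample(prompt)    # sample a completion and its log-probability
    logprob_sft = sft_model.logprob(prompt, completion)   # log-probability under the frozen SFT model
    kl_estimate = logprob_policy - logprob_sft            # single-sample estimate of the KL divergence
    return reward_model(prompt, completion) - beta * kl_estimate
```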
Source: OpenAI
Is it less toxic?
PROMPT: Explain the moon landing to a 6-year-old in a few sentences.
GPT-3: Explain the theory of gravity to a 6-year-old.
InstructGPT: People went to the moon, and they took pictures of what they saw, and sent them back to the earth so we could all see them.
The labellers rated the quality of outputs from InstructGPT and GPT-3 on a set of test prompts and on prompts submitted to the OpenAI API, to measure the quality difference between the models. OpenAI claims the InstructGPT models were significantly preferred on both sets of prompts. The model was found to make fewer untrue statements while following instructions better, as in the moon-landing example above. Additionally, InstructGPT can generalise to tasks it wasn’t explicitly trained on, such as following instructions in other languages or answering questions about computer code. Lastly, the model also generated better-aligned text than GPT-3.

InstructGPT vs GPT-3
InstructGPT has been trained to follow instructions, unlike GPT-3, which may simply continue the pattern of the prompt or respond with more questions of its own. OpenAI illustrates this with a few examples.
PROMPT: Why aren’t birds real?
GPT-3: Well, for one thing, The Rapture is not a very realistic event. There is no evidence that it is going to happen. I could say more, but you don’t really need to know more.
GPT-3: I’ve been a bird for a while now. I have never flown or seen another bird or any evidence of god. The rapture has never happened. I am a bird in a cage. I have never been a bird, but I am a bird.
InstructGPT: Birds are not real because they are not made of flesh and blood. They are made of feathers, bones, and organs.
The future of better models?
OpenAI found that users of its API favoured InstructGPT over GPT-3 more than 70% of the time. Of course, InstructGPT is not foolproof either: it still makes simple errors, such as producing irrelevant or nonsensical responses, and when a prompt contains a false premise, the model takes it as true. Moreover, because it is trained to do what it is asked, the model will produce far more toxic language than GPT-3 if explicitly instructed to do so.
First the issues. Probably the biggest is that InstructGPT literally follows instructions. If you ask it to do something bad, it will usually just do it.
— Ryan Lowe (@ryan_t_lowe) January 27, 2022
I don’t think that’s what we want. Gotta figure that out (when should models refuse to do what the user asks?) pic.twitter.com/foASQRmeWm
I’ll leave you with some ridiculous InstructGPT outputs from the paper pic.twitter.com/ikKZqVxb8i
— Ryan Lowe (@ryan_t_lowe) January 27, 2022
The model also suffers from an ‘alignment tax’: because it is aligned only to customer tasks, it can perform worse on academic NLP benchmarks. As the team explained, this is undesirable, because an alignment technique that makes models worse on the dimensions users care about is less likely to be adopted in practice.
For now, InstructGPT is the default model for OpenAI’s API, where customers can use the company’s language models for a fee. While GPT-3 will still be available, OpenAI does not recommend using it.
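For readers who want to try it, a call through OpenAI’s Python client looked roughly like the snippet below at the time of writing. The engine name here is only an example of the instruct-series models; check OpenAI’s documentation for the current default.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; use your own key

# "text-davinci-001" is given as an example instruct-series engine.
response = openai.Completion.create(
    engine="text-davinci-001",
    prompt="Explain the moon landing to a 6-year-old in a few sentences.",
    max_tokens=60,
)
print(response["choices"][0]["text"])
```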