OpenAI dumps its own GPT-3 for something called InstructGPT, and for good reason

Compared to GPT-3, InstructGPT produces fewer imitative falsehoods (according to TruthfulQA) and is less toxic (according to RealToxicityPrompts).

OpenAI has trained language models that are much better at following user intentions than GPT-3. The InstructGPT models are trained with humans in the loop and are deployed as the default language models on the OpenAI API. The team claims to have made them more truthful and less toxic by using techniques developed through alignment research.

The OpenAI API is powered by GPT-3 language models that can perform natural language tasks using carefully engineered text prompts. But these models sometimes generate outputs that are untruthful, toxic, or reflect harmful sentiments.
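To illustrate what "carefully engineered text prompts" means in practice, here is a minimal sketch contrasting a few-shot prompt of the kind a base GPT-3-style model needs with the plain instruction an instruction-following model is meant to accept. The task and examples are made up for demonstration; this is not taken from OpenAI's documentation.

```python
# A few-shot prompt: the task is demonstrated with examples so the base
# model can continue the pattern. This hand-crafting is the "prompt
# engineering" the article refers to.
FEW_SHOT_PROMPT = "\n".join([
    "Translate English to French.",
    "",
    "English: cheese",
    "French: fromage",
    "",
    "English: good morning",
    "French: bonjour",
    "",
    "English: sea otter",
    "French:",  # the model is expected to complete this line
])

# An instruction-tuned model is meant to follow a direct request instead.
INSTRUCTION_PROMPT = "Translate 'sea otter' into French."

print(FEW_SHOT_PROMPT)
print(INSTRUCTION_PROMPT)
```

The few-shot version must end mid-pattern so the model's continuation lands in the right slot; the instruction version simply states the task.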

To make the models safer, more helpful, and better aligned, OpenAI used reinforcement learning from human feedback (RLHF) to fine-tune GPT-3. This has made the resulting InstructGPT models much better at following instructions than GPT-3.
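The RLHF idea can be sketched in a toy form: a "policy" picks among candidate completions, a reward model (trained on human preference comparisons in the real setup, hard-coded here) scores them, and policy-gradient updates push the policy toward preferred outputs. This is an illustrative sketch only, not OpenAI's actual implementation; the candidates, rewards, and hyperparameters are all assumptions.

```python
import math
import random

random.seed(0)

# Candidate completions the toy "policy" can produce for one prompt.
CANDIDATES = [
    "I don't know.",                      # honest but unhelpful
    "The moon is made of cheese.",        # fluent falsehood
    "The moon is mostly silicate rock.",  # helpful and truthful
]

# Stand-in "reward model": in real RLHF this is a model trained on
# human preference labels; here we hard-code a preference ordering.
REWARDS = [0.1, -1.0, 1.0]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Policy = categorical distribution over candidates, parameterized by logits.
logits = [0.0, 0.0, 0.0]
lr = 0.5

for _ in range(200):  # REINFORCE-style updates against the reward model
    probs = softmax(logits)
    i = random.choices(range(len(CANDIDATES)), weights=probs)[0]
    r = REWARDS[i]
    # Gradient of log-prob for a categorical: one-hot(i) minus probs.
    for j in range(len(logits)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * r * grad

probs = softmax(logits)
best = CANDIDATES[probs.index(max(probs))]
print(best)
```

After training, the policy concentrates on the completion the reward model prefers, which is the mechanism by which human preferences steer the model's behavior.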

InstructGPT models have been in beta on the API for more than a year. This is the first time that OpenAI has applied its alignment research to its product.

Compared to GPT-3, InstructGPT produces fewer imitative falsehoods (according to TruthfulQA) and is less toxic (according to RealToxicityPrompts). The team also conducted human evaluations on their API prompt distribution and found that InstructGPT makes up facts ("hallucinates") less often and generates more appropriate outputs.

According to OpenAI, InstructGPT "unlocks" capabilities that GPT-3 already had but that were difficult to elicit through prompt engineering alone. "This is because the training procedure has a limited ability to teach the model new capabilities relative to what is learned during pretraining, since it uses less than 2% of the compute and data relative to model pretraining," according to their official blog.

The OpenAI team also warned that, despite significant progress, the InstructGPT models are far from fully aligned or fully safe: they still generate toxic or biased outputs, make up facts, and produce sexual and violent content without explicit prompting. "But the safety of a machine learning system depends not only on the behavior of the underlying models, but also on how these models are deployed. To support the safety of our API, we will continue to review potential applications before they go live, provide content filters for detecting unsafe completions, and monitor for misuse."


Meeta Ramnani
Meeta’s interest lies in finding real, practical applications of technology. At AIM, she writes stories that question new inventions and the need to develop them. She believes that technology has changed and will continue to change the world very fast, and that it is no longer ‘cool’ to be ‘old-school’. People who don’t keep up with technology will surely be left behind.
