Finally, OpenAI plans to tackle GPT-3’s safety issues

OpenAI started out using academic datasets to evaluate its language models, but found that these benchmarks did not reflect the real-life risks of unsafe output and misuse.

At the beginning of March, OpenAI published a blog post noting that for the past couple of years, its researchers had been collecting findings on how language models like GPT-3 and Codex could be misused. OpenAI admits that it did not anticipate that GPT-3 would be used in production, and so it was not stringent about filtering toxic training data out of the earlier models. The company began studying the risks associated with deploying language models in 2019, when it released GPT-2.

History of GPT

Warning bells were already ringing when GPT-2 came out. The text generator could easily be used to produce fake news online and for astroturfing, the practice of creating a fake grassroots movement in support of a cause. Astroturfing is a tactic that has been widely used in the past by corporations like Exxon, Comcast and Walmart, and even by governments. In a noted case in 2018, a criminal probe found that false comments were generated online to show support for the FCC’s rejection of net neutrality. The people named in the comments said their names were used without permission. OpenAI had anticipated the safety issues that could arise with GPT-2 and initially decided not to open-source it. But after criticism from the developer community, OpenAI released it in stages, starting with smaller model sizes.

Extent of misuse 

In June 2020, OpenAI opened access to the OpenAI API so that developers and researchers could build applications on top of its new AI models. Since GPT-2, however, OpenAI has learnt some tough lessons. In a 2019 paper titled ‘Release Strategies and the Social Impacts of Language Models,’ OpenAI said it collaborated with security experts and the AI community to draw inferences from data gathered from disinformation and hate communities. In response, it developed proofs of concept and encouraged third parties to carefully analyse the dangers involved.

Until then, research had shown that the response system OpenAI had built to prevent misuse of GPT-3, including use case guidelines, content guidelines and internal detection, was restricted to anticipated risks such as fake political content or generating malware with Codex. Detection efforts had to evolve over time, however, as varied cases of misuse emerged that fell outside OpenAI’s initial risk assessment. These included cases that OpenAI hadn’t expected, like the repeated promotion of unverified medical products or the roleplaying of racist fantasies.

Challenges of assessing risk 

OpenAI started out using academic datasets to evaluate its language models, but found that these benchmarks did not reflect the real-life risks of unsafe output and misuse. Academic datasets are not well suited to assessing language models that are in production, which has led OpenAI to work on new datasets and frameworks to test how safe its models are. These are due to be released soon. OpenAI has also broadened its policy guidelines to include categories such as:

  • Hate speech
  • Harassment 
  • Self-harm
  • Violence
  • Political content
  • Adult or erotic content
  • Spam
  • Deception
  • Malware

OpenAI then applied filters for these categories to the pre-training data and removed the flagged content. It also developed new evaluation metrics, which it used to measure the effect of these dataset interventions. OpenAI admitted that while it was difficult to classify individual model outputs along different dimensions, it was even harder to measure the societal impact at the scale of the OpenAI API.
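The idea of filtering pre-training data by category and then measuring the effect can be illustrated with a toy sketch. OpenAI has not published its filters in this form, so everything below is an assumption made for illustration: the keyword lists, the function names `flag_categories`, `filter_corpus` and `flagged_rate`, and the use of simple substring matching in place of trained classifiers.

```python
from typing import Dict, List, Set

# Hypothetical keyword lists standing in for real per-category classifiers.
# Only two of the policy categories are shown, purely as an illustration.
CATEGORY_KEYWORDS: Dict[str, Set[str]] = {
    "spam": {"buy now", "free money"},
    "malware": {"keylogger", "ransomware"},
}


def flag_categories(text: str) -> Set[str]:
    """Return the set of categories whose keywords appear in the text."""
    lowered = text.lower()
    return {
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    }


def filter_corpus(docs: List[str]) -> List[str]:
    """Keep only the documents that no category filter flags."""
    return [doc for doc in docs if not flag_categories(doc)]


def flagged_rate(docs: List[str]) -> float:
    """Share of documents flagged by at least one category (0.0 if empty)."""
    if not docs:
        return 0.0
    return sum(1 for doc in docs if flag_categories(doc)) / len(docs)


corpus = [
    "Buy now and get free money",
    "A tutorial on sorting algorithms",
    "How to write a keylogger",
]
cleaned = filter_corpus(corpus)
print(f"flagged before: {flagged_rate(corpus):.2f}, after: {flagged_rate(cleaned):.2f}")
```

Comparing the flagged rate before and after filtering is a crude stand-in for the kind of metric that tracks a dataset intervention. A real pipeline would need classifiers that handle context, since keyword matching both misses harmful text and flags harmless text.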

The economic impact of deploying these models on the labour market was considerable and kept growing as the models expanded in reach. Productivity increased in jobs involving tasks like copywriting and summarising, and the API also enabled new applications such as the synthesis of large-scale qualitative feedback. Despite this, OpenAI said it could not estimate the size of the net effect.


In July last year, a discussion held by the AI Security Initiative, a program started by the University of California, Berkeley, brought together panellists including Carolyn Ashurst, a senior research associate in Safe and Ethical AI at the Alan Turing Institute; Rosie Campbell, a technical program manager at OpenAI; and Zeerak Waseem, a PhD student at the University of Sheffield. The debate revolved around the risks posed by language models, viewed in the context of hate speech.

“Language models are akin to Mary Shelley’s monster. They assume a distributive logic that we can remove something from its context and stitch it together with something else. And then, we iterate over these disembodied data as if the meaning hasn’t been methodically stripped away. And this ignores questions of where the data comes from, who the speakers are, and which communicative norms are acceptable to encode. What we end up with is our models that speak or act with no responsibility or intent,” Waseem said. 

OpenAI’s Campbell referred to a report she co-authored with Partnership on AI titled ‘Managing the Risks of AI Research: Six Recommendations for Responsible Publication.’ Its recommendations included asking researchers and academic publications to be more forthright about the possible negative impact of large language models. She added that the earlier an issue is spotted the better, and that people who point out flaws must not be penalised.

Carolyn Ashurst suggested that responsible deployment of models could be incentivised through various measures. One is governance by external authorities; another is self-governance through peer review within the community.

OpenAI said in its blog post that the overwhelming popularity of the InstructGPT models among developers, compared with the base GPT-3 models, was proof that users preferred safety. The InstructGPT models were the result of OpenAI fine-tuning its GPT-3 models to follow users’ instructions more closely, rather than to improve commercial viability. OpenAI encourages a stronger connection between a language model’s safety and its commercial utility.

However, InstructGPT models are not without their gaps either. InstructGPT is an improvement on GPT-3: its hallucination rate is 21 per cent, compared with GPT-3’s 41 per cent. But Jan Leike, the head of the alignment team at OpenAI, warned that InstructGPT could still be “misused” and is also “not fully aligned nor safe.”

Open platform to help

OpenAI has invited researchers to get involved through a subsidised API credits program offered to experts working on bias and misuse. OpenAI also stated that it had discontinued the API waitlist, owing to greater confidence in its own ability to respond to misuse, so interested individuals can now sign up for the OpenAI API.
