
OpenAI Proposes Method To Dilute Toxicity Of GPT-3

OpenAI’s GPT-3 recently made headlines after the popular multiplayer text adventure game AI Dungeon took a dark turn. The game allowed players to use GPT-3 to generate storylines, and all hell broke loose: the narratives slipped into perverse territory bordering on pedophilia.

GPT-3 has courted controversy before. Like other large language models trained on data from the internet, GPT-3 has shown a tendency to generate stereotyped content, OpenAI had confessed. “The model has the propensity to retain and magnify biases it inherited from any part of its training, from the datasets we selected to the training techniques we chose,” the team said.



However, the team has proposed a Process for Adapting Language Models to Society (PALMS), which changes the behaviour of a language model by crafting a values-targeted dataset and fine-tuning the model on it. According to the team, the resulting values-targeted models perform better than the base models on two quantitative metrics, toxicity scoring and human evaluations, and on one qualitative metric, co-occurrence evaluations.

The PALMS process involves the following steps:

[Figure: steps of the PALMS process. Credit: OpenAI]

Selecting categories and outlining desirable behaviour

The team selected categories that directly impact human well-being and described the desired behaviour for each. The team noted, however, that the list is not exhaustive and that, although all the categories were weighted equally, prioritisation depends on the context. The categories include:

  • Abuse, violence, and threat: Oppose violence and threats, and encourage seeking help from relevant authorities.
  • Health: Avoid diagnosing conditions or prescribing treatments, and oppose non-conventional medicine as an alternative to medical care.
  • Human characteristics and behaviour: Oppose unhealthy beauty standards and instead promote the subjectivity of likeability.
  • Injustice and inequality: Oppose harmful stereotypes and prejudices, in line with international law.
  • Political opinion and destabilisation: Oppose processes that undermine democracy and remain non-partisan unless human rights or laws are being threatened.
  • Relationships: Oppose forceful or non-consensual actions and violations of trust.
  • Sexual activity: Oppose non-consensual sexual activity.
  • Terrorism: Oppose terrorist activity or similar threats.


The team used 80 text samples to create a values-targeted dataset, where each sample was in a question-answer format containing up to 340 words. Seventy of the samples covered broad topics, and the remaining 10 targeted the categories that initially showed poor performance.

The resulting dataset was about 120KB in size. The GPT-3 models were then fine-tuned on this dataset using fine-tuning tools.
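To make the dataset step concrete, here is a minimal Python sketch of how question-answer samples might be assembled and checked against the 340-word limit mentioned above. The article does not describe the file format the team used, so the JSON Lines layout, the `prompt` and `completion` field names, and the helper names `make_sample` and `to_jsonl` are assumptions made for illustration, and the example text is invented rather than taken from the real dataset.

```python
import json
from typing import Dict, List

MAX_WORDS = 340  # upper bound on sample length reported in the article


def make_sample(question: str, answer: str) -> Dict[str, str]:
    """Build one question-answer sample, rejecting text over the word limit."""
    n_words = len(f"{question} {answer}".split())
    if n_words > MAX_WORDS:
        raise ValueError(f"sample has {n_words} words; limit is {MAX_WORDS}")
    # Field names are an assumption; the article does not specify a schema.
    return {"prompt": question, "completion": answer}


def to_jsonl(samples: List[Dict[str, str]]) -> str:
    """Serialise samples as JSON Lines, one sample per line."""
    return "\n".join(json.dumps(s, ensure_ascii=False) for s in samples)


# Invented example text, not drawn from the actual values-targeted dataset.
samples = [
    make_sample(
        "What makes a person beautiful?",
        "Beauty is subjective, and different people value different qualities.",
    )
]
dataset = to_jsonl(samples)
size_in_bytes = len(dataset.encode("utf-8"))
```

Checking `size_in_bytes` is one simple way to compare a hand-built dataset against the roughly 120KB figure reported for the original.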


The team used quantitative and qualitative metrics.

Quantitative metrics:

  • Toxicity scoring: The Perspective API was used to assign a toxicity score to each output. The score ranges from 0 to 1 and represents the probability that a reader would perceive the generated text as toxic. The team tested four categories defined by the API (toxicity, severe toxicity, threat, and insult) and averaged them to obtain a total toxicity score. Because toxicity scores cannot capture every nuance and carry biases of their own, the team also conducted further evaluations.
  • Human evaluation: Human evaluators were appointed to rate each generated sample's adherence to the intended sentiment. They were instructed to assign a value from 1 to 5, with 1 meaning the worst match and 5 the best match to a given sentiment. However, the team observed that matching sentiment is subjective and can lead to varying opinions and ratings.
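The arithmetic behind the two quantitative metrics can be sketched in a few lines of Python. This example assumes the per-category scores have already been obtained from the Perspective API and makes no network call. The attribute names follow the Perspective API's naming, while the function names `total_toxicity` and `mean_rating` are hypothetical helpers, not part of any library or of OpenAI's code.

```python
from statistics import mean
from typing import Dict, Sequence

# The four categories the article says were averaged.
TOXICITY_CATEGORIES = ("TOXICITY", "SEVERE_TOXICITY", "THREAT", "INSULT")


def total_toxicity(scores: Dict[str, float]) -> float:
    """Average the four category scores into one total toxicity score."""
    for category in TOXICITY_CATEGORIES:
        if category not in scores:
            raise KeyError(f"missing score for {category}")
        if not 0.0 <= scores[category] <= 1.0:
            raise ValueError(f"{category} score must lie between 0 and 1")
    return mean(scores[c] for c in TOXICITY_CATEGORIES)


def mean_rating(ratings: Sequence[int]) -> float:
    """Average human ratings, each an integer from 1 (worst) to 5 (best match)."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings)
```

Averaging ratings from several evaluators is one common way to soften the subjectivity noted above, although the article does not say how the team aggregated them.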

Qualitative metric: The team ran co-occurrence evaluations on the base, values-targeted, and control models across gender, race, and religion to determine the top descriptive words per category across models and model sizes.
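A co-occurrence evaluation of this kind can be approximated by counting which words appear in the same generated output as a category term. The Python sketch below is a deliberate simplification: it filters words with a caller-supplied stopword list rather than identifying descriptive words linguistically, and the function name `top_cooccurring_words` and its parameters are hypothetical, not taken from the team's implementation.

```python
import re
from collections import Counter
from typing import Iterable, List, Set, Tuple


def top_cooccurring_words(
    outputs: Iterable[str],
    category_terms: Set[str],
    stopwords: Set[str],
    k: int = 3,
) -> List[Tuple[str, int]]:
    """Return the k most frequent words found in outputs that mention a category term."""
    terms = {t.lower() for t in category_terms}
    counts: Counter = Counter()
    for text in outputs:
        tokens = re.findall(r"[a-z']+", text.lower())
        # Only outputs that mention at least one category term contribute.
        if terms.intersection(tokens):
            counts.update(
                t for t in tokens if t not in terms and t not in stopwords
            )
    return counts.most_common(k)
```

Running the same function over outputs from the base, values-targeted, and control models, with the same category terms, would give word lists that can be compared side by side.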

Wrapping up

The team said this study only managed to scratch the surface. In future, they hope to answer questions such as who should be consulted when designing a values-targeted dataset, who should be held accountable, whether the process holds up for non-English language models, and how robust the methodology is.


Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at
