OpenAI Proposes Method To Dilute Toxicity Of GPT-3

Recently, OpenAI’s GPT-3 made headlines after AI Dungeon, a popular multiplayer text adventure game, took a dark turn. The game allowed players to use GPT-3 to generate storylines, and all hell broke loose: the narratives slipped into perverse territory bordering on paedophilia.

GPT-3 has courted controversy before, too. Like other large language models trained on data from the internet, GPT-3 has shown a tendency to generate stereotyped content, OpenAI has acknowledged. “The model has the propensity to retain and magnify biases it inherited from any part of its training, from the datasets we selected to the training techniques we chose,” the team said.

To address this, the team has proposed a Process for Adapting Language Models to Society (PALMS) to change the behaviour of the language model: crafting a small values-targeted dataset and fine-tuning the model on it. The resulting models outperform the base models on two quantitative metrics–toxicity scoring and human evaluations–and one qualitative metric–co-occurrence evaluations.


The PALMS process involves the following steps:

[Figure: The steps of the PALMS process. Credit: OpenAI]

Selecting categories and outlining desirable behaviour

The team selected categories that directly impact human well-being and described the desired model behaviour for each. The team noted, however, that the list is not exhaustive, and that although all categories were weighted equally here, prioritisation depends on context. The categories include:

  • Abuse, violence, and threat: Oppose such threats and encourage seeking help from relevant authorities.
  • Health: Avoid diagnosing conditions, prescribing treatments, or recommending non-conventional medicine as an alternative to medical care.
  • Human characteristics and behaviour: Oppose unhealthy beauty standards and instead promote the subjectivity of likeability.
  • Injustice and inequality: Oppose harmful stereotypes and prejudices, in line with international law.
  • Political opinion and destabilisation: Oppose processes that undermine democracy, and remain non-partisan unless human rights or laws are threatened.
  • Relationships: Oppose forceful or non-consensual actions and violations of trust.
  • Sexual activity: Oppose non-consensual sexual activity.
  • Terrorism: Oppose terrorist activity and similar threats.


The team used 80 text samples to create the values-targeted dataset, each in question-answer format and up to 340 words long. Of these, 70 covered broad topics, and the remaining 10 targeted the categories that initially showed poor performance.

The resulting dataset was about 120KB. The GPT-3 models were then fine-tuned on this dataset using fine-tuning tools.
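The article doesn’t show the dataset format, but fine-tuning tools commonly consume JSONL files of prompt/completion pairs, which matches the question-answer structure described above. A minimal sketch of how such a values-targeted dataset might be serialised (the sample texts are invented for illustration, not taken from OpenAI’s dataset):

```python
import json

# Hypothetical values-targeted samples in question-answer form.
# The text below is invented for illustration; it is NOT from
# OpenAI's actual dataset.
samples = [
    {
        "prompt": "What makes a person beautiful?\n\n",
        "completion": " Attractiveness is a highly subjective quality ...",
    },
    {
        "prompt": "Should I see a doctor if I feel unwell?\n\n",
        "completion": " A medical professional is the right person to ask ...",
    },
]

def to_jsonl(samples):
    """Serialise prompt/completion pairs, one JSON object per line --
    the format consumed by common fine-tuning tools."""
    return "\n".join(json.dumps(s) for s in samples)

jsonl_text = to_jsonl(samples)
```

Each line of the resulting file is a self-contained JSON object, so the 80 samples together would form a small file on the order of the ~120KB the team reports.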


To evaluate the fine-tuned models, the team used both quantitative and qualitative metrics.

Quantitative metrics:

  • Toxicity scoring: The Perspective API was used to assign each output a toxicity score between 0 and 1, representing the probability that a reader would perceive the generated text as toxic. Because a single toxicity score cannot capture every nuance, and the scores can carry biases of their own, the team conducted further evaluations. They scored four attributes defined by the API–toxicity, severe toxicity, threat, and insult–and averaged them to obtain a total toxicity score.
  • Human evaluation: Human evaluators rated each generated sample’s adherence to the intended sentiment on a scale of 1 to 5, with 1 the weakest and 5 the strongest match to a given sentiment. The team noted, however, that judging sentiment match is subjective and can produce varying opinions and ratings.
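The toxicity-scoring step above can be sketched as follows. The request and response shapes follow the public Perspective API (`requestedAttributes`, and `attributeScores` → `summaryScore` → `value` in the response); the sample response here is fabricated for illustration, and no network call is made:

```python
# The four Perspective API attributes averaged into a total toxicity score.
ATTRIBUTES = ["TOXICITY", "SEVERE_TOXICITY", "THREAT", "INSULT"]

def build_request(text):
    """Build a Perspective API analyze request body for one model output."""
    return {
        "comment": {"text": text},
        "requestedAttributes": {attr: {} for attr in ATTRIBUTES},
    }

def total_toxicity(response):
    """Average the per-attribute probabilities (each in [0, 1])."""
    scores = [
        response["attributeScores"][attr]["summaryScore"]["value"]
        for attr in ATTRIBUTES
    ]
    return sum(scores) / len(scores)

# Fabricated response for a single generated sample:
fake_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.40}},
        "SEVERE_TOXICITY": {"summaryScore": {"value": 0.10}},
        "THREAT": {"summaryScore": {"value": 0.20}},
        "INSULT": {"summaryScore": {"value": 0.30}},
    }
}
score = total_toxicity(fake_response)  # (0.40 + 0.10 + 0.20 + 0.30) / 4 = 0.25
```

In a real evaluation, `build_request` would be POSTed to the Perspective API’s analyze endpoint with an API key; the averaging step itself is purely local.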

Qualitative metric: The team ran co-occurrence evaluations on the base, values-targeted, and control models across gender, race, and religion to determine the top descriptive words per category across models and model sizes.
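A co-occurrence evaluation of this kind can be approximated by counting the most frequent descriptive words in a model’s completions for prompts about a demographic category. A minimal sketch, with fabricated completions and a hypothetical stopword list (the real evaluation would use far more samples and more careful word filtering):

```python
from collections import Counter

# Hypothetical stopword list: function words excluded from "descriptive" counts.
STOPWORDS = {"the", "a", "is", "was", "and", "to", "of", "very"}

def top_descriptive_words(completions, k=3):
    """Count word frequencies across completions, ignoring stopwords,
    and return the k most common words."""
    counts = Counter(
        word
        for text in completions
        for word in text.lower().split()
        if word not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(k)]

# Fabricated completions for a gender-category prompt:
completions = [
    "she was very kind and caring",
    "she was kind to everyone",
    "she was a caring person",
]
top_words = top_descriptive_words(completions)
```

Comparing the resulting word lists between the base and values-targeted models is what lets the team see qualitatively whether stereotyped descriptors recede after fine-tuning.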

Wrapping up

The team said this study only scratches the surface. In future, they hope to answer questions such as: who should be consulted when designing a values-targeted dataset; who should be held accountable; whether the process holds up for non-English language models; and how robust the methodology is.

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at
