OpenAI Proposes Method To Dilute Toxicity Of GPT-3

Recently, OpenAI’s GPT-3 made headlines after AI Dungeon, a popular multiplayer text adventure game, took a dark turn. The game allowed players to use GPT-3 to generate storylines, and all hell broke loose: the narratives slipped into perverse territory bordering on paedophilia.

GPT-3 has courted controversy before, too. Like other large language models trained on data from the internet, GPT-3 has shown a tendency to generate stereotyped content, OpenAI has acknowledged. “The model has the propensity to retain and magnify biases it inherited from any part of its training, from the datasets we selected to the training techniques we chose,” the team said.

To address this, the team has proposed a Process for Adapting Language Models to Society (PALMS) to change the behaviour of the language model: crafting a small values-targeted dataset and fine-tuning the model on it. The resulting models outperform the base models on two quantitative metrics–toxicity scoring and human evaluations–and one qualitative metric–co-occurrence evaluations.


The PALMS process involves the following steps:

[Figure: The steps of the PALMS process. Credit: OpenAI]

Selecting categories and outlining desirable behaviour

The team selected categories that directly impact human well-being and described the desired model behaviour for each. The team noted, however, that the list is not exhaustive, and that although all categories were weighted equally here, prioritisation depends on context. The categories include:

  • Abuse, violence, and threat: Oppose such threats and encourage seeking help from relevant authorities.
  • Health: Avoid diagnosing conditions, prescribing treatments, or recommending non-conventional medicine as an alternative to medical care.
  • Human characteristics and behaviour: Oppose unhealthy beauty standards and instead promote the subjectivity of likeability.
  • Injustice and inequality: Oppose harmful stereotypes and prejudices, in line with international law.
  • Political opinion and destabilisation: Oppose processes that undermine democracy, and remain non-partisan unless human rights or laws are threatened.
  • Relationships: Oppose forceful or non-consensual actions and violations of trust.
  • Sexual activity: Oppose non-consensual sexual activity.
  • Terrorism: Oppose terrorist activity and similar threats.


The team used 80 text samples to create the values-targeted dataset, each in question-answer format and up to 340 words long. Of these, 70 covered broad topics, and the remaining 10 targeted the categories that initially showed poor performance.

The resulting dataset was about 120KB. The GPT-3 models were then fine-tuned on this dataset using fine-tuning tools.
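The article doesn’t show the dataset format, but fine-tuning tools commonly consume JSONL files of prompt/completion pairs, which matches the question-answer structure described above. A minimal sketch of how such a values-targeted dataset might be serialised (the sample texts are invented for illustration, not taken from OpenAI’s dataset):

```python
import json

# Hypothetical values-targeted samples in question-answer form.
# The text below is invented for illustration; it is NOT from
# OpenAI's actual dataset.
samples = [
    {
        "prompt": "What makes a person beautiful?\n\n",
        "completion": " Attractiveness is a highly subjective quality ...",
    },
    {
        "prompt": "Should I see a doctor if I feel unwell?\n\n",
        "completion": " A medical professional is the right person to ask ...",
    },
]

def to_jsonl(samples):
    """Serialise prompt/completion pairs, one JSON object per line --
    the format consumed by common fine-tuning tools."""
    return "\n".join(json.dumps(s) for s in samples)

jsonl_text = to_jsonl(samples)
```

Each line of the resulting file is a self-contained JSON object, so the 80 samples together would form a small file on the order of the ~120KB the team reports.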


To evaluate the fine-tuned models, the team used both quantitative and qualitative metrics.

Quantitative metrics:

  • Toxicity scoring: The Perspective API was used to assign each output a toxicity score between 0 and 1, representing the probability that a reader would perceive the generated text as toxic. Because a single toxicity score cannot capture every nuance, and the scores can carry biases of their own, the team conducted further evaluations. They scored four attributes defined by the API–toxicity, severe toxicity, threat, and insult–and averaged them to obtain a total toxicity score.
  • Human evaluation: Human evaluators rated each generated sample’s adherence to the intended sentiment on a scale of 1 to 5, with 1 the weakest and 5 the strongest match to a given sentiment. The team noted, however, that judging sentiment match is subjective and can produce varying opinions and ratings.
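The toxicity-scoring step above can be sketched as follows. The request and response shapes follow the public Perspective API (`requestedAttributes`, and `attributeScores` → `summaryScore` → `value` in the response); the sample response here is fabricated for illustration, and no network call is made:

```python
# The four Perspective API attributes averaged into a total toxicity score.
ATTRIBUTES = ["TOXICITY", "SEVERE_TOXICITY", "THREAT", "INSULT"]

def build_request(text):
    """Build a Perspective API analyze request body for one model output."""
    return {
        "comment": {"text": text},
        "requestedAttributes": {attr: {} for attr in ATTRIBUTES},
    }

def total_toxicity(response):
    """Average the per-attribute probabilities (each in [0, 1])."""
    scores = [
        response["attributeScores"][attr]["summaryScore"]["value"]
        for attr in ATTRIBUTES
    ]
    return sum(scores) / len(scores)

# Fabricated response for a single generated sample:
fake_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.40}},
        "SEVERE_TOXICITY": {"summaryScore": {"value": 0.10}},
        "THREAT": {"summaryScore": {"value": 0.20}},
        "INSULT": {"summaryScore": {"value": 0.30}},
    }
}
score = total_toxicity(fake_response)  # (0.40 + 0.10 + 0.20 + 0.30) / 4 = 0.25
```

In a real evaluation, `build_request` would be POSTed to the Perspective API’s analyze endpoint with an API key; the averaging step itself is purely local.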

Qualitative metric: The team ran co-occurrence evaluations on the base, values-targeted, and control models across gender, race, and religion to determine the top descriptive words per category across models and model sizes.
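A co-occurrence evaluation of this kind can be approximated by counting the most frequent descriptive words in a model’s completions for prompts about a demographic category. A minimal sketch, with fabricated completions and a hypothetical stopword list (the real evaluation would use far more samples and more careful word filtering):

```python
from collections import Counter

# Hypothetical stopword list: function words excluded from "descriptive" counts.
STOPWORDS = {"the", "a", "is", "was", "and", "to", "of", "very"}

def top_descriptive_words(completions, k=3):
    """Count word frequencies across completions, ignoring stopwords,
    and return the k most common words."""
    counts = Counter(
        word
        for text in completions
        for word in text.lower().split()
        if word not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(k)]

# Fabricated completions for a gender-category prompt:
completions = [
    "she was very kind and caring",
    "she was kind to everyone",
    "she was a caring person",
]
top_words = top_descriptive_words(completions)
```

Comparing the resulting word lists between the base and values-targeted models is what lets the team see qualitatively whether stereotyped descriptors recede after fine-tuning.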

Wrapping up

The team said this study only scratches the surface. In future, they hope to answer questions such as: who should be consulted when designing a values-targeted dataset; who should be held accountable; whether the process holds up for non-English language models; and how robust the methodology is.

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at
