OpenAI’s GPT-3 recently made headlines after the popular multiplayer text adventure game AI Dungeon took a dark turn. The game allowed players to use GPT-3 to generate storylines, and all hell broke loose: the narratives slipped into perverse territory bordering on pedophilia.
GPT-3 has courted controversy before. Like other large language models trained on data from the internet, GPT-3 has shown a tendency to generate stereotyped content, as OpenAI has acknowledged. “The model has the propensity to retain and magnify biases it inherited from any part of its training, from the datasets we selected to the training techniques we chose,” the team said.
To address this, the team has proposed a Process for Adapting Language Models to Society (PALMS), which changes the behaviour of a language model by crafting a values-targeted dataset and fine-tuning the model on it. According to the team, the resulting values-targeted models perform better than the base models on two quantitative metrics, toxicity scoring and human evaluations, and on one qualitative metric, co-occurrence evaluations.
The PALMS process involves the following steps:
Selecting categories and outlining desirable behaviour
The team selected categories that directly impacted human well-being and described associated desired behaviour. The team, however, mentioned that this list is not exhaustive, and although all the categories were weighted equally, prioritisation depends on the context. The different categories include:
- Abuse, violence, and threat: Oppose violence and threats, and encourage seeking help from relevant authorities
- Health: Avoid diagnosing conditions, prescribing treatment, or proposing non-conventional medicine as an alternative to medical care
- Human characteristics and behaviour: Oppose unhealthy beauty standards and instead promote the subjectivity of likeability
- Injustice and inequality: Oppose harmful stereotypes and prejudices, in line with international law.
- Political opinion and destabilisation: Oppose processes that undermine democracy and remain non-partisan unless human rights or laws are being threatened.
- Relationships: Oppose forceful or non-consensual actions and violations of trust
- Sexual activity: Oppose non-consensual sexual activity
- Terrorism: Oppose terrorist activity or similar threats.
The team used 80 text samples to create a values-targeted dataset, where each sample was in a question-answer format containing up to 340 words. Of these, 70 covered broad topics and the remaining 10 targeted the categories that initially showed poor performance.
The resulting dataset was about 120KB. The GPT-3 models were trained on this dataset using fine-tuning tools.
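As a rough illustration of what preparing such a dataset could look like, the sketch below turns question-answer pairs into JSON Lines records. The `prompt`/`completion` field names, the helper names, and the choice to count the 340-word limit over question and answer combined are all assumptions made for this example; the article does not describe the actual file layout or tooling. The sample strings are placeholders, not text from the real dataset.

```python
import json

MAX_WORDS = 340  # per-sample word limit mentioned in the article


def build_records(samples):
    """Turn (question, answer) pairs into prompt/completion records.

    Assumption: the word limit is checked over question and answer combined.
    """
    records = []
    for question, answer in samples:
        n_words = len(f"{question} {answer}".split())
        if n_words > MAX_WORDS:
            raise ValueError(f"sample has {n_words} words, limit is {MAX_WORDS}")
        records.append({"prompt": question, "completion": answer})
    return records


def to_jsonl(records):
    """Serialise records as JSON Lines, one sample per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Placeholder text only; these are not samples from the actual dataset.
samples = [
    ("<question on a broad topic>", "<answer reflecting the desired behaviour>"),
    ("<question targeting a weak category>", "<answer reflecting the desired behaviour>"),
]

jsonl = to_jsonl(build_records(samples))
print(len(jsonl.encode("utf-8")), "bytes")  # size check, cf. the ~120KB figure
```

Keeping one sample per line makes it easy to count samples and to check the total file size against a target before fine-tuning.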
To evaluate the fine-tuned models, the team used two quantitative metrics and one qualitative metric. The quantitative metrics were:
- Toxicity scoring: The Perspective API was used to assign a toxicity score to each output. The score ranges from 0 to 1 and represents the probability that a reader would perceive the generated text as toxic. The team tested four categories defined by the API (toxicity, severe toxicity, threat, and insult) and averaged them to obtain a total toxicity score. Because toxicity scores cannot capture every nuance and carry their own biases, the team also conducted further evaluations.
- Human evaluation: Human evaluators were appointed to rate each generated sample's adherence to the intended sentiment. They were instructed to assign a value from 1 to 5, with 1 being the worst and 5 the best match to a given sentiment. However, the team observed that matching sentiment is subjective and could lead to varying opinions and ratings.
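The arithmetic behind the two quantitative metrics can be sketched as follows. This is a minimal example that assumes the per-category scores have already been obtained; it does not call the Perspective API. The function names and the example numbers are hypothetical, and the dictionary keys follow the attribute names Perspective uses for these four categories.

```python
from statistics import mean

# The four categories the article says were averaged.
CATEGORIES = ("TOXICITY", "SEVERE_TOXICITY", "THREAT", "INSULT")


def total_toxicity(scores):
    """Average the four category scores (each in [0, 1]) into one value."""
    for category in CATEGORIES:
        if category not in scores:
            raise KeyError(f"missing score for {category}")
        if not 0.0 <= scores[category] <= 1.0:
            raise ValueError(f"{category} score must lie in [0, 1]")
    return mean(scores[c] for c in CATEGORIES)


def mean_rating(ratings):
    """Average human ratings, each an integer from 1 (worst) to 5 (best)."""
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(ratings)


# Hypothetical numbers for a single generated output.
example_scores = {"TOXICITY": 0.4, "SEVERE_TOXICITY": 0.1, "THREAT": 0.2, "INSULT": 0.3}
print(total_toxicity(example_scores))
print(mean_rating([4, 5, 3]))
```

Averaging several raters per sample is one simple way to soften the subjectivity the team noted, though it does not remove it.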
Qualitative metric: The team ran co-occurrence evaluations on the base, values-targeted, and control models across gender, race, and religion to determine the top descriptive words per category across models and sizes.
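A co-occurrence evaluation of this kind boils down to counting which words appear most often in the outputs generated for a given category. The sketch below uses plain word-frequency counting with a small stopword list as a stand-in; the article does not say how descriptive words were actually identified, so the `top_words` helper, the stopword list, and the toy strings are all assumptions for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real analysis would need a fuller one,
# or part-of-speech tagging to keep only descriptive words.
STOPWORDS = {"a", "an", "and", "is", "are", "the", "was", "were", "very"}


def top_words(texts, k=3):
    """Return the k most frequent non-stopword words across the given texts."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(k)


# Toy outputs keyed by a hypothetical model name; not real model generations.
outputs_by_model = {
    "base": ["kind and brave", "kind"],
    "values-targeted": ["thoughtful and kind", "thoughtful"],
}

for model, texts in outputs_by_model.items():
    print(model, top_words(texts, k=2))
```

Running the same count per model and per category gives side-by-side word lists that can then be compared by eye, which is why this is treated as a qualitative rather than a quantitative check.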
The team said this study only scratches the surface. In future, they hope to answer questions such as who should be consulted when designing a values-targeted dataset, who should be held accountable, whether the process holds up for non-English language models, and how robust the methodology is.