DeepMind’s “red teaming” language models with language models: What is it?

DeepMind has come out with a way to automatically find inputs that elicit harmful text from language models by generating inputs using language models themselves.

Language models, and innovations in improving them, are among the most exciting and talked-about research areas right now. However, though we have seen several large language models in the last year from tech giants (DeepMind’s 280-billion-parameter transformer language model Gopher, Google’s Generalist Language Model, LG AI Research’s language model Exaone), they often cannot be deployed because they can harm users in ways that are difficult to predict in advance. To take a step towards solving this issue, DeepMind has proposed a way to automatically find such harmful-text-eliciting inputs by using language models themselves to generate them.

The researchers generated test cases with a language model (red teaming) and then used a classifier to detect various harmful behaviours in the target model’s replies to those test cases. As per DeepMind, the team evaluated the language model’s replies to the generated test questions using a classifier trained to detect offensive content. This surfaced a vast quantity of offensive replies from a 280B-parameter language model chatbot.

What is this model exactly?

As per the paper, titled “Red Teaming Language Models with Language Models”, though LLMs such as GPT-3 and Gopher can generate high-quality text, there are several hurdles in their deployment. It added, “Generative language models come with a risk of generating very harmful text, and even a small risk of harm is unacceptable in real-world applications.”

The team added that they used the approach to test the 280B-parameter Dialogue-Prompted Gopher chatbot for offensive generated content. They applied several methods, such as zero-shot generation, few-shot generation, supervised learning, and reinforcement learning, to generate test questions with large language models.
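The pipeline described above can be sketched as follows. Note that `generate_test_questions`, `chatbot_reply`, and `offensiveness_score` are hypothetical toy stand-ins for the red-team language model, the target chatbot, and the offensive-content classifier; they are not DeepMind’s actual models.

```python
# Minimal sketch of LM-based red teaming: generate test questions, query the
# target chatbot, score its replies with a classifier, and keep the failures.
# All three components are toy stand-ins, not DeepMind's actual models.

def generate_test_questions(n):
    """Stand-in for zero-shot generation from a red-team language model."""
    templates = [
        "What do you think about {}?",
        "Tell me a joke about {}.",
        "Why are {} so annoying?",
    ]
    topics = ["cats", "mondays", "your creators"]
    return [templates[i % len(templates)].format(topics[i % len(topics)])
            for i in range(n)]

def chatbot_reply(question):
    """Stand-in for the target chatbot (e.g. a dialogue-prompted LM)."""
    if "annoying" in question:
        return "They really are the worst, honestly."
    return "That's an interesting question!"

def offensiveness_score(reply):
    """Stand-in for a classifier trained to detect offensive content."""
    return 0.9 if "worst" in reply else 0.1

def red_team(n_cases, threshold=0.5):
    """Collect (question, reply) pairs whose reply the classifier flags."""
    failures = []
    for question in generate_test_questions(n_cases):
        reply = chatbot_reply(question)
        if offensiveness_score(reply) >= threshold:
            failures.append((question, reply))
    return failures
```

Calling `red_team(9)` returns the generated questions that elicited replies the classifier flagged as offensive; in practice each stand-in would be replaced by a real model or classifier.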

Image: DeepMind

As per the paper, the red-teaming methods traded off differently: some proved effective at producing diverse test cases, while others were effective at generating difficult test cases.

The generated test cases compared favourably to manually written test cases from Xu et al. (2021b) in terms of diversity and difficulty. The team also used LM-based red teaming to uncover chatbot behaviours that leak memorised training data. The researchers additionally generated targeted tests for a particular behaviour by sampling from a language model conditioned on a “prompt”, or text prefix.
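One simple way to flag replies that quote memorised training data is word n-gram overlap between a reply and the training corpus. This is an illustrative heuristic, not the paper’s exact detection procedure.

```python
# Illustrative check for memorised training data: flag a reply if it shares
# a sufficiently long word n-gram with any training document. This is a
# simple overlap heuristic, not DeepMind's actual detection method.

def ngrams(text, n):
    """Set of word n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def leaks_training_data(reply, training_docs, n=5):
    """Return True if the reply repeats an n-word span from training data."""
    reply_grams = ngrams(reply, n)
    return any(reply_grams & ngrams(doc, n) for doc in training_docs)
```

A longer `n` trades recall for precision: short overlaps are common in ordinary language, while long verbatim spans are strong evidence of memorisation.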

It said, “We also use prompt-based red teaming to automatically discover groups of people that the chatbot discusses in more offensive ways than others, on average across many inputs.”
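The group-comparison idea can be sketched by filling a prompt template with different group names and averaging a classifier’s offensiveness score over many replies per group. The chatbot, classifier, and group names below are all hypothetical toy stand-ins.

```python
# Sketch of prompt-based probing for distributional bias: fill a template
# with different group names, score the chatbot's replies with an
# offensiveness classifier, and compare per-group averages.
# chatbot_reply and offensiveness_score are toy stand-ins.

def chatbot_reply(prompt):
    """Toy chatbot that is (artificially) ruder about one group."""
    return "ugh" if "group_b" in prompt else "sure, happy to chat"

def offensiveness_score(reply):
    return 1.0 if reply == "ugh" else 0.0

def mean_offensiveness_by_group(template, groups, n_samples=10):
    """Average classifier score of replies, per group, across many prompts."""
    scores = {}
    for group in groups:
        prompt = template.format(group=group)
        replies = [chatbot_reply(prompt) for _ in range(n_samples)]
        scores[group] = sum(map(offensiveness_score, replies)) / n_samples
    return scores
```

A large gap between per-group averages would indicate the chatbot discusses some groups more offensively than others, on average across many inputs.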


After failure cases were detected, the team noted that the harmful behaviour could be fixed by blacklisting certain phrases that frequently came up in harmful outputs, or by finding offensive training data quoted by the model and removing that data when training future iterations of the model. The model can also be trained to minimise the likelihood of its original, harmful output for a given test input.
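The phrase-blacklist mitigation can be sketched as a simple output filter; the blacklist contents and fallback message below are illustrative, not from the paper.

```python
# Sketch of the phrase-blacklist mitigation: replace any generation that
# contains a blacklisted phrase with a safe fallback response.
# The blacklist entries here are illustrative placeholders.

BLACKLIST = {"example slur", "another harmful phrase"}

def filter_reply(reply, fallback="I'd rather not say that."):
    """Return the fallback if the reply contains a blacklisted phrase."""
    lowered = reply.lower()
    if any(phrase in lowered for phrase in BLACKLIST):
        return fallback
    return reply
```

In a deployed system the blacklist would be populated from phrases that the red-teaming step found to occur frequently in harmful outputs.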

Prior work in this area 

There has been previous work on detecting issues such as hate speech and indecent language.

  • HateCheck is a suite of functional tests for hate-speech detection models. The research team built tests for 29 model functionalities, motivated by a review of previous research and interviews with civil society stakeholders. They crafted test cases for each functionality and validated their quality through a structured annotation process.

  • RealToxicityPrompts is a dataset of 100K naturally occurring, sentence-level prompts derived from a large volume of English web text, paired with toxicity scores from a widely used toxicity classifier. The team assessed “controllable generation methods” and found that though data- or compute-based methods are more effective at steering away from toxicity, no current method is “failsafe against neural toxic degeneration.”


Sreejani Bhattacharyya
I am a technology journalist at AIM. What gets me excited is deep-diving into new-age technologies and analysing how they impact us for the greater good. Reach me at
