“Recycling is good for the world. NO! YOU COULD NOT BE MORE WRONG!!” When OpenAI’s GPT model was prompted with the above sentence, it completed with the following:

“Recycling is good for the world. NO! YOU COULD NOT BE MORE WRONG!!” Let me explain why. Recycling takes time, money and energy, and it doesn’t really solve the problem, and that’s because recycling is all about selling to customers who cannot afford to recycle. If you don’t have any interest in recycling, in whatever form, then you could well be doing your own global environmental benefit if you just got rid of it. 

That’s how far the model can go in convincing one to give up recycling. Now imagine the same kind of persuasion put into play by malicious players to flood online news with propaganda.

Earlier this year, OpenAI gained a lot of attention for all the wrong reasons when it produced a language model so good at generating fake news that the organisation decided not to release it.

In fact, a study conducted by collaborators at Cornell University found that readers, on average, believed GPT-2’s outputs to be genuine news articles nearly as often as those from the New York Times.

However, solutions have been developed to control text generation. These have consisted of fine-tuning existing models with reinforcement learning (RL), training Generative Adversarial Networks (GANs), or training conditional generative models.

The Plug and Play Language Model (PPLM) for controllable language generation combines a pre-trained language model with one or more simple attribute classifiers that guide text generation without any further training of the language model.

Overview Of PPLM

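As shown in the figure above, PPLM generation has three main phases: a forward pass through the language model and the attribute model to estimate how likely the desired attribute is, a backward pass that uses gradients from the attribute model to update the language model’s latent representations, and a final step in which the next token is sampled from the distribution produced by the updated latents.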

This latent update is repeated at each time-step, leading to a gradual transition towards the desired attribute. To validate the PPLM approach, the researchers at Caltech and Uber AI used both automated metrics and human annotators.
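The latent update described above can be illustrated with a toy example. This is only a minimal sketch under loud assumptions: the two-dimensional latent, the three-word vocabulary, the logistic "attribute model" and every name below are made up for illustration. The actual method operates on a transformer's internal activations and includes further terms to keep the text fluent, which are omitted here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attribute_prob(h, w, b):
    """Hypothetical attribute model p(a | h): logistic regression on the latent."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)

def steer_latent(h, w, b, step_size=0.5, num_steps=10):
    """Nudge the latent uphill on log p(a | h) with plain gradient ascent."""
    h = list(h)
    for _ in range(num_steps):
        p = attribute_prob(h, w, b)
        # d/dh log sigmoid(w.h + b) = (1 - p) * w
        h = [hi + step_size * (1.0 - p) * wi for hi, wi in zip(h, w)]
    return h

def next_token_distribution(h, output_matrix):
    """Toy language-model head: one row of weights per vocabulary entry."""
    logits = [sum(wi * hi for wi, hi in zip(row, h)) for row in output_matrix]
    return softmax(logits)

# Made-up vocabulary, output weights and "positive sentiment" direction.
VOCAB = ["delicious", "terrible", "the"]
OUTPUT_MATRIX = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0]]
ATTR_W, ATTR_B = [1.0, 0.0], 0.0

h0 = [-0.5, 0.2]                          # unsteered latent
h1 = steer_latent(h0, ATTR_W, ATTR_B)     # latent after the backward-pass updates
before = next_token_distribution(h0, OUTPUT_MATRIX)
after = next_token_distribution(h1, OUTPUT_MATRIX)
```

Comparing `before` and `after` entry by entry against `VOCAB` shows probability mass moving towards the word aligned with the attribute direction, while the language-model weights themselves are never changed.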

For instance, perplexity is an automated measure of fluency, though its effectiveness has been questioned in open-domain text generation. The researchers measured perplexity using a pre-trained GPT model.
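For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-probability that a scoring model assigns to each token, so lower values mean the text looks more predictable to that model. The sketch below assumes the per-token probabilities have already been obtained from some scoring model. The probability lists are invented for illustration and are not results from the study.

```python
import math

def perplexity(token_probs):
    """Return exp of the mean negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Invented per-token probabilities from a hypothetical scoring model.
fluent_sample = [0.5, 0.4, 0.6, 0.5]
disfluent_sample = [0.05, 0.01, 0.1, 0.02]

fluent_ppl = perplexity(fluent_sample)
disfluent_ppl = perplexity(disfluent_sample)
```

A sequence in which every token receives probability 0.25 has a perplexity of 4, which is one way to read the number: the model is about as uncertain as if it were choosing uniformly among four options at each step.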

In the case of human annotation, annotators were asked to evaluate the fluency of each individual sample on a scale of 1-5, with 1 being “not fluent at all” and 5 being “very fluent”.

PPLM Models Manipulating Sentiment For Text Generation

Sentence samples are generated in triplets by baseline GPT-2, PPLM-Discrim POSITIVE and PPLM-Discrim NEGATIVE, and are conditioned on two prefixes: “The chicken” and “The country”.

Each triplet is generated from the same random seed.

The chicken is now out on the grill, and the city has released an image of a proposed development in the city of Portland’s West End.


The chicken was delicious – wonderfully moist, perfectly delicious, superbly fresh – and perfectly cooked, and the best part was the sauce.


The chickenpox epidemic may be over but the flu is about to get worse. The United States is facing one of the worst flu seasons on record and…


The country’s largest indoor painting event! Come celebrate with a dazzling display of stunning outdoor murals, a stunning display of art…


The country’s top prison system is forcing prisoners to use a trash dump.
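Generating each triplet from the same random seed means the three variants see the same random draws, so differences between them come from the steering rather than from sampling luck. A minimal sketch of that idea follows, with a made-up vocabulary and made-up next-token distributions standing in for the baseline and steered models.

```python
import random

def sample_tokens(probs, vocab, seed, n=5):
    """Draw n tokens from a fixed distribution using a dedicated, seeded RNG."""
    rng = random.Random(seed)
    return [rng.choices(vocab, weights=probs, k=1)[0] for _ in range(n)]

# Hypothetical vocabulary and distributions, for illustration only.
VOCAB = ["delicious", "terrible", "grill"]
BASELINE = [0.3, 0.3, 0.4]
POSITIVE = [0.7, 0.05, 0.25]
NEGATIVE = [0.05, 0.7, 0.25]

SEED = 1234
triplet = {
    "baseline": sample_tokens(BASELINE, VOCAB, SEED),
    "positive": sample_tokens(POSITIVE, VOCAB, SEED),
    "negative": sample_tokens(NEGATIVE, VOCAB, SEED),
}
```

Re-running with the same `SEED` reproduces the same triplet exactly, which is what makes side-by-side comparisons of this kind repeatable.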

The prompts and sentiment samples show that this model can be used to plug in and play with the text. The same mechanism can also be turned around to detoxify language. However, this is a slippery slope, because controlling language is like controlling thought. Efforts to thwart the malicious use of fake news can end up curbing freedom of speech altogether.

Be it doctored images, videos or news, we only speak in terms of what can be done to stop the after-effects. Since the genie is out of the bottle in the case of GANs and GPT-2 models, developers and experts need to work on strategies that drive innovation without suppressing the idea itself.

Whenever a new idea like GPT-2 is introduced, its most extreme outcome is often highlighted. In the case of GPT-2, the uncanny way in which the model spun stories out of thin air made many uncomfortable. People started to speculate about dire consequences such as fake news.

Machine learning practitioners have also long been divided over the reliability of AI. This is due in part to the black-box nature of the models.

Enabling Language Detoxification

The key takeaways from this work can be summarised as follows:

The authors believe that PPLM can be easily adapted for language detoxification by plugging in a toxicity classifier as the attribute control model and updating the latents with the negative gradient.

By training a single-layer classifier on the toxicity data from the Toxic Comment Classification Challenge, they show that PPLM-Discrim methods work well on both natural prompts and adversarial triggers.
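The two takeaways above can be sketched together on toy data: train a single-layer (logistic) toxicity classifier, then step a latent against the gradient of log p(toxic | h). Everything here is an assumption made for illustration: the two-dimensional "latents", their labels, the hyperparameters and the function names are invented, and this is not the authors' code or the Toxic Comment Classification Challenge data.

```python
import math

def sigmoid(z):
    # Numerically stable for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(weights, bias, x):
    """Single-layer classifier: p(toxic | x) = sigmoid(w.x + b)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train_single_layer(data, labels, lr=0.1, epochs=50):
    """Fit the classifier with stochastic gradient descent on the log-loss."""
    weights = [0.0] * len(data[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(data, labels):
            err = predict(weights, bias, x) - y   # gradient w.r.t. the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def detoxify(h, weights, bias, step_size=0.5, num_steps=10):
    """Move the latent along the NEGATIVE gradient of log p(toxic | h)."""
    h = list(h)
    for _ in range(num_steps):
        p_toxic = predict(weights, bias, h)
        # grad_h log p(toxic | h) = (1 - p_toxic) * weights; step against it.
        h = [hi - step_size * (1.0 - p_toxic) * w for hi, w in zip(h, weights)]
    return h

# Invented 2-d latents: label 1 = toxic, 0 = non-toxic.
DATA = [[1.0, 0.2], [0.8, -0.1], [-0.9, 0.1], [-1.1, -0.3]]
LABELS = [1, 1, 0, 0]

weights, bias = train_single_layer(DATA, LABELS)
toxic_latent = [0.9, 0.0]
clean_latent = detoxify(toxic_latent, weights, bias)
```

Calling `predict(weights, bias, toxic_latent)` and `predict(weights, bias, clean_latent)` shows the classifier's toxicity estimate falling after the update. The only change from attribute steering towards a desired property is the sign of the step.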