
Creating A Paragraph Auto Generator Using GPT2 Transformers


Automation is the process of making a system operate without human intervention. As life gets busier each day, automation and automated systems draw ever more interest. By applying automation, unsafe and repetitive tasks can be made self-sufficient, saving two of the most precious things in a person's life: time and money. With many tasks being labour-intensive and time-consuming, automated systems have improved efficiency and enabled greater quality control. Automation need not be based on artificial intelligence, but with both fields rising together over the last decade, their combination may well be the next big thing. One of the most significant recent breakthroughs in AI-driven automation is Natural Language Generation.

What is Natural Language Generation?

Natural Language Generation, also known as NLG, uses artificial intelligence to produce written or spoken content. It is a subfield of artificial intelligence that automatically transforms input data into plain-English text. The fascinating thing about NLG is that the technology can help tell a story with human-like creativity, writing long sentences and paragraphs for you. NLG is used to generate product or service descriptions, curate content, create portfolio summaries, and power customer communications through chatbots. Natural language generation can be complicated and requires layers of linguistic knowledge to work. These days, NLG is being integrated into tools that speed up content strategy, thereby increasing productivity.

About Hugging Face

Hugging Face is an NLP-focused startup that maintains a large open-source community and provides an open-source library for Natural Language Processing. Its core mode of operation revolves around Transformers. This Python-based library exposes an API for many well-known architectures that achieve state-of-the-art results on NLP tasks such as text classification, information extraction, question answering, and text generation. All the architectures come with pre-trained deep learning weights that make these tasks easy to run. The transformer models come in different shapes and sizes, and each has its own way of tokenizing input data. A tokenizer splits input text into units and encodes each unit as a number, allowing faster processing.
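To make the idea concrete, here is a minimal sketch of tokenization in action (it assumes the transformers library, which we install in the next section, and uses the smaller base gpt2 checkpoint for speed):

from transformers import GPT2Tokenizer
tok = GPT2Tokenizer.from_pretrained("gpt2") #load GPT-2's byte-pair-encoding tokenizer
print(tok.encode("Hello world")) #[15496, 995] - one ID per sub-word unit
print(tok.decode([15496, 995])) #'Hello world' - decoding recovers the original text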

Getting Started with Creating a Paragraph Auto Generator

In this article, we will implement a natural language generator that produces a paragraph from a single line of input text. We will first set up our dependencies using the Hugging Face Transformers library, then load the GPT-2 model. This pre-trained model generates coherent paragraphs of text; we encode our input, generate with the model, and decode the output to obtain a paragraph.

So let’s get started with it!

The following code implementation is inspired by the official implementation, whose video tutorial you can find here

Installing our Libraries

The first step is to install the libraries this model depends on. To do this, we first install Hugging Face Transformers using the following command:

!pip install transformers #install the library from hugging face
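The GPT2LMHeadModel class used below is the PyTorch implementation of GPT-2, so PyTorch must also be available. Google Colab ships with it pre-installed; if you are running locally, you may need to install it too (an assumption about your environment, not a step from the original tutorial):

!pip install torch #only needed if PyTorch is not already installed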

Next, we will import our GPT-2 model and its tokenizer.

from transformers import GPT2LMHeadModel, GPT2Tokenizer #importing the PyTorch GPT-2 model and its tokenizer

We will first encode the input sentence into tokens using the tokenizer, then generate a new sequence of tokens from the GPT-2 model, and finally decode the generated tokens back into a sequence of words using the tokenizer again, which gives us our output.

Loading our Model

Create a new variable for the tokenizer, loading the pre-trained GPT-2 vocabulary.

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large") #using the large GPT-2 variant for better generation quality

Instantiate the pre-trained model, setting its padding token to the tokenizer's end-of-sequence token.

model = GPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id=tokenizer.eos_token_id)
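The gpt2-large checkpoint holds roughly 774 million parameters, so expect the download and load to take a while. As a quick sanity check (a small extra step, not part of the original walkthrough), you can count the parameters after loading:

num_params = sum(p.numel() for p in model.parameters()) #total weights in the network
print(f"{num_params / 1e6:.0f}M parameters") #roughly 774M for gpt2-large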

Testing the Model by Tokenizing our First Sentence

Now that the model has been created, we will test it by providing our first input sentence to tokenize. 

sentence = 'You will always succeed in Life' #input sentence

Encode it into a sequence of numbers and return them as PyTorch tensors.

input_ids = tokenizer.encode(sentence, return_tensors='pt')#using pt to return as pytorch tensors

Checking Current Progress

input_ids # checking the tensors returned

We will get the following tensor of token IDs as output:

tensor([[1639,  481, 1464, 6758,  287, 5155]])
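As a quick sanity check (an extra step beyond the original walkthrough), decoding these IDs reproduces the input sentence:

tokenizer.decode(input_ids[0]) #returns 'You will always succeed in Life'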

Generating and Decoding the Output

Create a new variable called output to hold the generated sequence, setting our hyperparameters:

output = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)

With this line, we pass in the input and cap the generated sequence at 50 tokens. We also use beam search (num_beams=5), which keeps the five most promising candidate sequences at each step and returns the highest-scoring one. Setting no_repeat_ngram_size=2 prevents any two-word sequence from appearing twice in the output, and early_stopping=True ends beam search as soon as every beam reaches the end-of-sequence token.
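Note that beam search is deterministic: the same input will always produce the same paragraph. If you want more varied output, the generate method also supports sampling; the following is a sketch of that alternative (with illustrative top_k and top_p values, not settings from this article):

sampled = model.generate(input_ids, max_length=50, do_sample=True, top_k=50, top_p=0.95) #sample from the 50 likeliest tokens within the top 95% of probability mass
print(tokenizer.decode(sampled[0], skip_special_tokens=True))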

Printing our results:

print(tokenizer.decode(output[0], skip_special_tokens=True))#printing results

We got the following output:

You will always succeed in Life, but you will never be successful in Death."
"I am not afraid of death, because I know that I am going to be with you when you die. I will be waiting for you, and I.

Generating a Longer Paragraph

We can do the same with a new sentence, tuning our hyperparameters to generate a larger paragraph. Beware: this may take longer to generate.

sentence = 'Artificial intelligence is the key'
input_ids = tokenizer.encode(sentence, return_tensors='pt')
output = model.generate(input_ids, max_length=500, num_beams=5, no_repeat_ngram_size=2, early_stopping=True) #setting max length as 500 to generate larger output text
print(tokenizer.decode(output[0], skip_special_tokens=True))

We will get the following as output:

Artificial intelligence is the key to unlocking the mysteries of the universe, but it's also the source of a lot of our problems.

In a new paper published in the journal Science Advances, a team of researchers from the University of California, Berkeley, and the National Institute of Standards and Technology (NIST) in Gaithersburg, Maryland, describes a way to create an artificial intelligence (AI) system that can learn from its mistakes and improve its performance over time. The system, which they call a "neural network," is capable of learning to recognize patterns in images, recognize objects in a video, or even learn how to play a musical instrument. In the paper, the researchers describe how they created the neural network and how it can be used to train an AI system to perform a variety of tasks, such as recognizing objects and playing musical instruments.

Neural networks, also known as deep neural networks or deep learning, are a type of machine learning algorithm that is based on the idea that a network of neurons is like a computer's processor. Each neuron is connected to a number of other neurons to form a larger network. When a neuron receives an input, it sends a signal to the next neuron, who in turn sends an output to another neuron. This process continues until all the neurons have received the input and have processed it. As a result, each neuron has its own unique set of inputs and outputs, making it possible for the network to learn and adapt to changes in its environment.

Neural networks have been used for decades to solve a wide range of problems, including image recognition, speech recognition and natural language processing. However, they have also been criticized for their poor performance when it comes to learning from their own mistakes. For example, in 2013, researchers at Google's DeepMind AI research lab published a paper in Nature that showed that they were unable to improve the performance of their network when they made a series of mistakes while training it on images of human faces. They also found that the system was not able to distinguish between a human face and a dog face, even though the images were similar in terms of size and shape. These problems have led some researchers to argue that neural nets are not as effective as they are made out to be. But the new study suggests that this is not necessarily the case. "We show that it is possible to build a neural net that learns from mistakes," said lead author and UC Berkeley professor of electrical engineering and computer.

We can clearly see the difference that hyperparameter tuning makes this time!

You can save the generated text as a text file using the following lines of code:

text = tokenizer.decode(output[0], skip_special_tokens=True)
with open('AIBLOG.txt', 'w') as f:
    f.write(text)

EndNotes

We have now learned how to create a model that generates long passages of text from a single sentence using AI and the Hugging Face library. You can tune the hyperparameters further to make the model produce better text content. The full Colab notebook for this implementation can be accessed here.

Happy Learning!



Victor Dey

Victor is an aspiring Data Scientist & is a Master of Science in Data Science & Big Data Analytics. He is a Researcher, a Data Science Influencer and also an Ex-University Football Player. A keen learner of new developments in Data Science and Artificial Intelligence, he is committed to growing the Data Science community.