
Beginner's Guide To Text Generation With RNNs


Text Generation is a task in Natural Language Processing (NLP) in which text is generated under some constraints, such as a set of initial characters or initial words. We come across this task in day-to-day applications such as character, word, and sentence predictions while typing in Gmail, Google Docs, smartphone keyboards, and chatbots. Understanding text generation forms the basis for advanced NLP tasks such as Neural Machine Translation.

This article discusses the text generation task to predict the next character given its previous characters. It employs a recurrent neural network with LSTM layers to achieve the task. The deep learning process will be carried out using TensorFlow’s Keras, a high-level API.

Let’s dive deeper into hands-on learning.

Create the Environment

Import the necessary frameworks, libraries and modules to create the required Python environment. Since we work with text data, an Embedding layer is required, and since we build an LSTM-based recurrent neural network, an LSTM layer is required. In addition, a Dense layer will be used to build the classification head.

 import numpy as np
 import tensorflow as tf
 from tensorflow import keras
 from tensorflow.keras.layers import Dense, LSTM, Embedding
 import matplotlib.pyplot as plt 

Download Text Data

We need text data to train our model. TensorFlow’s data collection has a text file with contents extracted from various Shakespearean plays. Download the data file from Google Cloud Storage.

 file_URL = "https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt"
 file_name= "shakespeare.txt"
 # get the file path
 path = keras.utils.get_file(file_name, file_URL) 

Output:

download text data

Open the downloaded file and read its content. Print some sample portions from the text.

 raw = open(path, 'rb').read()
 print(raw[250:400]) 

Output:

encoded text

The file was read in binary mode, so its content is a bytes object: Python prints ‘\n’ as a literal escape sequence instead of starting a new line. Since the downloaded file is UTF-8 encoded, we need to decode the bytes into a Python string.

 text = raw.decode(encoding='utf-8')
 print(text[250:400]) 

Output:

decoded text

What is the length of the downloaded text?

len(text)

Output:

The text has more than one million characters. 

Vectorize Text Characters into Integers

A deep learning model cannot accept raw text characters as input; the characters must be encoded into integers that the model can process. Though there are more than a million characters in the given text, the number of unique characters is small. This collection of unique characters is called the vocabulary.

 # unique characters
 vocabulary = np.array(sorted(set(text)))
 len(vocabulary) 

Output:

The vocabulary contains 65 unique characters. Define a tokenizer that converts each text character into a corresponding integer: there will be 65 integers, starting from 0 and ending at 64, assigned according to the order of the characters in the sorted vocabulary.

 # assign an integer to each character
 tokenizer = {char:i for i,char in enumerate(vocabulary)} 

What integers are assigned to what characters? Sample the first 20 characters.

 # check characters and its corresponding integer
 for i in range(20):
     char = vocabulary[i]
     token = tokenizer[char]
     print('%4s : %4d'%(repr(char),token)) 

Output:

text characters vs integers

Vectorize the entire text and verify that the tokenizer maps characters to integers correctly by comparing a sample of the text with its integer vector.

 vector = np.array([tokenizer[char] for char in text])
 print('\nSample Text \n')
 print('-'*70)
 print(text[:100])
 print('-'*70)
 print('\n\nCorresponding Integer Vector \n')
 print('-'*70)
 print(vector[:100])
 print('-'*70) 

Output:

sample text and its vector form

Text with a million encoded characters cannot be fed into a model as such. Since we predict characters, the text must be broken into sequences of a predefined length before being fed to the model. Use TensorFlow’s batch method to create sequences of 100 characters each. Prior to that, convert the NumPy array into a TensorFlow Dataset so that further processing can use TensorFlow’s input pipeline.

 # convert into tensors
 vector = tf.data.Dataset.from_tensor_slices(vector)
 # make sequences each of length 100 characters
 sequences = vector.batch(100, drop_remainder=True) 

Recurrent neural networks predict the next character based on the past characters, so training requires an input sequence of characters paired with a target sequence shifted one character ahead. Prepare input sequences from the first 99 characters of each 100-character sequence and target sequences from the last 99 characters.

 def prepare_dataset(seq):
     input_vector = seq[:-1]
     target_vector = seq[1:]
     return input_vector, target_vector
 dataset = sequences.map(prepare_dataset) 

Let’s sample the first sequence pair.

 # check how it looks
 for inp, tar in dataset.take(1):
     print(inp.numpy())
     print(tar.numpy())
     inp_text = ''.join(vocabulary[inp])
     tar_text = ''.join(vocabulary[tar])
     print(repr(inp_text))
     print(repr(tar_text)) 

Output:

sequence of 100 text characters

Batch and Prefetch Data

The model will be trained with Adam, a stochastic gradient descent (SGD) based optimizer, which requires the input data to be batched. Further, TensorFlow’s prefetch method overlaps data preparation with training by fetching the next batches while the current ones are being processed. We do not shuffle the data, so the contextual order of the sequences is retained.

 AUTOTUNE = tf.data.AUTOTUNE
 # batch size 64; data is not shuffled, to preserve sequence order
 data = dataset.batch(64, drop_remainder=True).repeat()
 data = data.prefetch(AUTOTUNE)
 # steps per epoch is the number of batches available
 STEPS_PER_EPOCH = len(sequences)//64
 for inp, tar in data.take(1):
     print(inp.numpy().shape)
     print(tar.numpy().shape) 

Output:

Build an RNN Model 

Recurrent neural networks are good at modeling time-dependent data because of their ability to retain information across time steps. Since the meaning of text depends heavily on the order of words, natural language processing relies on sequence modeling architectures such as RNNs. Here, an LSTM (Long Short-Term Memory) based recurrent neural network is developed to model the task. While implementing the LSTM layers, we set the stateful argument to True so that the hidden states are carried over between consecutive batches within an epoch. This helps capture the context shared by consecutive sequences.

 model = keras.Sequential([
     # Embed len(vocabulary) into 64 dimensions
     Embedding(len(vocabulary), 64, batch_input_shape=[64,None]),
     # LSTM RNN layers
     LSTM(512, return_sequences=True, stateful=True),
     LSTM(512, return_sequences=True, stateful=True),
     # Classification head
     Dense(len(vocabulary))
 ])
 model.summary() 

Output:

model summary

Plot the model to understand the flow and shapes of data at each layer’s input and output.

keras.utils.plot_model(model, show_shapes=True, dpi=64)

Output:

model plot

Train the RNN Model

We can check whether the model can accept the processed data without any errors.

 # check that the untrained model accepts a batch without errors
 for example_inp, example_tar in data.take(1):
     example_pred = model(example_inp)
     print(example_tar.numpy().shape)
     print(example_pred.shape) 

Output:

The target shape is (64, 99), which corresponds to the batch size and the number of characters per sequence. The last dimension of the prediction, 65, is the size of the vocabulary: for each position, the model outputs a score (logit) for every character in the vocabulary, and the character with the highest score is the most likely next character.
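For intuition, the logits at any position can be converted into probabilities with a softmax. The snippet below is an illustrative sketch (not part of the original pipeline) that reuses the example_pred batch from above and inspects the most likely next character after the last time step of the first sequence; since the model is still untrained here, the result is essentially arbitrary.

 # illustrative sketch: turn the logits of the last time step into
 # probabilities and look at the most likely next character
 last_logits = example_pred[0, -1, :]   # shape (65,)
 probs = tf.nn.softmax(last_logits).numpy()
 best = int(np.argmax(probs))
 print('most likely next character:', repr(str(vocabulary[best])), 'p = %.3f' % probs[best]) 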

Compile the model with the Adam optimizer and the Sparse Categorical Cross-entropy loss function. Since we have not applied a softmax activation on the output layer, the model outputs raw logits rather than probabilities; hence, we must set the ‘from_logits’ argument to True while declaring the loss function. Train the model for 10 epochs.

 model.compile(optimizer='adam', 
      loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))

 history = model.fit(data, 
                     epochs=10, 
                     steps_per_epoch=STEPS_PER_EPOCH) 

Output:

model training

Training took almost an hour to complete on a CPU-only virtual machine.

Model Performance Evaluation

Visualizing the loss over epochs helps get better insight into model performance.

 plt.plot(history.history['loss'], '+-r')
 plt.title('Performance Analysis', size=16, color='green')
 plt.xlabel('Epochs', size=14, color='blue')
 plt.ylabel('Loss', size=14, color='blue')
 plt.xticks(range(10))
 plt.show() 

Output:

model performance

The loss keeps falling even at the 10th epoch, which suggests that the model should be trained for more epochs until it converges. The smoothness of the loss curve suggests that the learning rate is appropriate for this model configuration.
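If you want to continue training until the loss stops improving, a possible continuation (a hypothetical sketch, not run in the article) is to resume fit with more epochs and an EarlyStopping callback that monitors the training loss. Note that doing so would change the weights and therefore the generated samples shown below.

 # hypothetical continuation: keep training until the training loss
 # stops improving (there is no validation split, so monitor 'loss')
 early_stop = keras.callbacks.EarlyStopping(monitor='loss', patience=3)
 history_more = model.fit(data,
                          epochs=50,
                          steps_per_epoch=STEPS_PER_EPOCH,
                          callbacks=[early_stop]) 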

Inference – Next Character Prediction

The most awaited part of this task is predicting new characters with the trained model. We can feed it a few seed characters (for example, a word) and have it iteratively predict the next 1000 characters.

Before starting the prediction, we should reset the model states stored in memory during the last training epoch. Resetting the state memory does not affect the model’s weights.

 # reset previous states of model
 model.reset_states() 

Make predictions by providing the model ‘ANTHONIO:’ as the seed characters. Since the model was built with a fixed batch size of 64, the input must be batched: vectorize the seed characters, expand the dimensions, and repeat the same vector 64 times to obtain a batch of 64 identical sequences. The next character is sampled from the logits output by the model, and the randomness of this sampling can be adjusted with a hyperparameter called temperature.

 sample = 'ANTHONIO:'
 # vectorize the string
 sample_vector = [tokenizer[s] for s in sample]
 predicted = sample_vector
 # convert into tensor of required dimensions
 sample_tensor = tf.expand_dims(sample_vector, 0) 
 # broadcast to first dimension to 64 
 sample_tensor = tf.repeat(sample_tensor, 64, axis=0)

 # predict next 1000 characters
 # temperature is a sensitive variable to adjust prediction
 temperature = 0.6

 for i in range(1000):
     pred = model(sample_tensor)
     # take the first sequence of the batch and scale the logits by temperature
     pred = pred[0].numpy()/temperature
     # sample the next character from the logits of the last time step
     pred = tf.random.categorical(pred, num_samples=1)[-1,0].numpy()
     predicted.append(pred)
     # the stateful LSTM layers carry the context, so only the newly
     # predicted character is fed back in the next iteration
     sample_tensor = tf.expand_dims([pred],0)
     # broadcast the first dimension to 64
     sample_tensor = tf.repeat(sample_tensor, 64, axis=0)

 # convert the integers back to characters
 pred_char = [vocabulary[i] for i in predicted]
 generated = ''.join(pred_char)
 print(generated)

Output:

Text prediction

By adjusting the temperature value, we can vary randomness and obtain different predictions.
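To see why the temperature changes the randomness, consider a toy logits vector (an illustrative sketch, not part of the original pipeline): dividing the logits by a temperature below 1 sharpens the softmax distribution, while a temperature above 1 flattens it, making unlikely characters more probable.

 # illustrative only: effect of temperature on a toy logits vector
 toy_logits = tf.constant([[2.0, 1.0, 0.5]])
 for t in [0.5, 1.0, 2.0]:
     print(t, tf.nn.softmax(toy_logits / t).numpy().round(3)) 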

 sample = 'ANTHONIO:'
 # vectorize the string
 sample_vector = [tokenizer[s] for s in sample]
 predicted = sample_vector
 # convert into tensor of required dimensions
 sample_tensor = tf.expand_dims(sample_vector, 0) 
 # broadcast to first dimension to 64 
 sample_tensor = tf.repeat(sample_tensor, 64, axis=0)

 # predict next 1000 characters
 # vary temperature to change randomness
 temperature = 0.8

 for i in range(1000):
     pred = model(sample_tensor)
     # take the first sequence of the batch and scale the logits by temperature
     pred = pred[0].numpy()/temperature
     # sample the next character from the logits of the last time step
     pred = tf.random.categorical(pred, num_samples=1)[-1,0].numpy()
     predicted.append(pred)
     # the stateful LSTM layers carry the context, so only the newly
     # predicted character is fed back in the next iteration
     sample_tensor = tf.expand_dims([pred],0)
     # broadcast the first dimension to 64
     sample_tensor = tf.repeat(sample_tensor, 64, axis=0)

 # integer to text decoding
 pred_char = [vocabulary[i] for i in predicted]
 generated = ''.join(pred_char)
 print(generated) 

Output:

text prediction

Most of the predicted words are actual English words!

But the predictions lack context; we cannot make much meaning out of the generated sentences. Training for more epochs may help improve the model’s performance. However, a character-level model cannot capture context as well as a word-level model.

This Notebook contains the above code implementation.

Wrapping Up

This article has discussed the concepts of text generation using recurrent neural networks. It explored a next-character prediction task on real data by building a deep learning RNN model, training it, and making inferences from sample seed characters. Interested readers can modify the model with word-level vectorization approaches (such as word2vec) to make next-word predictions.
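As a starting point for such an extension, the sketch below shows one way to tokenize the same text at the word level using Keras’s TextVectorization layer. This is a minimal sketch, assuming TensorFlow 2.6 or later; the dataset preparation and model would still need to be rebuilt around word indices.

 # minimal word-level tokenization sketch (assumes TensorFlow 2.6+)
 from tensorflow.keras.layers import TextVectorization

 word_vectorizer = TextVectorization(max_tokens=10000)
 # build the word vocabulary from the full Shakespeare text
 word_vectorizer.adapt([text])
 # encode a sample of the text into word indices
 print(word_vectorizer([text[:100]]).numpy()) 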

