Text Generation is a task in Natural Language Processing (NLP) in which text is generated under some constraints, such as given initial characters or initial words. We come across this task in day-to-day applications such as character/word/sentence prediction while typing in Gmail, Google Docs, smartphone keyboards, and chatbots. Understanding text generation forms the basis for advanced NLP tasks such as Neural Machine Translation.
This article discusses the text generation task of predicting the next character given its previous characters. It employs a recurrent neural network with LSTM layers to achieve the task. The deep learning workflow is carried out using TensorFlow’s Keras, a high-level API.
Let’s dive deeper into hands-on learning.
Create the Environment
Import the necessary frameworks, libraries, and modules to create the required Python environment. Since we work with text data, an Embedding layer is required; since we build an LSTM recurrent neural network, an LSTM layer is required. In addition, a Dense layer is used to build the classification head.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, LSTM, Embedding
import matplotlib.pyplot as plt
Download Text Data
We need text data to train our model. TensorFlow’s data collection has a text file with contents extracted from various Shakespearean plays. Download the data file from Google Cloud Storage.
file_URL = "https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt"
file_name = "shakespeare.txt"

# get the file path
path = keras.utils.get_file(file_name, file_URL)
Output:
Open the downloaded file and read its content. Print some sample portions from the text.
raw = open(path, 'rb').read()
print(raw[250:400])
Output:
Rather than interpreting ‘\n’ as a newline character and printing the subsequent characters on the next line, Python displays it as literal text. This is because the file was opened in binary mode and read as raw bytes; the downloaded file is UTF-8 encoded. We need to decode the bytes into a Python-readable string.
text = raw.decode(encoding='utf-8')
print(text[250:400])
Output:
What is the length of the downloaded text?
len(text)
Output:
The text has more than one million characters.
Vectorize Text Characters into Integers
A deep learning model cannot accept text characters as inputs; they must be encoded into integers that a model can understand and process. Though there are more than a million characters in the given text, there is only a small number of unique characters. The collection of unique characters is called the vocabulary.
# unique characters
vocabulary = np.array(sorted(set(text)))
len(vocabulary)
Output:
Define a tokenizer that converts each text character into a corresponding integer. With 65 unique characters, the integers range from 0 to 64. We can assign the integers ourselves according to the order of characters in the vocabulary.
# assign an integer to each character
tokenizer = {char: i for i, char in enumerate(vocabulary)}
What integers are assigned to what characters? Sample the first 20 characters.
# check characters and their corresponding integers
for i in range(20):
    char = vocabulary[i]
    token = tokenizer[char]
    print('%4s : %4d' % (repr(char), token))
Output:
Vectorize the entire text and check whether the built tokenizer encodes the text into integers and decodes it back properly.
vector = np.array([tokenizer[char] for char in text])

print('\nSample Text \n')
print('-'*70)
print(text[:100])
print('-'*70)
print('\n\nCorresponding Integer Vector \n')
print('-'*70)
print(vector[:100])
print('-'*70)
Output:
Text with a million encoded characters cannot be fed into a model as-is. Since we predict characters, the text must be broken down into sequences of a predefined length before being fed to the model. Use TensorFlow’s batch method to create sequences of 100 characters each. Prior to that, convert the NumPy array into a TensorFlow Dataset so that further processing can be done with TensorFlow.
# convert into tensors
vector = tf.data.Dataset.from_tensor_slices(vector)

# make sequences each of length 100 characters
sequences = vector.batch(100, drop_remainder=True)
Recurrent neural networks predict subsequent characters based on past characters. For training, RNNs require a sequence of input characters and a corresponding target sequence shifted one position ahead. Prepare input sequences from the first 99 characters of each sequence and target sequences from the last 99 characters.
def prepare_dataset(seq):
    input_vector = seq[:-1]
    target_vector = seq[1:]
    return input_vector, target_vector

dataset = sequences.map(prepare_dataset)
Let’s sample the first sequence pair.
# check how it looks
for inp, tar in dataset.take(1):
    print(inp.numpy())
    print(tar.numpy())
    inp_text = ''.join(vocabulary[inp])
    tar_text = ''.join(vocabulary[tar])
    print(repr(inp_text))
    print(repr(tar_text))
Output:
Batch and Prefetch Data
The model will be trained with Adam, a Stochastic Gradient Descent (SGD)-based optimizer, which requires the input data to be batched. Further, TensorFlow’s prefetch method helps optimize memory during training by fetching data batches just before they are required. We prefer not to shuffle the data, so as to retain the contextual order of the sequences.
AUTOTUNE = tf.data.AUTOTUNE

# batch size 64
data = dataset.batch(64, drop_remainder=True).repeat()
data = data.prefetch(AUTOTUNE)

# steps per epoch is the number of batches available
STEPS_PER_EPOCH = len(sequences) // 64

for inp, tar in data.take(1):
    print(inp.numpy().shape)
    print(tar.numpy().shape)
Output:
Build an RNN Model
Recurrent neural networks are good at modeling time-dependent data because of their ability to retain information across time steps. Since the meaning of text is determined largely by the order of its words, natural language processing relies heavily on sequence-modeling architectures such as RNNs. Here, a recurrent neural network based on LSTM (Long Short-Term Memory) layers is developed to model the task. While implementing the LSTM layers, we set the stateful argument to True so that the hidden and cell states of previous batches carry over to subsequent batches within an epoch. This helps capture the context shared among consecutive sequences.
model = keras.Sequential([
    # Embed len(vocabulary) into 64 dimensions
    Embedding(len(vocabulary), 64, batch_input_shape=[64, None]),

    # LSTM RNN layers
    LSTM(512, return_sequences=True, stateful=True),
    LSTM(512, return_sequences=True, stateful=True),

    # Classification head
    Dense(len(vocabulary))
])

model.summary()
Output:
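To build intuition for the stateful behavior described above, here is a minimal NumPy sketch of a single LSTM step with random, purely hypothetical weights (names like `lstm_step`, `W`, `U`, `b` are illustrative, not part of the Keras API). It shows how the hidden state h and cell state c persist across time steps, which is exactly what stateful=True preserves across batches:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step: gates computed from input x and previous hidden state."""
    z = W @ x + U @ h_prev + b                      # pre-activations for all four gates
    i, f, o, g = np.split(z, 4)
    sigmoid = lambda a: 1 / (1 + np.exp(-a))
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)    # input, forget, output gates
    g = np.tanh(g)                                  # candidate cell values
    c = f * c_prev + i * g                          # cell state carries long-term memory
    h = o * np.tanh(c)                              # hidden state is the step's output
    return h, c

rng = np.random.default_rng(0)
units, features = 4, 3                              # toy sizes for illustration
W = rng.normal(size=(4 * units, features))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for t in range(5):                                  # h and c persist across time steps
    h, c = lstm_step(rng.normal(size=features), h, c, W, U, b)
print(h.shape, c.shape)
```

The key point is the last loop: each step receives the previous h and c, so information from earlier inputs can influence later predictions.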
Plot the model to understand the flow and shapes of data at each layer’s input and output.
keras.utils.plot_model(model, show_shapes=True, dpi=64)
Output:
Train the RNN Model
We can check whether the model can accept the processed data without any errors.
# check whether the untrained model accepts the data without errors
for example_inp, example_tar in data.take(1):
    example_pred = model(example_inp)
    print(example_tar.numpy().shape)
    print(example_pred.shape)
Output:
The target shape is (64, 99), referring to the batch size and the number of characters in each sequence. The last dimension of the prediction, 65, is the size of the vocabulary: for each position, the model outputs a score for every character in the vocabulary. The character with the highest score is the most likely next character.
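To make this concrete, here is a small self-contained sketch (the logits values are made up for illustration) showing how a vector of per-character scores is turned into a probability distribution with softmax and how the next character can be sampled from it, rather than always taking the argmax:

```python
import numpy as np

# hypothetical logits for a tiny 5-character vocabulary (one prediction step)
logits = np.array([2.0, 0.5, -1.0, 0.1, 1.2])

# softmax turns logits into a probability distribution over the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# sample a character id instead of greedily picking the argmax
rng = np.random.default_rng(42)
next_char_id = rng.choice(len(probs), p=probs)

print(np.round(probs, 3), next_char_id)
```

Sampling rather than taking the argmax is what lets the trained model generate varied text from the same prompt.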
Compile the model with the Adam optimizer and the Sparse Categorical Cross-entropy loss function. Since we have not applied a softmax activation to the output layer, the model outputs raw logits rather than probabilities. Hence, we should set the argument ‘from_logits’ to True while declaring the loss function. Train the model for 10 epochs.
model.compile(optimizer='adam',
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))

history = model.fit(data, epochs=10, steps_per_epoch=STEPS_PER_EPOCH)
Output:
Training took almost an hour to complete on a virtual machine with a CPU runtime.
Model Performance Evaluation
Visualizing the loss over epochs helps gain better insight into model performance.
plt.plot(history.history['loss'], '+-r')
plt.title('Performance Analysis', size=16, color='green')
plt.xlabel('Epochs', size=14, color='blue')
plt.ylabel('Loss', size=14, color='blue')
plt.xticks(range(10))
plt.show()
Output:
The loss keeps falling even at the 10th epoch, which suggests that the model should be trained for more epochs until it converges. The smoothness of the loss curve suggests that the learning rate is appropriate for this model configuration.
Inference – Next Character Prediction
The most awaited part of this task is predicting the next characters with the trained model. We feed the model some input characters (here, a word) and let it iteratively predict the next 1000 characters.
Before starting prediction, we should reset the model states stored in memory during the last training epoch. Resetting the state memories does not affect the model’s weights.
# reset previous states of the model
model.reset_states()
Make predictions by providing the model ‘ANTHONIO:’ as the input characters. The model expects data with a leading batch dimension of 64, so vectorize the input characters, expand the dimensions, and broadcast the same vector 64 times to obtain a batch of 64 identical sequences. Predictions are made by sampling from the logits output by the model. The randomness of this sampling can be adjusted by tuning a hyperparameter called temperature: dividing the logits by a lower temperature makes predictions more deterministic, while a higher temperature makes them more random.
sample = 'ANTHONIO:'

# vectorize the string
sample_vector = [tokenizer[s] for s in sample]
predicted = sample_vector

# convert into a tensor of the required dimensions
sample_tensor = tf.expand_dims(sample_vector, 0)

# broadcast the first dimension to 64
sample_tensor = tf.repeat(sample_tensor, 64, axis=0)

# predict next 1000 characters
# temperature adjusts the randomness of predictions
temperature = 0.6
for i in range(1000):
    pred = model(sample_tensor)
    # reduce unnecessary dimensions and apply temperature scaling
    pred = pred[0].numpy() / temperature
    pred = tf.random.categorical(pred, num_samples=1)[-1, 0].numpy()
    predicted.append(pred)
    # feed only the latest character; the stateful model keeps the context
    sample_tensor = tf.expand_dims([pred], 0)
    sample_tensor = tf.repeat(sample_tensor, 64, axis=0)

# convert the integers back to characters
pred_char = [vocabulary[i] for i in predicted]
generated = ''.join(pred_char)
print(generated)
Output:
By adjusting the temperature value, we can vary randomness and obtain different predictions.
sample = 'ANTHONIO:'

# vectorize the string
sample_vector = [tokenizer[s] for s in sample]
predicted = sample_vector

# convert into a tensor of the required dimensions
sample_tensor = tf.expand_dims(sample_vector, 0)

# broadcast the first dimension to 64
sample_tensor = tf.repeat(sample_tensor, 64, axis=0)

# predict next 1000 characters
# vary temperature to change randomness
temperature = 0.8
for i in range(1000):
    pred = model(sample_tensor)
    # reduce unnecessary dimensions and apply temperature scaling
    pred = pred[0].numpy() / temperature
    pred = tf.random.categorical(pred, num_samples=1)[-1, 0].numpy()
    predicted.append(pred)
    # feed only the latest character; the stateful model keeps the context
    sample_tensor = tf.expand_dims([pred], 0)
    sample_tensor = tf.repeat(sample_tensor, 64, axis=0)

# integer to text decoding
pred_char = [vocabulary[i] for i in predicted]
generated = ''.join(pred_char)
print(generated)
Output:
Most of the predicted words are real English words!
But the predictions lack context; we cannot make much meaning out of the generated sentences. Training for more epochs may improve the model’s performance. However, a character-prediction model cannot capture context as well as a word-prediction model.
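The effect of the temperature hyperparameter can also be seen in isolation. The sketch below (with made-up logits for a four-character vocabulary) shows that dividing the logits by a temperature below 1 sharpens the softmax distribution, making sampling more deterministic, while a temperature above 1 flattens it, making sampling more random:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.2, -0.5])   # hypothetical logits for 4 characters

# lower temperature -> sharper distribution (more deterministic sampling)
# higher temperature -> flatter distribution (more random sampling)
probs = {t: softmax(logits / t) for t in (0.5, 1.0, 2.0)}
for t, p in probs.items():
    print(f"temperature={t}: {np.round(p, 3)}")
```

This is why the temperature=0.8 run above produces more varied output than the temperature=0.6 run.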
This Notebook contains the above code implementation.
Wrapping Up
This article has discussed the concepts of text generation using recurrent neural networks. It explored a next-character prediction task on real data by building a deep learning RNN model, training it, and making inferences from sample characters. Interested readers can modify the model with word-level vectorization approaches (such as word2vec) to make next-word predictions.
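As a starting point for that modification, the same vectorization pipeline carries over to the word level. The sketch below is a minimal, toy illustration (a single made-up sentence, naive whitespace tokenization; in practice one would use learned embeddings such as word2vec or a proper tokenizer):

```python
import numpy as np

text = "to be or not to be that is the question"

# build a word-level vocabulary instead of a character-level one
words = text.split()
vocab = sorted(set(words))
word_to_id = {w: i for i, w in enumerate(vocab)}

# vectorize the text as a sequence of word ids
vector = np.array([word_to_id[w] for w in words])

# input/target pairs now shift by one WORD rather than one character
input_seq, target_seq = vector[:-1], vector[1:]
print(vocab)
print(input_seq, target_seq)
```

Everything downstream (batching, the LSTM stack, the classification head sized to the vocabulary) stays structurally the same; only the vocabulary becomes much larger.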
References
- Official Tutorial on RNN
- Official Tutorial on Text Generation
- TensorFlow Official Datasets
- Shakespeare plays text data