Time and again, transfer learning has proven to be of great value to computer vision researchers: a network such as VGG, pre-trained on image classification, can be fine-tuned for a new task instead of being trained from scratch. In recent years, similar techniques have been applied to natural language processing, where a pre-trained model produces word embeddings that are then used to analyze new text. One such pre-trained model is BERT: Bidirectional Encoder Representations from Transformers.
In this article, we will discuss how BERT works and the methodologies involved, and will implement Twitter sentiment analysis using the BERT model.
Working of BERT
Unlike traditional NLP models that follow a unidirectional approach, that is, reading the text either from left to right or from right to left, BERT reads the entire sequence of words at once. BERT makes use of the Transformer, an attention mechanism that learns the relationships between the words in a text. In its simplest form, the Transformer consists of two processing blocks: an encoder that reads the input text and a decoder that produces the predictions. But because the main goal of BERT is to create a pre-trained language representation model, it uses only the encoder.
Inside the encoder, a sentence is first split into individual tokens, and each token is embedded into a vector. The Transformer processes these vectors and produces outputs that are also vectors, in which each output vector corresponds to the input token at the same index.
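To make this concrete, here is a minimal sketch, assuming the Hugging Face transformers library (BertModel and BertTokenizer are its standard classes, though BertModel is not used in the tutorial code below), that feeds a sentence through a plain BERT encoder and confirms there is one output vector per input token:
import torch
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# The tokenizer splits the sentence and adds two special tokens around it
inputs = tokenizer("We live in the same city", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(inputs['input_ids'].shape)        # torch.Size([1, 8]): 6 words plus 2 special tokens
print(outputs.last_hidden_state.shape)  # torch.Size([1, 8, 768]): one 768-dimensional vector per token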
Methodologies
One inherent problem when dealing with text is that models try to predict the next word in a sequence, e.g., “We live in the same ____”. This is a directional approach, and it limits how much of the sentence's context can be learned. To overcome this, BERT uses two training methods, illustrated in the sketch after this list:
- Masked LM (MLM)
For our model to understand the context, it needs to be able to decipher the relationships between the words in a sentence. So, before the data is fed into the Transformer, 15% of the words are replaced with a [MASK] token. This way, using the non-masked words in the sequence, the model learns the context and tries to predict the masked words. The BERT loss function does not consider the predictions of the non-masked words.
- Next Sentence Prediction (NSP)
For this process, the model is fed pairs of input sentences, and the goal is to predict whether the second sentence is the continuation of the first in the original document. To do this, 50% of the input pairs are actual consecutive sentences from the original document, and 50% pair the first sentence with a random sentence. To help the model distinguish the two sentences, a [CLS] token is placed at the beginning of the first sentence and a [SEP] token at the end of each sentence.
The entire input sequence goes through the Transformer. The output vector corresponding to the [CLS] token is fed into a classification layer, and the probability that the second sentence follows the first is computed with the softmax function.
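The sketch below, again assuming the Hugging Face transformers library (the fill-mask pipeline and BertForNextSentencePrediction are its standard APIs, not part of this tutorial's training code), illustrates both methods: predicting a [MASK]ed word, and scoring a sentence pair wrapped in [CLS]/[SEP] with the NSP head:
import torch
from transformers import pipeline, BertTokenizer, BertForNextSentencePrediction
# Masked LM: predict the most likely fillers for the [MASK] slot
unmasker = pipeline('fill-mask', model='bert-base-uncased')
for prediction in unmasker("We live in the same [MASK]."):
    print(prediction['token_str'], round(prediction['score'], 3))
# NSP: the tokenizer wraps the sentence pair with [CLS] and [SEP]
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
encoding = tokenizer("He went to the store.", "He bought some milk.", return_tensors='pt')
print(tokenizer.convert_ids_to_tokens(encoding['input_ids'][0].tolist()))
# The NSP head returns two logits; softmax turns them into P(IsNext) and P(NotNext)
nsp_model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
with torch.no_grad():
    logits = nsp_model(**encoding).logits
print(torch.softmax(logits, dim=1))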
Implementation of BERT to Analyze Twitter Data
Let us consider a simple dataset, the Twitter sentiment analysis data, for the implementation of BERT.
Checking for GPU in Colab:
import torch
if torch.cuda.is_available():
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
else:
    print('Using CPU.')
    device = torch.device("cpu")
Installing the transformers library:
!pip install transformers
Loading the data and converting it to NumPy arrays:
import numpy as np
import pandas as pd
tweet_train = pd.read_csv('https://raw.githubusercontent.com/MohamedAfham/Twitter-Sentiment-Analysis-Supervised-Learning/master/Data/train_tweets.csv')
tweets = tweet_train.tweet.values
labels = tweet_train.label.values
Now, we will initialize the BERT tokenizer and convert each word into a unique token ID. Here we use a method called encode, which combines multiple steps: it splits the sentences into tokens, adds the [CLS] and [SEP] tokens, and maps each token to its ID.
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tweetid = []
for tweet in tweets:
    encoded_tweet = tokenizer.encode(tweet, add_special_tokens=True)
    tweetid.append(encoded_tweet)
print('Original: ', tweets[0])
print('Token IDs:', tweetid[0])
Next, we will pad and truncate the sequences so that they all have the same length.
from keras.preprocessing.sequence import pad_sequences
# note: in recent Keras versions this utility lives at keras.utils.pad_sequences
MAX_LEN = 64
print('\nPadding/truncating all sentences to %d tokens...' % MAX_LEN)
print('Padding token: "{:}", ID: {:}'.format(tokenizer.pad_token, tokenizer.pad_token_id))
tweetid = pad_sequences(tweetid, maxlen=MAX_LEN, dtype="long",
                        value=0, truncating="post", padding="post")
The final step before training is to create attention masks, which mark the real tokens with 1 and the padding tokens with 0.
masks = []
for tweet in tweetid:
    mask = [int(token_id > 0) for token_id in tweet]
    masks.append(mask)
Next, let us split the data into training and validation sets. Using the same random_state in both calls keeps the inputs and their attention masks aligned.
from sklearn.model_selection import train_test_split
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(tweetid, labels, random_state=2018, test_size=0.1)
train_masks, validation_masks, _, _ = train_test_split(masks, labels, random_state=2018, test_size=0.1)
Now, since we are implementing this in PyTorch, we will convert the data into tensors.
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
batch_size = 32
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
We will fine-tune the pre-trained BERT sequence classification model on our data with the AdamW optimizer. We will set the learning rate to a very small value and initialize a linear learning-rate scheduler.
from transformers import BertForSequenceClassification, AdamW, BertConfig
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels = 2,                 # binary sentiment classification
    output_attentions = False,
    output_hidden_states = False,
)
model.to(device)  # move the model to the GPU if one is available, otherwise run on the CPU
optimizer = AdamW(model.parameters(),
                  lr = 2e-5,
                  eps = 1e-8)
from transformers import get_linear_schedule_with_warmup
epochs = 4
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0,
                                            num_training_steps = total_steps)
Training and evaluation:
def accuracy(preds, labels):
    pred = np.argmax(preds, axis=1).flatten()
    labels = labels.flatten()
    return np.sum(pred == labels) / len(labels)
import random
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
loss_values = []
for epoch_i in range(0, epochs):
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    total_loss = 0
    model.train()
    for step, batch in enumerate(train_dataloader):
        if step % 50 == 0 and not step == 0:
            print('  Batch {:>5,} of {:>5,}.'.format(step, len(train_dataloader)))
        # Unpack the batch and move it to the device
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        model.zero_grad()
        # Forward pass; passing labels makes the model return the loss
        outputs = model(b_input_ids,
                        token_type_ids=None,
                        attention_mask=b_input_mask,
                        labels=b_labels)
        loss = outputs[0]
        total_loss += loss.item()
        # Backward pass, gradient clipping and parameter update
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
    avg_train_loss = total_loss / len(train_dataloader)
    loss_values.append(avg_train_loss)
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("Running validation...")
    model.eval()
    eval_accuracy = 0
    nb_eval_steps = 0
    for batch in validation_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        # No gradients needed during evaluation
        with torch.no_grad():
            outputs = model(b_input_ids,
                            token_type_ids=None,
                            attention_mask=b_input_mask)
        logits = outputs[0]
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        eval_accuracy += accuracy(logits, label_ids)
        nb_eval_steps += 1
    print("  Validation accuracy: {0:.2f}".format(eval_accuracy / nb_eval_steps))
Finally, let us plot the average training loss across the epochs:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')
sns.set(font_scale=1.5)
plt.rcParams["figure.figsize"] = (12,6)
plt.plot(loss_values, 'b-o')
plt.title("Training loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.show()
The results indicate a validation accuracy of about 97% for the BERT model, which means the model performed well even on a small dataset. The training loss in the graph above decreases steadily with no sharp spikes, which suggests the model has not overfitted.
Testing
Let us see how our model performs on the test data.
tweet_test = pd.read_csv('https://raw.githubusercontent.com/bhoomikamadhukar/NLP/master/test.csv')
We have to apply the same preprocessing steps here as we did for the training data.
tweetvalidation = tweet_test.tweet.values
labels = tweet_test.label.values
test_id = []
for tweet in tweetvalidation:
    encoded_tweet = tokenizer.encode(tweet, add_special_tokens=True)
    test_id.append(encoded_tweet)
test_id = pad_sequences(test_id, maxlen=MAX_LEN,
                        dtype="long", truncating="post", padding="post")
attention_masks = []
for seq in test_id:
    seq_mask = [float(i > 0) for i in seq]
    attention_masks.append(seq_mask)
prediction_inputs = torch.tensor(test_id)
prediction_masks = torch.tensor(attention_masks)
prediction_labels = torch.tensor(labels)
batch_size = 32
prediction_data = TensorDataset(prediction_inputs, prediction_masks, prediction_labels)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)
print('Predicting labels for {:,} test sentences...'.format(len(prediction_inputs)))
model.eval()
predictions, true_labels = [], []
for batch in prediction_dataloader:
    batch = tuple(t.to(device) for t in batch)
    b_input_ids, b_input_mask, b_labels = batch
    with torch.no_grad():
        outputs = model(b_input_ids, token_type_ids=None,
                        attention_mask=b_input_mask)
    logits = outputs[0]
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()
    predictions.append(logits)
    true_labels.append(label_ids)
from sklearn.metrics import classification_report
# Print one classification report per batch of test tweets
for i in range(len(true_labels)):
    pred_labels = np.argmax(predictions[i], axis=1).flatten()
    print(classification_report(true_labels[i], pred_labels))
              precision    recall  f1-score   support

           0       0.00      1.00      0.52        11
           1       1.00      0.05      0.09        21

    accuracy                           0.78        32
   macro avg       0.68      0.52      0.31        32
weighted avg       0.78      0.38      0.24        32
Thus we can see that, within a short period of time, we were able to build a BERT model that works on test data with a fairly good score.
Conclusion
Without doubt, BERT is a remarkable breakthrough in the field of NLP, and the fact that it is easy to implement and fast to fine-tune makes it well suited for exploring the algorithm and building models that solve many practical, real-world problems. Through this article, we successfully implemented BERT to analyze Twitter data.