How I used Bidirectional Encoder Representations from Transformers (BERT) to Analyze Twitter Data

In this article, we discuss how BERT works and the methodologies involved, and implement Twitter sentiment analysis using a BERT model.

Time and again, transfer learning has proven its value to computer vision researchers: a neural network such as VGG is pre-trained on a large image dataset and then fine-tuned for a new task such as image classification. In recent years, similar techniques have been applied to natural language processing, where a pre-trained model produces word embeddings that are then used to analyse new text. One such pre-trained model is BERT (Bidirectional Encoder Representations from Transformers).

Working of BERT

Unlike traditional NLP models that read text in one direction, either left to right or right to left, BERT reads the entire sequence of words at once. BERT is built on the Transformer, an attention mechanism that learns the contextual relationships between the words in a text. In its original form, a Transformer consists of two parts: an encoder and a decoder. The encoder reads the input text and the decoder produces the predictions. Because the goal of BERT is to produce a pre-trained language representation model, it needs only the encoder.

The figure above represents an encoder. A sentence is first split into individual tokens, which are embedded into vectors. The transformer processes these vectors and produces a sequence of output vectors, in which each vector corresponds to the input token at the same index.
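
To make the "one output vector per input token" idea concrete, here is a minimal pure-Python sketch of a single self-attention step, the core operation inside a Transformer encoder. It is a simplification under stated assumptions: the three-dimensional embeddings are made up, and the learned query, key and value projections, the multiple heads and the feed-forward layers of a real encoder are left out.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention: each output is a weighted mix of ALL inputs.

    Queries, keys and values are the input vectors themselves here
    (an assumption made for brevity; real encoders learn projections).
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Scaled dot product of this token with every token, left and right
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Weighted sum of all value vectors
        mixed = [sum(w * value[d] for w, value in zip(weights, vectors))
                 for d in range(dim)]
        outputs.append(mixed)
    return outputs

# Made-up 3-dimensional embeddings for the tokens "we", "live", "here"
embeddings = [[1.0, 0.0, 0.5], [0.2, 1.0, 0.0], [0.0, 0.3, 1.0]]
encoded = self_attention(embeddings)
print(len(encoded), len(encoded[0]))
```

The result has one vector per input token, at the same index, and every output mixes information from the tokens on both sides of it.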


One inherent problem with language modelling is that many models learn by predicting the next word in a sequence, e.g. “We live in the same ____”. This is a directional approach, which limits how much of the sentence's context the model can learn. To overcome this, BERT uses two training strategies:

  1. Masked LM (MLM)

For our model to understand context, it needs to decipher the relationships between the words in a sentence. So, before the data is fed into the transformer, 15% of the tokens in each sequence are replaced with a [MASK] token. Using the non-masked words in the sequence as context, the model then tries to predict the original value of each masked word. The BERT loss function considers only the predictions for the masked positions and ignores the non-masked words.

  2. Next Sentence Prediction (NSP)

For this task, the model is fed pairs of sentences, and the goal is to predict whether the second sentence follows the first in the original document. To do this, 50% of the input pairs are actual consecutive sentences from the original document, and in the other 50% the second sentence is chosen at random. To help the model distinguish the two sentences, a [CLS] token is placed at the beginning of the first sentence and a [SEP] token at the end of each sentence.

The entire input sequence then passes through the transformer. The output vector for the [CLS] token is passed to a classification layer, and the probability that the second sentence follows the first is computed with the softmax function.
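
The two pre-training tasks described above can be sketched in a few lines of plain Python. This is only an illustration of how the training inputs are built, not BERT's actual data pipeline: the function names are hypothetical, and the masking is simplified (the original BERT recipe replaces a chosen token with [MASK] only 80% of the time, swapping in a random token or leaving it unchanged otherwise).

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace roughly mask_rate of the tokens with [MASK].

    Returns the masked sequence and a dict {position: original token};
    only these positions would contribute to the MLM loss.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = "[MASK]"
    return masked, targets

def build_nsp_input(sentence_a, sentence_b):
    """Join two tokenized sentences the way the NSP task expects."""
    return ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]

def make_nsp_example(document, index, rng):
    """Use the true next sentence half of the time, a random other one otherwise.

    Assumes the document has at least three sentences and index is not the last.
    """
    if rng.random() < 0.5:
        return build_nsp_input(document[index], document[index + 1]), 1  # is next
    candidates = [s for i, s in enumerate(document) if i not in (index, index + 1)]
    return build_nsp_input(document[index], rng.choice(candidates)), 0  # not next

tokens = "we live in the same city and work in the same office".split()
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)

document = [["we", "live", "here"], ["it", "is", "nice"], ["cats", "sleep"], ["rain", "falls"]]
pair, label = make_nsp_example(document, 0, random.Random(1))
print(pair, label)
```

The targets dictionary holds exactly the positions the loss would be computed on, which mirrors the point that predictions for non-masked words are ignored.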

Implementation of BERT to Analyze Twitter Data

Let us use a simple dataset, Twitter sentiment analysis data, to implement BERT.

Checking for GPU in Colab:

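import torch

# Use the GPU when Colab provides one, otherwise fall back to the CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
else:
    print('Using CPU.')
    device = torch.device("cpu")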

Installing the transformers library:

!pip install transformers

Loading the data and converting it to NumPy arrays:

import numpy as np
import pandas as pd



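# tweet_train is the training DataFrame, loaded with pandas beforehand,
# with a 'tweet' column (text) and a 'label' column (sentiment class)
tweets = tweet_train.tweet.values
labels = tweet_train.label.values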

Now, we will initialize the BERT tokenizer and convert each tweet into a sequence of token IDs. Here we use a method called encode, which combines multiple steps: it splits the sentence into tokens, adds the [CLS] and [SEP] tokens, and maps each token to its ID.

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tweetid = []
for tweet in tweets:
    # Tokenize, add [CLS] and [SEP], and map the tokens to vocabulary IDs
    encoded_tweet = tokenizer.encode(tweet, add_special_tokens=True)
    tweetid.append(encoded_tweet)

print('Original: ', tweets[0])
print('Token IDs:', tweetid[0])
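
To make the steps that encode combines more concrete, here is a toy sketch of WordPiece-style tokenization, the greedy longest-match-first scheme that BERT's tokenizer is based on. The vocabulary and the helper names are made up for illustration, and the sketch splits on whitespace only, whereas the real tokenizer also handles punctuation and other normalization.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab):
    """Lower-case, split, add [CLS] and [SEP], and map every token to its ID."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return [vocab[token] for token in tokens]

# Hypothetical toy vocabulary; the real bert-base-uncased vocabulary
# has about 30,000 entries
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "tweet": 4, "##ing": 5, "is": 6, "fun": 7}

print(encode("Tweeting is fun", toy_vocab))  # [2, 4, 5, 6, 7, 3]
```

Here "tweeting" is not in the vocabulary, so it is split into "tweet" and "##ing", which is how the tokenizer copes with words it has never seen as a whole.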


Next, we will pad or truncate the sequences so that they all have the same length.

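from keras.preprocessing.sequence import pad_sequences
# In recent Keras releases the same function is available as keras.utils.pad_sequences

MAX_LEN = 64

print('\nPadding/truncating all sentences to %d values...' % MAX_LEN)

print('\nPadding token: "{:}", ID: {:}'.format(tokenizer.pad_token, tokenizer.pad_token_id))

# Longer sequences are cut at the end; shorter ones are padded with 0 at the end
tweetid = pad_sequences(tweetid, maxlen=MAX_LEN, dtype="long",
                          value=0, truncating="post", padding="post")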
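
For readers who want to see what the "post" settings do, the following is a plain-Python sketch of the same behaviour. The function name is hypothetical and the token IDs are illustrative (101 and 102 are the [CLS] and [SEP] IDs in bert-base-uncased).

```python
def pad_or_truncate(sequences, max_len, pad_id=0):
    """Cut long sequences and pad short ones, both at the end ("post")."""
    fixed = []
    for seq in sequences:
        seq = list(seq)[:max_len]                     # truncating="post"
        seq = seq + [pad_id] * (max_len - len(seq))   # padding="post"
        fixed.append(seq)
    return fixed

short = [101, 7, 8, 102]
long = [101, 5, 6, 7, 8, 9, 102]
print(pad_or_truncate([short, long], max_len=5))
# [[101, 7, 8, 102, 0], [101, 5, 6, 7, 8]]
```

Note that truncating at the end also cuts off the closing [SEP] token of a sequence that is too long, so MAX_LEN should be chosen large enough that few tweets are affected.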

The final step before training begins is to create attention masks, which tell the model which positions hold real tokens (1) and which hold padding (0).

masks = []
for tweet in tweetid:
    # 1 for real tokens, 0 for padding (the padding ID is 0)
    mask = [int(token_id > 0) for token_id in tweet]
    masks.append(mask)
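
Why the mask matters can be shown with a small sketch of a masked softmax, the step in which attention weights are computed. This is an illustration only: the scores are made up, and real implementations typically add a large negative number to the masked scores rather than filtering them out as done here.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight wherever mask is 0."""
    kept = [s for s, m in zip(scores, mask) if m == 1]
    peak = max(kept)
    exps = [math.exp(s - peak) if m == 1 else 0.0
            for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

token_ids = [101, 2023, 102, 0, 0]   # the last two positions are padding
mask = [int(token_id > 0) for token_id in token_ids]
weights = masked_softmax([2.0, 1.0, 0.5, 3.0, 3.0], mask)
print(mask)                          # [1, 1, 1, 0, 0]
print([round(w, 3) for w in weights])
```

Even though the padding positions carry the highest raw scores in this example, they receive zero attention weight, so padding cannot influence the representation of the real tokens.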

Next, we split the data into training and validation sets.

from sklearn.model_selection import train_test_split

train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(tweetid, labels, random_state=2018, test_size=0.1)

train_masks, validation_masks, _, _ = train_test_split(masks, labels, random_state=2018, test_size=0.1)
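
The two calls above reuse random_state=2018 so that the inputs and the masks are shuffled in the same order. The sketch below illustrates that principle with the standard library only; the split function is hypothetical and does not reproduce scikit-learn's actual shuffling algorithm.

```python
import random

def split(items, test_size, seed):
    """Shuffle indices with a seeded generator, then hold out the last share."""
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)
    cut = len(items) - int(round(len(items) * test_size))
    train_idx, test_idx = order[:cut], order[cut:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

inputs = ["ids_%d" % i for i in range(10)]
masks = ["mask_%d" % i for i in range(10)]

train_in, val_in = split(inputs, 0.1, seed=2018)
train_mk, val_mk = split(masks, 0.1, seed=2018)

# Same seed, same order: row i of the inputs still lines up with row i of the masks
aligned = all(a[4:] == b[5:] for a, b in zip(train_in + val_in, train_mk + val_mk))
print(aligned)  # True
```

With different seeds the two splits would fall out of step and each tweet would be paired with another tweet's mask. train_test_split also accepts several arrays in a single call, which keeps them aligned without repeating the seed.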

Now, since we are implementing this in PyTorch we will convert the data into tensors. 

train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)

train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)

train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

batch_size = 32

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

We will fine-tune the pre-trained BERT sequence classification model on our data using the AdamW optimizer. We set the learning rate to a very small value and initialize a scheduler that decays it linearly over training.

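from transformers import BertForSequenceClassification
from torch.optim import AdamW  # transformers.AdamW is deprecated in recent releases

model = BertForSequenceClassification.from_pretrained(
   "bert-base-uncased",  # same checkpoint as the tokenizer
   num_labels = 2,
   output_attentions = False,
   output_hidden_states = False,
)
model.to(device)

optimizer = AdamW(model.parameters(),
                 lr = 2e-5,
                 eps = 1e-8
                )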

from transformers import get_linear_schedule_with_warmup

epochs = 4

total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                           num_warmup_steps = 0, 
                                           num_training_steps = total_steps)
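
As a rough sketch of what this scheduler does, the function below computes the factor by which the base learning rate is multiplied at each step: a linear ramp-up during warmup, followed by a linear decay to zero. The function name is hypothetical and total_steps is an illustrative value; in the article it is computed from the dataloader.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Warmup: ramp up linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # Afterwards: decay linearly from 1 to 0 at the final step
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = 1000  # hypothetical value for illustration
for step in (0, 250, 500, 1000):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```

With num_warmup_steps = 0, as used above, training starts at the full learning rate of 2e-5 and the rate falls linearly to zero by the last step.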

Training and evaluation:

def accuracy(preds, labels):
   pred = np.argmax(preds, axis=1).flatten()
   labels = labels.flatten()
   return np.sum(pred == labels) / len(labels)

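import random

# Fix the random seeds so that runs are reproducible
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

loss_values = []

for epoch_i in range(0, epochs):
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    total_loss = 0
    model.train()
    for step, batch in enumerate(train_dataloader):
        if step % 50 == 0 and not step == 0:
            print('  Batch {:>5,}  of  {:>5,}. '.format(step, len(train_dataloader)))
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        model.zero_grad()
        # Passing labels makes the model return the loss as its first output
        outputs = model(b_input_ids,
                        token_type_ids=None,
                        attention_mask=b_input_mask,
                        labels=b_labels)
        loss = outputs[0]
        total_loss += loss.item()
        loss.backward()
        # Clip gradients to a maximum norm of 1.0 before the update
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

    avg_train_loss = total_loss / len(train_dataloader)
    loss_values.append(avg_train_loss)
    print("  Average training loss: {0:.2f}".format(avg_train_loss))

    # Validation at the end of each epoch
    model.eval()
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    for batch in validation_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        with torch.no_grad():
            outputs = model(b_input_ids,
                            token_type_ids=None,
                            attention_mask=b_input_mask)
        logits = outputs[0]
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        tmp_eval_accuracy = accuracy(logits, label_ids)
        eval_accuracy += tmp_eval_accuracy
        nb_eval_steps += 1

    print("  Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))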
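
The call to clip_grad_norm_ in the training loop keeps a single bad batch from producing an oversized update. A minimal sketch of the idea in plain Python follows; the function name is hypothetical, and the gradients are treated as one flat list of numbers rather than per-parameter tensors.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = max_norm / (total_norm + 1e-6)
    if scale < 1.0:
        # Shrink every gradient by the same factor; the direction is unchanged
        grads = [g * scale for g in grads]
    return grads, total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before)   # 5.0
print(clipped)       # approximately [0.6, 0.8]

unchanged, _ = clip_by_global_norm([0.1, 0.2], max_norm=1.0)
print(unchanged)     # [0.1, 0.2]
```

Gradients whose norm is already below the threshold pass through untouched, so clipping only intervenes on unusually large updates.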


import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams["figure.figsize"] = (12,6)
plt.plot(loss_values, 'b-o')
plt.title("Training loss")

The results indicate a validation accuracy of 97% for the BERT model, which means the model performed well even on a small dataset. The training-loss graph above shows no sharp spike, which suggests the model has not overfitted.


Let us see how our model performs on test data. We have to apply the same preprocessing steps here as we did for training.

tweetvalidation = tweet_test.tweet.values
# Assumption: tweet_test carries a 'label' column, as tweet_train does
test_labels = tweet_test.label.values
test_id = []

for tweet in tweetvalidation:
    encoded_tweet = tokenizer.encode(
                        tweet,
                        add_special_tokens = True,
                    )
    test_id.append(encoded_tweet)

test_id = pad_sequences(test_id, maxlen=MAX_LEN,
                          dtype="long", truncating="post", padding="post")

attention_masks = []
for seq in test_id:
    seq_mask = [float(i>0) for i in seq]
    attention_masks.append(seq_mask)

prediction_inputs = torch.tensor(test_id)
prediction_masks = torch.tensor(attention_masks)
prediction_labels = torch.tensor(test_labels)
batch_size = 32

prediction_data = TensorDataset(prediction_inputs, prediction_masks, prediction_labels)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)
print('Predicting labels for {:,} test sentences...'.format(len(prediction_inputs)))


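model.eval()
predictions, true_labels = [], []

for batch in prediction_dataloader:
    batch = tuple(t.to(device) for t in batch)
    b_input_ids, b_input_mask, b_labels = batch
    with torch.no_grad():
        outputs = model(b_input_ids, token_type_ids=None,
                        attention_mask=b_input_mask)
    logits = outputs[0]
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()
    # Keep the logits and true labels of every batch
    predictions.append(logits)
    true_labels.append(label_ids)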

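from sklearn.metrics import classification_report
# Join the per-batch arrays so that predictions and labels line up
# and the report covers every batch
flat_predictions = np.argmax(np.concatenate(predictions, axis=0), axis=1).flatten()
flat_true_labels = np.concatenate(true_labels, axis=0)
print(classification_report(flat_true_labels, flat_predictions))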

             precision    recall  f1-score   support
          0       0.00      1.00      0.52        11
          1       1.00      0.05      0.09        21
   accuracy                           0.78        32
  macro avg       0.68      0.52      0.31        32
weighted avg       0.78      0.38      0.24        32
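
For reference, the precision, recall and F1 columns of such a report can be computed by hand. The sketch below does so in plain Python, first flattening the per-batch lists so that the metrics cover every example. The labels are made up for illustration and are unrelated to the results above, and the helper names are hypothetical.

```python
def flatten(batches):
    """Merge per-batch lists into one flat list covering every example."""
    return [item for batch in batches for item in batch]

def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up labels for two small batches, for illustration only
true_batches = [[0, 1, 1], [0, 0, 1]]
pred_batches = [[0, 1, 0], [0, 1, 1]]

y_true, y_pred = flatten(true_batches), flatten(pred_batches)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=1)
print(precision, recall, f1)
```

In this toy example class 1 has two true positives, one false positive and one false negative, so precision, recall and F1 all come out at two thirds.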

Thus, we can see that within a short time we can build a BERT model that performs fairly well on test data.


Without doubt, BERT is a remarkable breakthrough in the field of NLP. Because it is easy to implement and quick to fine-tune, it invites exploration and makes it practical to build models that solve many real-world problems. In this article, we implemented BERT to analyze Twitter data.
