# How I used Bidirectional Encoder Representations from Transformers (BERT) to Analyze Twitter Data


Time and again, transfer learning has proven its value to computer vision researchers: a neural network such as VGG, pre-trained for image classification, can be fine-tuned for a new task instead of being trained from scratch. In recent years, similar techniques have been applied to natural language processing, where a pre-trained model produces word embeddings that are used to analyze new text. One such pre-trained model is BERT (Bidirectional Encoder Representations from Transformers).

In this article, we will discuss how BERT works and the methodologies involved in its pre-training, and then implement Twitter sentiment analysis using the BERT model.

### Working of BERT

Unlike traditional NLP models that follow a unidirectional approach, reading text either from left to right or from right to left, BERT reads the entire sequence of words at once. BERT makes use of a Transformer, which is essentially an attention mechanism that learns the relationships between the words in a text. In its simplest form, a Transformer consists of two processing components: an encoder that reads the input text and a decoder that produces the predictions. Because BERT's goal is to produce a pre-trained language model, only the encoder is needed.

Inside the encoder, a sentence is first split into individual tokens, and each token is embedded into a vector. The Transformer processes these vectors and produces output vectors, where each output vector corresponds to the input token at the same position.
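To make this concrete, here is a minimal sketch (assuming the Hugging Face transformers library, which we also use later in this article) showing that BERT's encoder emits one output vector per input token:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize a sentence; the tokenizer also adds the special [CLS] and [SEP] tokens.
inputs = tokenizer("We live in the same city", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per input token, in the same order as the input:
# shape (batch_size, sequence_length, hidden_size), e.g. torch.Size([1, 8, 768]).
print(outputs.last_hidden_state.shape)
```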

### Methodologies

One inherent problem when dealing with text is that language models traditionally try to predict the next word in a sequence, e.g., “We live in the same ____”. This is a directional approach, and it limits how much of a sentence's context can be learned. To overcome this, BERT is pre-trained with two methods:

1. Masked Language Modeling (MLM)

For our model to understand context, it needs to be able to decipher the relationships between the words in a sentence. So, before the data is fed into the Transformer, 15% of the words are replaced with a [MASK] token. Using the non-masked words in the sequence, the model learns the context and tries to predict the masked words. The BERT loss function does not consider the predictions of the non-masked words.
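As a quick, hedged illustration of masked-word prediction (separate from the fine-tuning code later in this article), the transformers fill-mask pipeline lets us query a pre-trained BERT for the most likely replacements of a [MASK] token:

```python
from transformers import pipeline

# Load a pre-trained BERT together with its masked-language-modeling head.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses both the left and the right context to score candidate words.
for prediction in fill_mask("We live in the same [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```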

2. Next Sentence Prediction (NSP)

For this process, the model is fed pairs of input sentences, and the goal is to predict whether the second sentence is the continuation of the first in the original document. To do this, 50% of the input pairs are actual consecutive sentences from the original document, and the other 50% pair the first sentence with a random sentence. To help the model distinguish the two sentences, a [CLS] token is placed at the beginning of the first sentence and a [SEP] token at the end of each sentence.

The entire input sequence passes through the Transformer. The output vector of the [CLS] token is fed to a classification layer, and the probability that the second sentence follows the first is computed with the softmax function.
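A small sketch of this, assuming transformers' BertForNextSentencePrediction head (which exposes the NSP objective described above):

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
nsp_model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

first = "We live in the same city."
second = "The city has a beautiful park."

# The tokenizer builds the "[CLS] first [SEP] second [SEP]" input for us.
inputs = tokenizer(first, second, return_tensors="pt")

with torch.no_grad():
    logits = nsp_model(**inputs).logits

# Index 0: the second sentence follows the first; index 1: it is a random sentence.
print(torch.softmax(logits, dim=1))
```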

### Implementation of BERT to Analyze Twitter Data

Let us consider a simple dataset, such as the Twitter sentiment analysis data, for the implementation of BERT.

Checking for GPU in Colab:

```python
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
else:
    print('Using CPU.')
    device = torch.device("cpu")
```

Installing the transformers library:

```
!pip install transformers
```

Loading the training data:

```python
import numpy as np
import pandas as pd

tweet_train = pd.read_csv('https://raw.githubusercontent.com/MohamedAfham/Twitter-Sentiment-Analysis-Supervised-Learning/master/Data/train_tweets.csv')
tweets = tweet_train.tweet.values
labels = tweet_train.label.values
```

Now, we will initialize the BERT tokenizer and convert each word into a unique token. Here we use a method called encode, which combines multiple steps: it splits the sentences into tokens, adds the [CLS] and [SEP] tokens, and maps each token to its ID.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

tweetid = []
for tweet in tweets:
    encoded_tweet = tokenizer.encode(tweet, add_special_tokens=True)
    tweetid.append(encoded_tweet)

print('Original: ', tweets[0])
print('Token IDs:', tweetid[0])
```

Next, we will pad and truncate the sentences so that they all have the same length.

```python
from keras.preprocessing.sequence import pad_sequences

MAX_LEN = 64
print('\nTruncating all sentences to %d values...' % MAX_LEN)
print('\nPadding token: "{:}", ID: {:}'.format(tokenizer.pad_token, tokenizer.pad_token_id))

tweetid = pad_sequences(tweetid, maxlen=MAX_LEN, dtype="long",
                        value=0, truncating="post", padding="post")
```

The final step before training is to create attention masks for the input, so that the model can tell real tokens apart from padding.

```python
masks = []
for tweet in tweetid:
    mask = [int(token_id > 0) for token_id in tweet]
    masks.append(mask)
```

Next, let us split the data into training and validation sets.

```python
from sklearn.model_selection import train_test_split

train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(
    tweetid, labels, random_state=2018, test_size=0.1)
train_masks, validation_masks, _, _ = train_test_split(
    masks, labels, random_state=2018, test_size=0.1)
```

Now, since we are implementing this in PyTorch, we will convert the data into tensors and wrap it in data loaders.

```python
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

batch_size = 32

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
```

We will fine-tune the pre-trained BERT sequence classification model on our data using the AdamW optimizer. We will set the learning rate to a very small value and initialize a linear learning-rate scheduler.

```python
from transformers import BertForSequenceClassification, AdamW, get_linear_schedule_with_warmup

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
    output_attentions=False,
    output_hidden_states=False,
)
model.cuda()

optimizer = AdamW(model.parameters(),
                  lr=2e-5,
                  eps=1e-8)

epochs = 4
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps)
```

Training and evaluation:

```python
import random

def accuracy(preds, labels):
    pred = np.argmax(preds, axis=1).flatten()
    labels = labels.flatten()
    return np.sum(pred == labels) / len(labels)

# Fix the random seeds for reproducibility.
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

loss_values = []

for epoch_i in range(0, epochs):
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    total_loss = 0
    model.train()

    for step, batch in enumerate(train_dataloader):
        if step % 50 == 0 and not step == 0:
            print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(train_dataloader)))

        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        model.zero_grad()
        outputs = model(b_input_ids,
                        token_type_ids=None,
                        attention_mask=b_input_mask,
                        labels=b_labels)
        loss = outputs[0]
        total_loss += loss.item()

        # Backpropagate, clip the gradients and update the parameters.
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

    avg_train_loss = total_loss / len(train_dataloader)
    loss_values.append(avg_train_loss)
    print("  Average training loss: {0:.2f}".format(avg_train_loss))

    # Evaluate on the validation set after each epoch.
    print("validation")
    model.eval()
    eval_accuracy, nb_eval_steps = 0, 0

    for batch in validation_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        with torch.no_grad():
            outputs = model(b_input_ids,
                            token_type_ids=None,
                            attention_mask=b_input_mask)
        logits = outputs[0].detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        eval_accuracy += accuracy(logits, label_ids)
        nb_eval_steps += 1

    print("  Accuracy: {0:.2f}".format(eval_accuracy / nb_eval_steps))
```

Plotting the training loss across the epochs:

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style='darkgrid')
sns.set(font_scale=1.5)
plt.rcParams["figure.figsize"] = (12, 6)

plt.plot(loss_values, 'b-o')
plt.title("Training loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.show()
```

The results indicate a validation accuracy of about 97%, which means the model performed well even on a small dataset. The training-loss curve decreases smoothly, with no sharp spikes, suggesting that the model has not overfitted.

### Testing

Let us see how our model performs on the test data.

```python
tweet_test = pd.read_csv('https://raw.githubusercontent.com/bhoomikamadhukar/NLP/master/test.csv')
```

We will have to apply the same preprocessing steps to the test data as we did for training.

```python
tweetvalidation = tweet_test.tweet.values
labels = tweet_test.label.values

# Tokenize, then pad and truncate, exactly as for the training data.
test_id = []
for tweet in tweetvalidation:
    encoded_tweet = tokenizer.encode(tweet, add_special_tokens=True)
    test_id.append(encoded_tweet)

test_id = pad_sequences(test_id, maxlen=MAX_LEN,
                        dtype="long", truncating="post", padding="post")

attention_masks = []
for seq in test_id:
    seq_mask = [float(i > 0) for i in seq]
    attention_masks.append(seq_mask)

prediction_inputs = torch.tensor(test_id)
prediction_masks = torch.tensor(attention_masks)
prediction_labels = torch.tensor(labels)

batch_size = 32
prediction_data = TensorDataset(prediction_inputs, prediction_masks, prediction_labels)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)

print('Predicting labels for {:,} test sentences...'.format(len(prediction_inputs)))

model.eval()
predictions, true_labels = [], []
for batch in prediction_dataloader:
    batch = tuple(t.to(device) for t in batch)
    b_input_ids, b_input_mask, b_labels = batch
    with torch.no_grad():
        outputs = model(b_input_ids, token_type_ids=None,
                        attention_mask=b_input_mask)
    logits = outputs[0]
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()
    predictions.append(logits)
    true_labels.append(label_ids)

from sklearn.metrics import classification_report

# Report on the first batch of test tweets.
pred_labels = np.argmax(predictions[0], axis=1).flatten()
print(classification_report(true_labels[0], pred_labels))
```

Output:

```
              precision    recall  f1-score   support

           0       0.00      1.00      0.52        11
           1       1.00      0.05      0.09        21

    accuracy                           0.78        32
   macro avg       0.68      0.52      0.31        32
weighted avg       0.78      0.38      0.24        32
```

Thus, we can see that within a short period of time we can build a BERT model that produces predictions on test data with a reasonable score.

### Conclusion

Without doubt, BERT is a remarkable breakthrough in the field of NLP, and the fact that it is fast and easy to implement makes it attractive for exploring the algorithm and building models that solve practical, real-world problems. Through this article, we successfully implemented BERT to analyze Twitter data.
