Various state-of-the-art NLP applications like sentiment analysis, question answering and smart assistants require a tremendous amount of data. This raw text cannot be fed directly to a machine learning model: almost all text-based applications require a lot of pre-processing, such as building embedding vectors from scratch using word frequency counts, which consumes considerable effort and time. To avoid this, transfer learning models are now used to handle the complex pre-processing: we simply feed our raw text to the transfer learning model and the rest of the process is taken care of by it.
In this article, we are going to discuss one such transfer learning framework, BERT. We will see how to use the BERT pre-processing module to generate word embeddings easily, without putting in a lot of effort. The major points to be covered in this article are listed below.
Table of Contents
- Standard Procedure of Text Pre-processing
- What is BERT?
- Working of BERT
- BERT Pre-processing Model
Let’s start with the discussion.
Standard Procedure of Text Pre-processing
Text pre-processing is the set of techniques used to clean up text data before feeding it to a machine learning model. Text data contains a variety of noise, such as emoticons, punctuation, and inconsistent capitalization. That is only the beginning of the difficulties, because machines cannot understand words; they require numbers. So we must find a fast and efficient way to transform text into numbers.
The standard or conventional pre-processing procedure is somewhat tedious and largely manual. The steps below are typically carried out under the hood of standard pre-processing:
- Lower casing the corpus
- Removing the punctuation
- Removing the stopwords
- Tokenizing the corpus
- Stemming and Lemmatization
- Word embeddings using CountVectorizer and TF-IDF
The worked-out examples of the above steps have been covered in these articles: Complete Tutorial on Text Preprocessing in NLP and How to Identify Entities in NLP?
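For reference, here is a minimal sketch of what this conventional pipeline often looks like in practice, using NLTK and scikit-learn. The toy corpus, the English stopword list, and the choice of TfidfVectorizer are assumptions made for demonstration only, not taken from the articles referenced above.

import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')
nltk.download('wordnet')

corpus = ["Blog writing is AWESOME!", "Writing blogs takes time."]  # toy corpus (assumed)

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean(doc):
    doc = doc.lower()                                                # lower casing
    doc = doc.translate(str.maketrans('', '', string.punctuation))   # remove punctuation
    tokens = doc.split()                                             # simple tokenization
    tokens = [t for t in tokens if t not in stop_words]              # remove stopwords
    tokens = [lemmatizer.lemmatize(t) for t in tokens]               # lemmatization
    return ' '.join(tokens)

cleaned = [clean(d) for d in corpus]
vectors = TfidfVectorizer().fit_transform(cleaned)                   # TF-IDF vectors
print(vectors.toarray())

Every one of these steps has to be chosen and tuned by hand, which is exactly the effort the BERT pre-processing module removes.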
Usually, while approaching any NLP problem, we tend to follow this process, yet it does not guarantee reasonable results if our raw data changes even slightly. For example, if the data comes from a web page, we need additional work to remove HTML tags. Nowadays, all these pre-processing steps can be carried out by using transfer learning modules like BERT.
What is BERT?
BERT is an acronym for Bidirectional Encoder Representations from Transformers. It pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both the left and right context of every token. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to produce state-of-the-art models for a wide range of NLP tasks.
BERT has been pre-trained on a vast corpus of unlabeled text, including the entire English Wikipedia (about 2,500 million words) and the BookCorpus (over 800 million words). Half of BERT’s success can be attributed to this pre-training phase: because the model is trained on such a big text corpus, it begins to pick up on the subtler details of how the language works, and this knowledge can be applied to a wide variety of NLP tasks.
BERT is a deeply bidirectional model. Bidirectional means that during the training phase, BERT learns information from both the left and right sides of a token’s context. This bidirectionality is essential for fully comprehending the meaning of a language.
Working of BERT
To learn the contextual relationships between words in a text, BERT utilizes the Transformer, an attention-based architecture. The vanilla Transformer has two mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s purpose is to build a language representation model, only the encoder mechanism is required.
Unlike directional models that read the text input sequentially, the Transformer encoder reads the entire sequence of words at once. For this reason it is described as bidirectional, although a more accurate term would be non-directional. This property allows the model to learn a word’s context from all of its surroundings.

Figure 1: BERT
During the BERT training process, pairs of sentences are provided as input to the model, and it learns to predict whether the second sentence in the pair is the sentence that follows the first in the original document. Half of the inputs during training are pairs in which the second sentence really is the next sentence in the document, while in the other half the second sentence is a random sentence from the corpus, which is assumed to be disconnected from the first.
As shown above, during training a [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is appended at the end of each sentence. Each token also receives a sentence embedding indicating whether it belongs to Sentence A or Sentence B; sentence embeddings are conceptually similar to token embeddings, but with a vocabulary of only two. Finally, each token is assigned a positional embedding that corresponds to its position in the sequence.
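The snippet below is a minimal, hand-written illustration of this packing convention. The example tokens are made up and no real WordPiece vocabulary indices are used; it only shows how the special tokens, segment ids, and position ids line up.

# Illustrative only: how a sentence pair is packed for BERT
sentence_a = ["my", "dog", "is", "cute"]      # assumed example tokens
sentence_b = ["he", "likes", "playing"]

tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)  # Sentence A = 0, Sentence B = 1
position_ids = list(range(len(tokens)))       # positional embedding indices

print(tokens)        # ['[CLS]', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'likes', 'playing', '[SEP]']
print(segment_ids)   # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
print(position_ids)  # [0, 1, 2, ..., 9]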

Figure 2: BERT Pre-training
Before feeding word sequences into BERT, about 15% of the tokens in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words using the context provided by the other, non-masked tokens in the sequence. To predict the output words, a classification layer is added on top of the encoder output; the output vectors are multiplied by the embedding matrix to project them into the vocabulary dimension, and the probability of each word in the vocabulary is computed with a softmax.
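As a rough illustration, the sketch below applies such a masking step to a tokenized sentence. It is a simplified, assumed version of what BERT actually does: the paper selects about 15% of tokens and replaces only 80% of those with [MASK] (keeping 10% unchanged and swapping 10% for random tokens), a detail omitted here for brevity.

import random

MASK_PROB = 0.15  # fraction of tokens selected for prediction

def mask_tokens(tokens, mask_prob=MASK_PROB):
    """Replace a random subset of tokens with [MASK]; the originals become prediction targets."""
    masked = list(tokens)
    labels = {}  # position -> original token to be predicted
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue  # special tokens are never masked
        if random.random() < mask_prob:
            labels[i] = tok
            masked[i] = "[MASK]"
    return masked, labels

tokens = ["[CLS]", "blog", "writing", "is", "awesome", "[SEP]"]
print(mask_tokens(tokens))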
The BERT loss function only considers the predictions of the masked tokens and ignores the predictions of the non-masked ones; as a consequence, the model converges more slowly than directional models. When training BERT, the Masked LM and the next sentence prediction tasks described above (see Figures 1 and 2) are trained jointly, in order to minimize the combined loss function of the two strategies.
BERT Pre-processing Model
There are a variety of pre-trained BERT models available on TensorFlow Hub, such as the original BERT, ALBERT, ELECTRA, and MuRIL (a multilingual representation for Indian languages, pre-trained on 17 of them), among many others. Both an encoder and a matching pre-processing model are available for each of them.
There is a pre-processing model for each BERT encoder. Using TensorFlow operators from the TensorFlow Text (tensorflow_text) library, it converts raw text into the numeric input tensors expected by the encoder. Unlike pure Python pre-processing, these operations can become part of a TensorFlow model that serves directly from text inputs. Each TF Hub pre-processing model comes preconfigured with a vocabulary and its associated text normalization logic, and requires no further configuration.

Let’s implement a few examples of pre-processing.
!pip3 install --quiet tensorflow-text

import tensorflow_hub as hub
import tensorflow_text as text  # registers the custom ops used by the preprocessing model

# Load the BERT pre-processing model from TensorFlow Hub
preprocess = hub.load('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/1')

# Use BERT pre-processing on a batch of raw text inputs
text_preprocessed = preprocess(['Blog writing is awesome.'])

# The output is a dict of tensors: 'input_word_ids', 'input_mask', 'input_type_ids'
print(text_preprocessed.keys())

The pre-processed output from the module is a dictionary of three tensors. The input_mask tells the encoder which positions hold real tokens and which are just padding. The input_type_ids tensor indicates which part of the input belongs to the first sentence and which to the second; since we passed only one sentence here, it is all zeros. Finally, input_word_ids are the vocabulary indices corresponding to each token.
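To go from these encoder inputs to actual embeddings, the matching BERT encoder can be loaded from TensorFlow Hub and called on the pre-processed dictionary. The sketch below assumes the bert_en_uncased_L-12_H-768_A-12 encoder that pairs with the preprocessor used above; the variable names are our own.

# Continuing from the pre-processing example above (a sketch, assuming the matching encoder)
encoder = hub.load('https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4')

outputs = encoder(text_preprocessed)
print(outputs['pooled_output'].shape)    # (1, 768): one sentence-level embedding
print(outputs['sequence_output'].shape)  # (1, 128, 768): per-token contextual embeddings

The pooled_output can be used for sentence-level tasks such as classification, while sequence_output gives a contextual embedding for every token in the input.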
Conclusion
In this post, we have understood what BERT actually is and how it works. We also saw how easily word embeddings can be generated using the BERT pre-processing module together with its matching encoder. All the traditional pre-processing steps are handled inside these modules, which saves a lot of time while building an NLP-based model.