With the advent of AI, things that were once deemed impossible are now entirely achievable. AI has found its way into medicine, retail, e-commerce, IT and pretty much every other domain.
Nowadays AI can be used to write code, resumes and articles, drive cars, detect terminal diseases, optimize supply chain management, find the shortest route, and the list goes on.
In this article we will get our hands on BERT and use it to classify whether a sentence is grammatically correct or not. Covering BERT in its entirety is out of scope for this blog, but I have linked a few resources for anyone who would like to go in-depth with BERT.
As an overview:
BERT is essentially a language model built on the encoder part of the Transformer encoder-decoder architecture.
- BERT stands for Bidirectional Encoder Representations from Transformers. Unlike other recent language representation models, BERT pre-trains deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representation can be fine-tuned with just one additional output layer, which makes it suitable for building state-of-the-art models for a wide range of tasks, such as question answering, text classification and language inference.
- This makes BERT a potent candidate for text-related tasks.
- Standard language models are unidirectional, which limits the types of architectures that can be used during pre-training and prevents the model from capturing the full context of a sentence.
Figure 1: The Transformer encoder-decoder architecture. The encoder and the decoder are very similar: both are built from multi-head attention and feed-forward neural networks.
BERT model architecture
BERT denotes the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. In all cases, the feed-forward (filter) size is set to 4H, which is 3072 for H=768 and 4096 for H=1024. This gives two model sizes:
BERT BASE: L=12, H=768, A=12, Total Parameters=110M
BERT LARGE: L=24, H=1024, A=16, Total Parameters=340M
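To see where these totals come from, here is a rough parameter count derived from L and H alone. It is a sketch under assumptions that are not stated in this article: a vocabulary of 30,522 tokens (the size of the released uncased vocabulary), 512 positions, 2 segment types, biases on every linear layer, and a pooler layer on top. The function name `bert_param_count` is made up for illustration.

```python
def bert_param_count(L, H, vocab_size=30522, max_positions=512, segments=2):
    """Approximate parameter count of a BERT encoder plus pooler.

    vocab_size, max_positions and segments are assumed defaults,
    not values taken from the article.
    """
    ffn = 4 * H  # feed-forward size is 4H
    # Token, position and segment embedding tables, plus one LayerNorm (scale + bias).
    embeddings = (vocab_size + max_positions + segments) * H + 2 * H
    # Query, key, value and output projections: four H x H matrices with biases.
    attention = 4 * (H * H + H)
    # Two linear layers, H -> 4H -> H, each with a bias.
    feed_forward = (H * ffn + ffn) + (ffn * H + H)
    # Two LayerNorms per Transformer block.
    layer_norms = 2 * (2 * H)
    per_layer = attention + feed_forward + layer_norms
    # Pooler: one H x H linear layer applied to the [CLS] output.
    pooler = H * H + H
    return embeddings + L * per_layer + pooler


base = bert_param_count(L=12, H=768)
large = bert_param_count(L=24, H=1024)
print(f"BERT BASE:  {base / 1e6:.1f}M parameters")
print(f"BERT LARGE: {large / 1e6:.1f}M parameters")
```

Under these assumptions the estimate comes out at roughly 109.5M and 335.1M, close to the rounded 110M and 340M quoted above. The number of heads A does not appear in the formula because the heads split the same H x H projections rather than adding new ones.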
BERT BASE was chosen to have the same model size as OpenAI GPT for comparison purposes. However, it is important to note that the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention, where each token can only attend to the context to its left. The research team noted that in the literature the bidirectional Transformer is often referred to as a “Transformer encoder”, while the left-context-only version is referred to as a “Transformer decoder” because it can be used for text generation. BERT therefore uses the encoder part of the Transformer to compute its embeddings, whereas models like GPT use the decoder part.
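The difference between the two kinds of self-attention can be pictured as a mask over token pairs. The snippet below is a minimal sketch, not code from either model: `attention_mask` is a hypothetical helper that returns 1 where a token is allowed to attend to another token and 0 where it is not.

```python
def attention_mask(seq_len, bidirectional):
    """Return a seq_len x seq_len matrix where mask[i][j] == 1
    means token i may attend to token j."""
    return [
        [1 if bidirectional or j <= i else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


tokens = ["the", "cat", "sat", "down"]
styles = [("BERT-style (bidirectional)", True), ("GPT-style (left context only)", False)]
for name, bidirectional in styles:
    print(name)
    for token, row in zip(tokens, attention_mask(len(tokens), bidirectional)):
        print(f"  {token:>5}: {row}")
```

In the bidirectional case every row is all ones, so "cat" can use "sat" and "down" as context. In the left-to-right case the matrix is lower triangular, so each token only sees itself and the tokens before it.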
The input representation can unambiguously represent either a single sentence or a pair of sentences (for example, [Question, Answer]) in one token sequence. For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. Figure 2 is a visual representation of the input representation:
Figure 2 (Source): BERT input representation. The input embedding is the sum of the token embeddings, the segment embeddings and the position embeddings.
- BERT uses WordPiece embeddings (Wu et al., 2016) with a vocabulary of 30,000 tokens.
- Positional embeddings are learned, and sequences of up to 512 tokens are supported.
- The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token (i.e., the output of the Transformer) is used as the aggregate sequence representation for classification tasks. This vector is ignored for non-classification tasks.
- Sentence pairs are packed together into a single sequence. The sentences are differentiated in two ways. First, they are separated by a special token ([SEP]). Second, a learned sentence A embedding is added to every token of the first sentence, and a learned sentence B embedding is added to every token of the second sentence.
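The packing and summation described above can be sketched in a few lines of plain Python. This is an illustration only: the helper names (`pack_pair`, `make_table`, `input_embeddings`) are made up, the embedding tables are tiny random vectors rather than learned weights, and the example sentences are assumed to be already split into WordPiece tokens.

```python
import random


def pack_pair(tokens_a, tokens_b=None):
    """Pack one or two tokenized sentences into a single BERT-style sequence."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # sentence A, including [CLS] and its [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B and its [SEP]
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def make_table(keys, dim, rng):
    """Toy embedding table: one random vector per key (stands in for learned weights)."""
    return {key: [rng.uniform(-1, 1) for _ in range(dim)] for key in keys}


def input_embeddings(tokens, segment_ids, position_ids, token_emb, segment_emb, position_emb):
    """Sum token, segment and position embeddings element-wise for each position."""
    return [
        [t + s + p for t, s, p in zip(token_emb[tok], segment_emb[seg], position_emb[pos])]
        for tok, seg, pos in zip(tokens, segment_ids, position_ids)
    ]


rng = random.Random(0)
dim = 4  # toy hidden size; BERT BASE uses 768
tokens, segment_ids, position_ids = pack_pair(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"]
)
token_emb = make_table(sorted(set(tokens)), dim, rng)
segment_emb = make_table([0, 1], dim, rng)
position_emb = make_table(position_ids, dim, rng)
embeddings = input_embeddings(
    tokens, segment_ids, position_ids, token_emb, segment_emb, position_emb
)

print(tokens)
print(segment_ids)
print(len(embeddings), "vectors of size", len(embeddings[0]))
```

For a single-sentence task such as the one in this article, `pack_pair` would be called with only the first argument, so every segment id is 0.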
Impact of the BERT model
BERT is a language representation model trained with a huge amount of data, a huge model, and enormous computational cost. It achieved state-of-the-art (SOTA) results on 11 natural language processing tasks. Many people will probably scoff at experiments of this scale, which are basically out of reach for most laboratories and researchers, but they do give us a lot of valuable experience. The high performance of the BERT model comes down to two factors: improvements to the model itself and, more importantly, the use of a large dataset (BooksCorpus, 800M words, plus English Wikipedia, 2,500M words) together with a large amount of computing power. Pre-training on related tasks also leads to a monotonic improvement in performance on the target tasks.
Overview of Task
Our task is to predict whether a sentence is grammatically correct or not. We use a dataset in which each piece of text comes with a label that indicates whether it is grammatically correct.
0 means grammatically incorrect and 1 means grammatically correct.
So essentially this problem translates into a text classification problem where we classify whether a sentence is grammatically correct or not.
Getting started with Modeling
We’ll use the [Corpus of Linguistic Acceptability (CoLA)](https://nyu-mll.github.io/CoLA/) dataset for single-sentence classification. It is a set of sentences labelled as grammatically correct or incorrect. It was first published in May 2018 and is one of the tests included in the “GLUE Benchmark”, on which models like BERT compete.
A snapshot of data
Number of training sentences: 8,551
| Row | Source | Label | Original annotation | Sentence |
|---|---|---|---|---|
| 4716 | ks08 | 1 | NaN | The son took care of his parents. |
| 418 | bc01 | 0 | * | John wonders where him to go. |
| 891 | bc01 | 1 | NaN | If Ron knows whether to wear a tuxedo and Cas… |
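As a rough sketch of how such rows could be read into sentences and labels, the snippet below parses tab-separated text with Python's standard `csv` module. It assumes the raw file has four columns with no header row (source code, label, original annotation, sentence), with an empty annotation field where the snapshot shows NaN. The sample text and the `read_cola` helper are illustrative, and in practice the function would be given an open file handle instead of a string buffer.

```python
import csv
import io

# Two rows copied from the snapshot, written in the assumed tab-separated layout.
SAMPLE_TSV = (
    "ks08\t1\t\tThe son took care of his parents.\n"
    "bc01\t0\t*\tJohn wonders where him to go.\n"
)


def read_cola(handle):
    """Read (sentence, label) pairs from a CoLA-style TSV file handle."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    sentences, labels = [], []
    for source, label, original_annotation, sentence in reader:
        sentences.append(sentence)
        labels.append(int(label))  # 0 = grammatically incorrect, 1 = correct
    return sentences, labels


sentences, labels = read_cola(io.StringIO(SAMPLE_TSV))
for sentence, label in zip(sentences, labels):
    print(label, sentence)
```

Quoting is switched off so that quotation marks inside a sentence are kept as ordinary characters.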
Training the BERT model involves the following steps:
- Installing the transformers library
- Loading the data and the tokenizer, then tokenizing the sentences. The tokenizer splits each sentence into WordPiece tokens.
- Getting the input ids, segment ids and attention masks from the tokens
- Splitting the data into train and validation sets
- Loading and choosing the BERT model (BERT for classification in our case)
- Specifying the optimizer and scheduler
- Training and evaluating the model, then running inference on our own examples.
- For example, if we enter the input “you doing good”, the model predicts ‘Grammatically Incorrect’, which is indeed the right answer.
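Three of the steps above do not depend on any particular library and can be sketched in plain Python: padding with attention masks, the train/validation split, and a linear learning-rate schedule with warmup. This is only an illustration under stated assumptions. The padding id 0, the 10% validation fraction, the peak learning rate of 2e-5 and all function names are choices made here, not values taken from the notebook. In the actual training code the transformers tokenizer and scheduler would do this work.

```python
import random

PAD_ID = 0  # assumed id of the [PAD] token


def pad_and_mask(input_ids, max_len):
    """Pad (or naively truncate) one id sequence and build its attention mask.

    A real tokenizer would keep the final [SEP] when truncating.
    """
    ids = input_ids[:max_len]
    mask = [1] * len(ids)  # 1 = real token
    padding = max_len - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding  # 0 = padding


def train_validation_split(examples, validation_fraction=0.1, seed=42):
    """Shuffle a copy of the examples and hold out a fraction for validation."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def linear_schedule(step, total_steps, warmup_steps, peak_lr):
    """Learning rate that rises linearly during warmup, then decays linearly to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Illustrative ids: 101 and 102 stand in for [CLS] and [SEP], the rest are made up.
ids, mask = pad_and_mask([101, 7, 8, 9, 102], max_len=8)
print(ids)
print(mask)

train, validation = train_validation_split(list(range(8551)))
print(len(train), "training examples,", len(validation), "validation examples")

for step in (0, 10, 55, 100):
    print(step, linear_schedule(step, total_steps=100, warmup_steps=10, peak_lr=2e-5))
```

The attention mask tells the model which positions hold real tokens, so that padding does not influence the result. The schedule has the same general shape as a linear warmup and decay schedule, and setting `warmup_steps=0` turns it into a plain linear decay.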
That’s all for now. Understanding BERT is not easy per se, but if you follow the basics and keep building small projects like this one, it becomes really intuitive. Good starting points are the NLP Specialization on Coursera by deeplearning.ai (https://www.coursera.org/specializations/natural-language-processing), The Illustrated BERT (http://jalammar.github.io/illustrated-bert/) and the BERT paper (https://arxiv.org/pdf/1810.04805.pdf).
Google Colab link for the code