How To Use BERT Transformer For Grammar Checking?

With the advent of AI, we are witnessing remarkable things that were once deemed impossible become completely achievable. AI has found its way into medicine, retail, e-commerce, IT and pretty much every other domain.

Nowadays AI can be used to write code, resumes and articles, drive cars, detect terminal diseases, optimise supply chains, find the shortest route, and the list goes on.

In this article we will get our hands on BERT and use it to classify whether a sentence is grammatically correct or not. Covering BERT in its entirety is out of scope for this blog, but I have linked a few resources if you would like to go in-depth with BERT.



As an overview:

BERT is essentially a language model built on the encoder part of the Transformer architecture.

  • BERT stands for Bidirectional Encoder Representations from Transformers. Unlike other recent language representation models, BERT is designed to pre-train deep bidirectional representations by conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just an additional output layer, making them suitable for building state-of-the-art models for a wide range of tasks, such as question answering, text classification, sentence generation etc.
  • This makes BERT a potent candidate for text-related tasks.
  • Standard language modelling is unidirectional, which limits the choice of architectures that can be used during pre-training and prevents the model from capturing the context of sentences in an optimal way.

This image shows the Transformer's encoder-decoder architecture and how similar the two halves are, both built around multi-head attention and feed-forward neural networks.


BERT model architecture

BERT denotes the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. In all cases, the feed-forward/filter size is set to 4H, i.e., 3072 for H=768 and 4096 for H=1024. This leads to two sizes of the BERT model:

BERT BASE: L=12, H=768, A=12, Total Parameters=110M

BERT LARGE: L=24, H=1024, A=16, Total Parameters=340M
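These two configurations can be reproduced locally with Hugging Face's `BertConfig` (a sketch: only the architecture description is built here, no pre-trained weights are downloaded):

```python
from transformers import BertConfig

# BERT BASE: L=12, H=768, A=12
base = BertConfig(num_hidden_layers=12, hidden_size=768,
                  num_attention_heads=12, intermediate_size=4 * 768)

# BERT LARGE: L=24, H=1024, A=16
large = BertConfig(num_hidden_layers=24, hidden_size=1024,
                   num_attention_heads=16, intermediate_size=4 * 1024)

# The feed-forward ("intermediate") size is 4H in both cases
print(base.intermediate_size)   # 3072
print(large.intermediate_size)  # 4096
```

Passing either config to `BertModel` would instantiate a randomly initialised model of that size; `from_pretrained` is what loads the published weights.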

BERT BASE was chosen to have the same model size as OpenAI GPT for comparison purposes. However, it is important to note that the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses restricted self-attention, where each token can only attend to the context to its left. The research team noted that in the literature, the bidirectional Transformer is often referred to as the "Transformer encoder", while the left-context-only version is referred to as the "Transformer decoder", since the decoder can be used for text generation. BERT uses the encoder part of the Transformer to compute its embeddings, whereas models like GPT use the decoder part.
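The difference between the two attention patterns can be made concrete with plain NumPy. This is an illustrative sketch, not library code: `True` means "may attend to", and a real implementation would instead add a large negative number to the masked attention logits.

```python
import numpy as np

seq_len = 5

# BERT-style (encoder): every token may attend to every other token
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)

# GPT-style (decoder): token i may only attend to tokens 0..i (its left context)
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# The token at position 2 sees the whole sequence under BERT ...
print(bidirectional_mask[2])  # [ True  True  True  True  True]
# ... but only its left context under GPT
print(causal_mask[2])         # [ True  True  True False False]
```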

Input representation

The input representation can unambiguously represent either a single text sentence or a pair of text sentences in one token sequence (for example, [Question, Answer]). For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. Figure 2 is a visual representation of this:

  Figure 2 (Source): BERT input representation. Input embedding is the sum of token embeddings, segmentation embeddings and position embeddings.

  • BERT uses WordPiece embeddings (Wu et al., 2016) with a vocabulary of 30,000 tokens.
  • Learned positional embeddings support sequences of up to 512 tokens in length.
  • The first token of every sequence is always the special classification token ([CLS]). The final hidden state corresponding to this token (i.e., the output of the Transformer) is used as the aggregate sequence representation for classification tasks. For non-classification tasks, this vector is ignored.
  • Sentence pairs are packed into a single sequence and differentiated in two ways. First, they are separated by a special token ([SEP]). Second, a learned sentence A embedding is added to every token of the first sentence, and a sentence B embedding to every token of the second sentence.
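The summing of the three embeddings can be sketched with a toy example in NumPy. All sizes and token ids here are made up for illustration (real BERT uses a ~30k vocabulary, 512 positions, and hidden size 768 or 1024):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, hidden = 8, 16, 4   # toy sizes, not BERT's real ones

token_emb    = rng.normal(size=(vocab_size, hidden))  # one row per WordPiece token
segment_emb  = rng.normal(size=(2, hidden))           # sentence A vs sentence B
position_emb = rng.normal(size=(max_len, hidden))     # learned positions

# Toy ids standing in for "[CLS] my dog [SEP] he barks [SEP]";
# segment 0 marks sentence A tokens, segment 1 marks sentence B tokens
input_ids   = np.array([0, 4, 5, 1, 6, 7, 1])
segment_ids = np.array([0, 0, 0, 0, 1, 1, 1])
positions   = np.arange(len(input_ids))

# Input representation: element-wise sum of the three embeddings per token
input_repr = token_emb[input_ids] + segment_emb[segment_ids] + position_emb[positions]
print(input_repr.shape)  # (7, 4): one hidden-size vector per token
```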

Impact of the BERT model

BERT is a language representation model trained with huge data, a huge model, and enormous computational overhead. It achieves state-of-the-art (SOTA) results on 11 natural language processing tasks. Experiments at this scale are basically out of reach for most laboratories and researchers, but they do give us a lot of valuable experience. The high performance of the BERT model comes down to two things: besides improvements to the model itself, the more important factors are the use of a large dataset (BooksCorpus, 800M words, plus English Wikipedia, 2,500M words) and massive computing power. Pre-training on related tasks yields steady gains in performance on the target tasks.

Overview of Task

Now the task at hand is to predict whether a sentence is grammatically correct or not. We choose a dataset in which each piece of text comes with a label indicating whether it is grammatically correct.

0 means grammatically incorrect and 1 means grammatically correct.

So essentially this problem translates into a text classification problem where we classify whether a sentence is grammatically correct or not.

Getting started with Modeling

We’ll use The Corpus of Linguistic Acceptability (CoLA) dataset for single-sentence classification. It’s a set of sentences labelled as grammatically correct or incorrect. It was first published in May 2018 and is one of the tasks included in the GLUE Benchmark, on which models like BERT compete.

A snapshot of data

Number of training sentences: 8,551

       sentence_source  label  label_notes  sentence
4716   ks08             1      NaN          The son took care of his parents.
418    bc01             0      *            John wonders where him to go.
891    bc01             1      NaN          If Ron knows whether to wear a tuxedo and Cas…
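The snapshot above can be rebuilt as a small pandas DataFrame. Note that the raw CoLA TSV files have four unnamed, tab-separated columns; the column names used here are conventional labels, not part of the files themselves.

```python
import pandas as pd

cols = ["sentence_source", "label", "label_notes", "sentence"]

# The three snapshot rows, reconstructed as a DataFrame
df = pd.DataFrame(
    [["ks08", 1, None, "The son took care of his parents."],
     ["bc01", 0, "*",  "John wonders where him to go."],
     ["bc01", 1, None, "If Ron knows whether to wear a tuxedo and Cas..."]],
    columns=cols,
    index=[4716, 418, 891],
)

# The full training file would be read the same way, e.g.:
#   df = pd.read_csv("in_domain_train.tsv", sep="\t", header=None, names=cols)

sentences = df["sentence"].tolist()
labels = df["label"].tolist()
print(labels)  # [1, 0, 1]
```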

Training the BERT model involves the following steps:

  • Installing the transformers library
  • Loading the data and the tokenizer, then tokenizing the sentences. The tokenizer splits each sentence into (sub)word tokens.
  • Getting input IDs, segment IDs and attention masks from the tokens
  • Splitting the data into training and validation sets
  • Loading and choosing the BERT model (a BERT classification model in our case)
  • Specifying the optimizer and scheduler
  • Training and evaluating the model, then running inference on our own examples.
  • For example, if we enter the input “you doing good”, the model predicts “Grammatically Incorrect”, which is indeed correct.
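The splitting, optimisation, training and evaluation steps above can be sketched as a generic PyTorch fine-tuning loop. To keep the sketch self-contained, a tiny toy classifier stands in for BERT here; in the real pipeline the model would be `BertForSequenceClassification.from_pretrained("bert-base-uncased")`, the `input_ids` and `attention_mask` tensors would come from `BertTokenizer`, and the scheduler from `transformers.get_linear_schedule_with_warmup`. Every name in the toy model is hypothetical.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, random_split

def fine_tune(model, dataset, epochs=2, lr=2e-5, batch_size=4, seed=0):
    """Split the data, set up optimizer + scheduler, train, then evaluate."""
    torch.manual_seed(seed)
    n_val = max(1, int(0.1 * len(dataset)))          # 90/10 train/validation split
    train_ds, val_ds = random_split(dataset, [len(dataset) - n_val, n_val])
    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_ds, batch_size=batch_size)

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    total_steps = epochs * len(train_loader)
    # Linear decay to zero, mimicking transformers' linear schedule (no warmup)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: max(0.0, 1.0 - step / total_steps))
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        model.train()
        for input_ids, attention_mask, labels in train_loader:
            optimizer.zero_grad()
            logits = model(input_ids, attention_mask)
            loss = loss_fn(logits, labels)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()

    # Evaluate: fraction of validation sentences classified correctly
    model.eval()
    correct = 0
    with torch.no_grad():
        for input_ids, attention_mask, labels in val_loader:
            preds = model(input_ids, attention_mask).argmax(dim=-1)
            correct += (preds == labels).sum().item()
    return correct / len(val_ds)

# Toy stand-in for BERT: mean-pooled embeddings + linear head (NOT real BERT)
class TinyClassifier(nn.Module):
    def __init__(self, vocab_size=100, hidden=16, num_labels=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (self.emb(input_ids) * mask).sum(1) / mask.sum(1).clamp(min=1)
        return self.head(pooled)

# Random stand-in data with the same shapes the tokenizer would produce
ids = torch.randint(0, 100, (32, 8))
masks = torch.ones(32, 8, dtype=torch.long)
labels = torch.randint(0, 2, (32,))
acc = fine_tune(TinyClassifier(), TensorDataset(ids, masks, labels))
```

With the real BERT model and tokenized CoLA sentences substituted in, this same loop structure is what produces the grammatical-acceptability classifier described above.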

Closing Notes

That’s all for now. Understanding BERT is not an easy thing per se, but if you follow the basics and keep building small projects like this one, it becomes really intuitive. A good starting point is the NLP Specialization on Coursera by deeplearning.ai.

Google Colab link for the code

Ravi Tanwar
Data Scientist working at a cognitive search based company. Kaggle 2X Expert and interested in Data Science, Machine Learning, Sequential Data and Music
