
BigBird – Google’s ‘Brahmastra’ For NLP Supremacy?

Google’s transformer-based models like BERT have shown immense success on NLP tasks; however, they come with a significant limitation: memory that grows quadratically with sequence length. This can be attributed to the full self-attention mechanism, in which every token attends to every other token, which creates serious challenges when processing longer text sequences. To address this problem, Google released a new deep learning model — BigBird — whose sparse attention mechanism replaces the quadratic dependency with a linear one.
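The scaling difference can be made concrete with a back-of-the-envelope count of attention-score entries (an illustrative sketch, not a figure from the paper; the per-token budget of 192 below is an arbitrary example value):

```python
def attention_memory(seq_len, per_token_budget=None):
    """Count of attention-score entries that must be stored.

    Full attention: every token attends to every token  -> n * n entries.
    Sparse attention with a fixed budget of k attended
    positions per token                                 -> n * k entries.
    """
    if per_token_budget is None:
        return seq_len * seq_len           # quadratic in sequence length
    return seq_len * per_token_budget      # linear in sequence length

# Doubling the sequence quadruples full-attention storage,
# but only doubles the sparse-attention storage.
print(attention_memory(4096))        # 16777216 entries
print(attention_memory(4096, 192))   # 786432 entries
```

This is why full attention becomes impractical at sequence lengths like 4,096 while a fixed per-token budget stays manageable.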

In a recent paper on BigBird, researchers showed that, despite using sparse attention, the new model preserves the properties of quadratic, full-attention models. This ability has enabled the model to deliver enhanced performance on text sequences up to eight times longer than other transformer models can process.



Further, the researchers answered two significant questions with BigBird — first, whether the practical benefits of the fully quadratic self-attention scheme can be achieved using fewer inner products; and second, whether sparse attention mechanisms preserve the expressivity and flexibility of the original network.


BigBird’s Sparse Attention Mechanism For Longer Sequences

On the theoretical side, the researchers showed in the paper that the sparse attention mechanism can be as beneficial and expressive as a full-attention mechanism. They first established that sparse attention mechanisms in standalone encoders such as BERT are universal approximators of sequence-to-sequence functions, and second, that sparse encoder-decoder transformers are Turing complete.

While experimenting, the researchers found that the main challenge for the sparse attention mechanism is computing the contextual mapping. To get around this, they developed a sparse shift operator for managing the entries of the matrices. Further, unlike other transformer models where full attention is applied directly, the sparse attention mechanism works token by token; therefore, to use the proposed mechanism, they also needed to define a suitable modification of the tokens to control their interactions.
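BigBird’s sparse pattern combines three kinds of connections — a sliding local window, a few global tokens, and random links. A minimal NumPy sketch of building such a mask is shown below (the window size, token counts and random links are simplified illustrative choices, not the paper’s exact configuration):

```python
import numpy as np

def bigbird_mask(n, window=3, n_global=2, n_random=2, seed=0):
    """Boolean attention mask: mask[i, j] == True lets token i attend to j.

    Combines the three BigBird-style components:
      - sliding window: each token attends to its local neighbourhood
      - global tokens: a few tokens attend everywhere and are attended by all
      - random links: each token attends to a few random positions
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)

    # Sliding-window attention over a local neighbourhood.
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True

    # Global tokens attend to all positions and are attended by all.
    mask[:n_global, :] = True
    mask[:, :n_global] = True

    # Random attention: each token attends to a few random positions.
    for i in range(n):
        mask[i, rng.integers(0, n, size=n_random)] = True

    return mask

m = bigbird_mask(16)
# Each non-global row stays sparse: far fewer than n attended positions.
print(m.sum(axis=1))
```

Since each token’s budget (window + global + random) is fixed, the number of True entries grows linearly with n rather than quadratically.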

With these steps, the researchers were also able to show that the sparse attention mechanism comes at an additional cost: it requires polynomially more layers.

For the practical experiments, the researchers chose NLP and genomics tasks to highlight BigBird’s capabilities.


For the NLP experiments, the researchers chose three representative tasks — basic masked language modelling, for better contextual representation of longer sequences; question answering over longer sequences; and long-document classification for information extraction. Following the experiments, the researchers highlighted that, with an 8x increase in sequence length, the model achieved extraordinary performance on the tasks above.

Image 02: Fine-tuning results for QA tasks

The tasks chosen for the comparison were highly competitive and therefore required multiple highly engineered systems to conform each dataset to its respective output format. To obtain fair and accurate results, the researchers had to apply additional regularisation while training the model.

Image 03: Result of the document classification task.

Coming to the genomics tasks, the researchers fine-tuned the BigBird model on a dataset built from EPDnew and reported F1 on a test dataset. Compared with previous models, the proposed model achieved a 5% jump in accuracy.


BigBird vs BERT

When Google’s BigBird is compared with other transformer models like BERT, the proposed model outperformed both RoBERTa and Longformer, with extraordinary results across all four question-answering datasets — HotpotQA, NaturalQ, TriviaQA, and WikiHop. (Image 02)

Further, on the document classification task, the model was compared on the IMDb, Yelp-5, ArXiv, Patents and Hyperpartisan datasets. The results highlight BigBird’s remarkable capability on the ArXiv dataset, where it surpassed RoBERTa and the previous SoTA with an F1 score of 92.31%. (Image 03)

Along with these, Google’s BigBird was also compared with CNNProm and DeePromoter on genomics tasks. Here, the results showed that the proposed model achieved 99.9% accuracy — a 5% improvement over the previous models.


Wrapping Up

With these results in hand, it can be established that, at a sequence length of 4,096, the BigBird model provides accurate and precise results in both theoretical and practical experiments. According to the researchers, the results are complemented by showing that moving to a sparse attention mechanism brings advantages but also incurs a cost.

Read the whole paper here.


Sejuti Das
Sejuti currently works as Associate Editor at Analytics India Magazine (AIM).
