What Is Google’s Recently Launched BigBird

Google Research recently introduced BigBird, a new sparse attention mechanism that improves performance on a multitude of tasks that require long contexts. The researchers took inspiration from graph sparsification methods.

They studied where the proof for the expressiveness of Transformers breaks down when full attention is relaxed to form the proposed attention pattern. They stated, “This understanding helped us develop BigBird, which is theoretically as expressive and also empirically useful.”

Why is BigBird Important?

Bidirectional Encoder Representations from Transformers (BERT), a neural network-based technique for natural language processing (NLP) pre-training, has gained immense popularity in the last two years. It lets anyone train their own state-of-the-art question answering system.

However, one of the core limitations of Transformer-based models such as BERT is that their full attention mechanism has a quadratic dependency on sequence length, mainly in terms of memory. This makes it costly to process long sequences with such models. To mitigate this issue, the researchers introduced BigBird.
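To make the quadratic dependency concrete, here is a minimal sketch that counts the query-key scores full attention has to store for a single head. The function names and the 4-byte (float32) score size are assumptions chosen for illustration, not figures from the paper.

```python
def full_attention_scores(seq_len: int) -> int:
    """Number of query-key scores in full attention: one per (query, key) pair."""
    return seq_len * seq_len


def attention_matrix_bytes(seq_len: int, bytes_per_score: int = 4) -> int:
    """Memory for one attention matrix of one head (float32 scores assumed)."""
    return full_attention_scores(seq_len) * bytes_per_score


# Doubling the sequence length quadruples the number of scores.
for n in (512, 1024, 2048, 4096):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"n={n:5d}  scores={full_attention_scores(n):>10,d}  memory={mib:6.1f} MiB")
```

Each doubling of the sequence length multiplies the memory by four, which is why long documents quickly become expensive under full attention.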

Behind BigBird

BigBird is a universal approximator of sequence functions, designed to satisfy all the known theoretical properties of full transformers. According to the researchers, this sparse attention can handle sequences up to 8x longer than was previously possible on similar hardware.

In particular, BigBird's attention consists of three main parts:

  • A set of global tokens that attend to all parts of the sequence.
  • A set of random keys that each query attends to.
  • A block of local neighbours, so that each token attends to its local structure.
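The three parts above can be sketched as a per-query list of key positions. This is a simplified, token-level illustration rather than the researchers' implementation (which operates on blocks of tokens); the function name, the default sizes, and the choice of the first positions as global tokens are all assumptions.

```python
import random


def bigbird_style_pattern(seq_len, num_global=2, window=3, num_random=2, seed=0):
    """Return a list whose entry i is the sorted key positions query i attends to."""
    rng = random.Random(seed)
    half = window // 2
    # Assumption: the first `num_global` positions act as the global tokens.
    global_positions = set(range(min(num_global, seq_len)))
    pattern = []
    for i in range(seq_len):
        if i in global_positions:
            # Global tokens attend to every position in the sequence.
            keys = set(range(seq_len))
        else:
            # Every other token attends to the global tokens ...
            keys = set(global_positions)
            # ... to its local neighbours within the window ...
            keys.update(range(max(0, i - half), min(seq_len, i + half + 1)))
            # ... and to a few randomly chosen keys.
            keys.update(rng.sample(range(seq_len), num_random))
        pattern.append(sorted(keys))
    return pattern


pattern = bigbird_style_pattern(seq_len=16)
for i, keys in enumerate(pattern):
    row = "".join("#" if j in keys else "." for j in range(16))
    print(f"{i:2d} {row}")
```

In the printed grid, the full rows and columns at the top and left come from the global tokens, the diagonal band from the local window, and the scattered marks from the random keys. A random key may coincide with a global or local position, so a row can hold fewer distinct keys than the three sizes add up to.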

Datasets Used

To train the encoder of the model, the researchers used four challenging datasets:

1| Natural Questions: The Natural Questions corpus is a question answering dataset. It consists of 307,373 training examples with single annotations, 7,830 development examples with 5-way annotations, and a further 7,842 5-way annotated examples sequestered as test data.

2| HotpotQA-distractor: HotpotQA is a large-scale dataset with 113k Wikipedia-based question-answer pairs. The dataset is collected by crowdsourcing based on Wikipedia articles, where crowd workers are shown multiple supporting context documents and asked explicitly to come up with questions requiring reasoning about all of the documents.

3| TriviaQA-wiki: TriviaQA is a large-scale challenging reading comprehension dataset containing over 650K question-answer-evidence triples. The dataset includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high-quality distant supervision for answering the questions.

4| WikiHop: The WikiHop dataset consists of sets of Wikipedia articles where answers to queries about specific properties of an entity cannot be located in the entity’s own article.

Contributions of This Research

The main contributions of this research are:

  • BigBird satisfies all the known theoretical properties of a full transformer. In particular, the researchers showed that adding extra tokens allows one to express all continuous sequence-to-sequence functions with only O(n) inner products. They also showed that, under standard assumptions regarding precision, BigBird is Turing complete.
  • They showed that the extended context modelled by BigBird greatly benefits a variety of NLP tasks. In particular, the researchers achieved state-of-the-art results for question answering and document summarisation on several different datasets.
  • Lastly, they introduced a novel application of attention-based models where long contexts are beneficial: extracting contextual representations of genomics sequences such as DNA. With longer masked LM pretraining, BigBird improves performance on downstream tasks such as promoter-region and chromatin-profile prediction.
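The O(n) figure in the first contribution can be sanity-checked with a rough count. The symbols below are assumptions introduced for illustration: g global tokens, w local neighbours and r random keys per query, all fixed constants that do not grow with the sequence length n.

```latex
\[
\underbrace{g\,n}_{\text{global tokens attend to all } n \text{ keys}}
\;+\;
\underbrace{(n-g)\,(g+w+r)}_{\text{every other token attends to at most } g+w+r \text{ keys}}
\;\le\; (2g+w+r)\,n \;=\; O(n),
\qquad\text{versus } n^{2} \text{ for full attention.}
\]
```

Because g, w and r stay fixed, doubling n roughly doubles the number of inner products instead of quadrupling it.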

Wrapping Up

BigBird comes with several theoretical guarantees: it is a universal approximator of sequence-to-sequence functions and is Turing complete. Thanks to its ability to handle longer contexts, BigBird drastically improves performance on various NLP tasks such as question answering and long document summarisation.

Furthermore, the researchers proposed novel applications to genomics data by introducing an attention-based contextual language model for DNA and fine-tuning it for downstream tasks such as promoter region prediction and predicting the effects of non-coding variants.
Read the paper here.

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.