Google Research recently introduced BigBird, a sparse attention mechanism that improves performance on a multitude of tasks requiring long contexts. The researchers took inspiration from graph sparsification methods.
They examined where the proof of the expressiveness of Transformers breaks down when full attention is relaxed to form the proposed sparse attention pattern. They stated, “This understanding helped us develop BigBird, which is theoretically as expressive and also empirically useful.”
Why is BigBird Important?
Bidirectional Encoder Representations from Transformers (BERT), a neural network-based technique for natural language processing (NLP) pre-training, has gained immense popularity in the last two years. With it, anyone can train their own state-of-the-art question answering system.
However, a core limitation of Transformer-based models such as BERT is that their full attention mechanism has a quadratic dependency on sequence length, mainly in terms of memory. This makes processing long sequences costly. To mitigate this issue, the researchers introduced BigBird.
BigBird is a universal approximator of sequence functions and is designed to satisfy all the known theoretical properties of full transformers. According to the researchers, this sparse attention can handle sequences up to 8x longer than was previously possible on similar hardware.
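To make the quadratic dependency concrete, here is a minimal sketch that counts the query-key scores full attention computes for a single head. The function names and the 4 bytes per score (float32) are assumptions chosen for illustration, not figures from the paper.

```python
def full_attention_scores(seq_len: int) -> int:
    """Number of query-key scores full attention computes for one head."""
    # Every token attends to every token, so the score matrix is seq_len x seq_len.
    return seq_len * seq_len


def score_memory_mib(seq_len: int, bytes_per_score: int = 4) -> float:
    """Memory for one score matrix in MiB, assuming float32 scores (an assumption)."""
    return full_attention_scores(seq_len) * bytes_per_score / 2**20


if __name__ == "__main__":
    for n in (512, 1024, 4096):
        print(n, full_attention_scores(n), score_memory_mib(n))
```

Doubling the sequence length quadruples the number of scores, which is why long documents quickly become expensive under full attention.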
In particular, BigBird's attention consists of three main parts:
- A set of global tokens that attend to all parts of the sequence.
- A set of random keys that each query attends to.
- A block of local neighbours, so that each token attends to its local structure.
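The three parts above can be sketched as a boolean attention mask. This is a simplified, token-level illustration under stated assumptions, not the researchers' implementation: the function name, the parameters `num_global`, `window` and `num_random`, and the choice of the first tokens as global tokens are all hypothetical.

```python
import random


def bigbird_style_mask(seq_len, num_global=2, window=3, num_random=2, seed=0):
    """Return mask[q][k] == True where query q may attend to key k."""
    rng = random.Random(seed)
    half = window // 2
    mask = [[False] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        for k in range(seq_len):
            # Global part: the first num_global tokens attend everywhere
            # and are attended to by every token (an illustrative choice).
            is_global = q < num_global or k < num_global
            # Local part: a sliding window centred on the query position.
            is_local = abs(q - k) <= half
            mask[q][k] = is_global or is_local
        # Random part: a few extra keys drawn from those not yet covered.
        candidates = [k for k in range(seq_len) if not mask[q][k]]
        for k in rng.sample(candidates, min(num_random, len(candidates))):
            mask[q][k] = True
    return mask


if __name__ == "__main__":
    # Print the pattern: '#' marks an allowed query-key pair.
    for row in bigbird_style_mask(16):
        print("".join("#" if allowed else "." for allowed in row))
```

Away from the sequence edges, each non-global query attends to a fixed number of keys (global plus window plus random), regardless of how long the sequence is.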
To train the encoder of the model, the researchers used four challenging datasets:
1| Natural Questions: The Natural Questions corpus is a question answering dataset. It consists of 307,373 training examples with single annotations, 7,830 examples with 5-way annotations as development data, and a further 7,842 5-way annotated examples sequestered as test data.
2| HotpotQA-distractor: HotpotQA is a large-scale dataset with 113k Wikipedia-based question-answer pairs. The dataset is collected by crowdsourcing based on Wikipedia articles, where crowd workers are shown multiple supporting context documents and asked explicitly to come up with questions requiring reasoning about all of the documents.
3| TriviaQA-wiki: TriviaQA is a large-scale challenging reading comprehension dataset containing over 650K question-answer-evidence triples. The dataset includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high-quality distant supervision for answering the questions.
4| WikiHop: The WikiHop dataset consists of sets of Wikipedia articles in which the answer to a query about a specific property of an entity cannot be found in that entity’s own article.
Contributions of This Research
The main contributions of this research are:
- BigBird satisfies all the known theoretical properties of a full transformer. In particular, the researchers showed that adding extra tokens allows one to express all continuous sequence-to-sequence functions with only O(n) inner products. They also showed that, under standard assumptions regarding precision, BigBird is Turing complete.
- They showed that the extended context modelled by BigBird greatly benefits a variety of NLP tasks. In particular, the researchers achieved state-of-the-art results for question-answering and document summarisation on several different datasets.
- Lastly, they introduced a novel application of attention-based models in which long contexts are beneficial: extracting contextual representations of genomics sequences such as DNA. With longer masked LM pretraining, BigBird also improves performance on downstream tasks such as promoter-region and chromatin profile prediction.
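The O(n) inner products mentioned in the first contribution can be illustrated with a rough count. The sketch below gives an upper bound on the number of query-key scores when each ordinary query sees a fixed number of global, local and random keys. The function name and the default sizes are hypothetical values chosen for illustration.

```python
def sparse_score_upper_bound(seq_len, num_global=2, window=3, num_random=3):
    """Upper bound on query-key scores under a global + window + random pattern."""
    g = min(num_global, seq_len)
    # Global queries attend to the whole sequence.
    global_rows = g * seq_len
    # Every other query sees at most a constant number of keys.
    per_query = min(seq_len, g + window + num_random)
    return global_rows + (seq_len - g) * per_query


if __name__ == "__main__":
    for n in (4096, 8192):
        # Compare the sparse bound with the n * n scores of full attention.
        print(n, sparse_score_upper_bound(n), n * n)
```

With the per-query budget held constant, doubling the sequence length roughly doubles the sparse count, whereas the full-attention count quadruples.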
BigBird comes with strong theoretical guarantees: it is a universal approximator of sequence-to-sequence functions and is Turing complete. Because it can handle longer contexts, BigBird drastically improves performance on various NLP tasks such as question answering and long document summarisation.
Furthermore, the researchers proposed novel applications to genomics data by introducing an attention-based contextual language model for DNA and fine-tuning it for downstream tasks such as promoter region prediction and predicting the effects of non-coding variants.