Google’s transformer-based models such as BERT have been immensely successful on NLP tasks, but they come with a significant limitation: memory requirements that grow quadratically with sequence length. Much of this can be attributed to the full attention mechanism, in which every token attends to every other token, and it makes processing longer text sequences challenging. To address the problem, Google released a new deep learning model, BigBird, whose sparse attention mechanism reduces this quadratic dependency to a linear one.
According to a recent paper on BigBird, the researchers showed that, despite using sparse attention, BigBird preserves the properties of quadratic, full-attention models. This allows the model to process text sequences up to eight times longer than those handled by other transformer models, with improved performance.
The researchers also addressed two key questions with BigBird: first, whether the practical benefits of a fully quadratic self-attention scheme can be achieved using fewer inner products; and second, whether sparse attention mechanisms preserve the expressivity and flexibility of the original network.
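To make the difference between quadratic and linear dependency concrete, the short sketch below counts attention scores for a short and a long sequence. The helper names, the 512-token baseline (a typical BERT input limit) and the per-token budget of 256 keys are assumptions chosen for illustration; they are not figures from the paper.

```python
def full_attention_scores(seq_len: int) -> int:
    """Full self-attention compares every token with every token."""
    return seq_len * seq_len


def sparse_attention_scores(seq_len: int, keys_per_token: int) -> int:
    """Sparse attention compares each token with a fixed number of tokens."""
    return seq_len * min(keys_per_token, seq_len)


# Assumed values, for illustration only.
SHORT_LEN = 512            # typical BERT input limit
LONG_LEN = 8 * SHORT_LEN   # eight times longer, i.e. 4,096 tokens
KEYS_PER_TOKEN = 256       # hypothetical sparse budget per token

full_growth = full_attention_scores(LONG_LEN) / full_attention_scores(SHORT_LEN)
sparse_growth = (
    sparse_attention_scores(LONG_LEN, KEYS_PER_TOKEN)
    / sparse_attention_scores(SHORT_LEN, KEYS_PER_TOKEN)
)

print(f"full attention grows {full_growth:.0f}x")      # 64x
print(f"sparse attention grows {sparse_growth:.0f}x")  # 8x
```

Under these assumptions, an 8x longer input costs 64 times as many attention scores with full attention, but only 8 times as many when each token attends to a fixed number of positions.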
BigBird’s Sparse Attention Mechanism For Longer Sequences
On the theoretical side, the researchers showed in the paper that a sparse attention mechanism can be as expressive as a full-attention mechanism. They established two results: first, that sparse attention mechanisms used in standalone encoders such as BERT are universal approximators of sequence-to-sequence functions; and second, that sparse encoder-decoder transformers are Turing complete.
In doing so, the researchers found that the main challenge with a sparse attention mechanism is computing the contextual mapping. To get around this, they developed a sparse shift operator for managing the entries of the matrices. Further, unlike other transformer models, where full attention is applied directly, the sparse attention mechanism works token by token. To use the proposed mechanism, they therefore also needed to define a suitable modification of the tokens to control how they react.
Through these steps, the researchers were also able to show that the sparse attention mechanism comes at an additional cost, since it can require polynomially more layers.
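As a rough illustration of what a token-by-token sparse attention pattern can look like, the sketch below builds, for each token, the list of positions it attends to by combining a local sliding window, a few global tokens and a few random positions, which are the three components the BigBird paper describes. The function name, the sizes and the random seed are assumptions made for this example; it is not the authors’ implementation, which operates on blocks of tokens rather than individual tokens.

```python
import random


def sparse_attention_pattern(seq_len, window=3, num_global=2, num_random=2, seed=0):
    """Return, for each query position, the sorted key positions it attends to."""
    rng = random.Random(seed)
    half = window // 2
    global_positions = set(range(min(num_global, seq_len)))
    pattern = []
    for i in range(seq_len):
        if i in global_positions:
            # Global tokens attend to every position.
            keys = set(range(seq_len))
        else:
            # Local sliding window centred on position i.
            keys = set(range(max(0, i - half), min(seq_len, i + half + 1)))
            # Every token also attends to the global tokens.
            keys |= global_positions
            # Plus a few randomly chosen positions not already covered.
            candidates = [j for j in range(seq_len) if j not in keys]
            keys |= set(rng.sample(candidates, min(num_random, len(candidates))))
        pattern.append(sorted(keys))
    return pattern


seq_len = 64
pattern = sparse_attention_pattern(seq_len)
sparse_products = sum(len(keys) for keys in pattern)
full_products = seq_len * seq_len
print(f"inner products: sparse={sparse_products}, full={full_products}")
```

In this sketch every non-global token attends to at most window + num_global + num_random positions regardless of sequence length, which is why the number of inner products grows linearly rather than quadratically.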
For the practical experiments, the researchers chose NLP and genomics tasks to highlight BigBird’s capabilities.
For the NLP experiments, the researchers chose three representative tasks: basic masked language modelling, to obtain better contextual representations of longer sequences; question answering over longer sequences; and long-document classification, which involves extracting information from extended text. Following the experiments, the researchers reported that with an 8x increase in sequence length, the model achieved strong performance on these tasks.
Image 02: Fine-tuning results for QA tasks
The tasks chosen for the comparison are highly competitive, and each dataset has required multiple highly engineered systems to conform to its respective output format. To obtain fair and accurate results, the researchers had to use some additional regularisation when training the model.
Image 03: Result of the document classification task.
Coming to the genomics tasks, the researchers fine-tuned the BigBird model on a dataset built from EPDnew and reported F1 on a test dataset. Compared with previous models, the proposed model achieved a 5% jump in accuracy.
BigBird vs BERT
When Google’s BigBird was compared with other transformer models such as BERT, it outperformed both RoBERTa and Longformer on all four question-answering datasets: HotpotQA, NaturalQ, TriviaQA, and WikiHop. (Image 02)
For the document classification task, the model was compared on the IMDB, Yelp-5, ArXiv, Patents and Hyperpartisan datasets. The results highlight BigBird’s performance on the ArXiv dataset in particular, where it surpassed RoBERTa and the previous SoTA with an F1 score of 92.31%. (Image 03)
In addition, Google’s BigBird was compared with CNNProm and DeePromoter on the genomics tasks. Here, the results showed that the proposed model achieved 99.9% accuracy, a 5% improvement over the previous models.
With these results in hand, the researchers established that, with a sequence length of 4,096, the BigBird model delivers strong results in both the theoretical and the practical experiments. According to the researchers, the results are complemented by a demonstration that moving to a sparse attention mechanism brings advantages but also incurs a cost.