Natural Language Processing (NLP) is one of the most diverse domains in emerging technology. Last year, search engine giant Google open-sourced a technique known as Bidirectional Encoder Representations from Transformers (BERT) for NLP pre-training. With it, researchers could train state-of-the-art models in about 30 minutes on a single Cloud TPU, or in a few hours on a single GPU.
Now, researchers at Google have designed A Lite BERT (ALBERT), a modified version of the traditional BERT model. The new model incorporates two parameter-reduction techniques, factorised embedding parameterisation and cross-layer parameter sharing, to lift the major obstacles in scaling pre-trained NLP models. ALBERT has set new state-of-the-art results on the main natural language understanding (NLU) benchmarks: GLUE, RACE, and SQuAD 2.0.
How It Is Better Than Traditional BERT
Simply enlarging the hidden size of a model like BERT-large can actually degrade its performance. A parameter-reduction technique such as factorised embedding parameterisation separates the size of the hidden layers from the size of the vocabulary embedding, which makes it easy to grow the hidden size without significantly increasing the parameter count, while cross-layer parameter sharing prevents the parameter count from growing with the depth of the network. Together, the two techniques significantly reduce the number of parameters compared to traditional BERT without seriously hurting performance, thus improving parameter efficiency. The performance of ALBERT is further improved by introducing a self-supervised loss for sentence-order prediction (SOP).
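To make the two parameter-reduction ideas concrete, here is a minimal PyTorch sketch (our illustration, not the released ALBERT code; the vocabulary, embedding, and hidden sizes are illustrative assumptions). A factorised embedding replaces a single V x H lookup table with a smaller V x E table followed by an E x H projection, and a single transformer layer is reused at every depth instead of stacking independently parameterised layers.

```python
# Minimal sketch of ALBERT's two parameter-reduction ideas (illustrative sizes).
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, NUM_LAYERS = 30000, 128, 768, 12

class FactorizedEmbedding(nn.Module):
    """V x E lookup followed by an E x H projection instead of a V x H table."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, embed_dim)   # V * E parameters
        self.project = nn.Linear(embed_dim, hidden_dim)     # E * H parameters

    def forward(self, token_ids):
        return self.project(self.lookup(token_ids))

class SharedLayerEncoder(nn.Module):
    """One transformer layer applied NUM_LAYERS times (cross-layer sharing)."""
    def __init__(self, hidden_dim, num_layers):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=12,
                                                batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.layer(x)   # same weights reused at every depth
        return x

embed = FactorizedEmbedding(VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM)
encoder = SharedLayerEncoder(HIDDEN_DIM, NUM_LAYERS)
tokens = torch.randint(0, VOCAB_SIZE, (2, 16))   # batch of 2 sequences, 16 tokens
hidden_states = encoder(embed(tokens))           # shape: (2, 16, 768)
```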
ALBERT makes three main contributions over the design choices of BERT, as listed below:
- Factorized Embedding Parameterization: The researchers proposed this method to grow the hidden size of the model without significantly increasing the number of embedding parameters.
- Cross-Layer Parameter Sharing: The researchers proposed cross-layer parameter sharing to improve parameter efficiency.
- Inter-Sentence Coherence Loss: The researchers proposed a self-supervised loss focused on inter-sentence coherence to improve the modelling of relationships between sentences; a sketch of how such training pairs can be built follows this list.
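Here is a minimal sketch (our illustration, not ALBERT's actual data pipeline) of how sentence-order prediction examples can be constructed: two consecutive segments from the same document form a positive pair, and the same two segments with their order swapped form a negative pair.

```python
# Illustrative construction of sentence-order prediction (SOP) training pairs.
import random

def make_sop_pairs(segments):
    """Yield ((segment_a, segment_b), label) pairs; label 1 = original order."""
    pairs = []
    for first, second in zip(segments, segments[1:]):
        if random.random() < 0.5:
            pairs.append(((first, second), 1))   # kept in document order
        else:
            pairs.append(((second, first), 0))   # swapped: model must detect this
    return pairs

doc = ["ALBERT shrinks the embedding table.",
       "It also shares parameters across layers.",
       "An SOP loss replaces next-sentence prediction."]
for (a, b), label in make_sop_pairs(doc):
    print(label, "|", a, "|", b)
```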
Other BERT-Inspired Models
RoBERTa
RoBERTa, or Robustly Optimised BERT, is an optimised method for pre-training NLP systems that improves on BERT. The model builds on BERT's language-masking strategy, in which the system learns to predict intentionally hidden sections of text within otherwise unannotated language examples. It delivers state-of-the-art results on the GLUE, RACE, and SQuAD NLU benchmarks. The researchers claimed that RoBERTa provides a large improvement over the originally reported BERT-large as well as XLNet-large results.
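The masking strategy itself is simple to illustrate. Below is a minimal sketch (the 15% masking rate is the commonly cited value, and the helper function is our own, not code from the RoBERTa repository) of hiding random tokens so the model can be trained to recover them; RoBERTa draws a fresh mask each time a sequence is fed to the model rather than fixing it once during preprocessing.

```python
# Illustrative masked-language-modelling data preparation.
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, targets); targets hold the hidden originals."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets.append(tok)        # the model is trained to predict this
        else:
            masked.append(tok)
            targets.append(None)       # no loss computed at this position
    return masked, targets

sentence = "the quick brown fox jumps over the lazy dog".split()
print(mask_tokens(sentence, seed=0))
```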
ViLBERT
Researchers from Georgia Institute of Technology, Facebook AI Research and Oregon State University have developed a model known as ViLBERT, short for Vision-and-Language BERT. It is built to learn task-agnostic joint representations of image content as well as natural language. The model includes two parallel BERT-style streams, one operating over image regions and the other over text segments, which interact through co-attentional transformer layers. The model has outperformed task-specific state-of-the-art models across four tasks: Visual Question Answering (VQA), Visual Commonsense Reasoning (VCR), Grounding Referring Expressions, and Caption-Based Image Retrieval.
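The two-stream design can be sketched in a few lines of PyTorch. The block below is a conceptual simplification (our own, not the released ViLBERT code, and the dimensions are illustrative): each stream runs its own self-attention layer, then text queries attend over image-region keys and values while image queries attend over text, which is the co-attention pattern described above.

```python
# Conceptual two-stream block: parallel text and image encoders plus co-attention.
import torch
import torch.nn as nn

TEXT_DIM, REGION_DIM = 768, 768

class TwoStreamBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.text_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                     batch_first=True)
        self.image_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                      batch_first=True)
        self.text_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, regions):
        text = self.text_layer(text)          # intra-stream self-attention
        regions = self.image_layer(regions)
        # Co-attention: text attends to image regions, and vice versa.
        text_ctx, _ = self.text_to_image(text, regions, regions)
        region_ctx, _ = self.image_to_text(regions, text, text)
        return text + text_ctx, regions + region_ctx

block = TwoStreamBlock()
text = torch.randn(2, 20, TEXT_DIM)        # 20 word-piece token features
regions = torch.randn(2, 36, REGION_DIM)   # 36 detected image-region features
text_out, region_out = block(text, regions)
```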
MT-DNN
In May 2019, researchers at Microsoft presented the Multi-Task Deep Neural Network (MT-DNN) for learning representations across multiple natural language understanding (NLU) tasks. The model builds on Google's BERT and obtains new state-of-the-art results on ten NLU tasks, including SNLI, SciTail, and eight of the nine GLUE tasks. MT-DNN combines multi-task learning with language-model pre-training across four types of NLU tasks: single-sentence classification, pairwise text classification, text similarity scoring, and relevance ranking.
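The multi-task pattern behind MT-DNN, a shared text encoder feeding lightweight task-specific heads, can be sketched as follows. This is our illustration rather than Microsoft's code: the encoder is a stand-in for the shared BERT layers, and the head names and output sizes are assumptions for the four task types listed above.

```python
# Illustrative multi-task setup: shared encoder, task-specific output heads.
import torch
import torch.nn as nn

HIDDEN = 768

class MultiTaskModel(nn.Module):
    def __init__(self, hidden=HIDDEN):
        super().__init__()
        # Stand-in for the shared BERT encoder.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True),
            num_layers=2)
        self.heads = nn.ModuleDict({
            "single_sentence_cls": nn.Linear(hidden, 2),   # e.g. sentiment
            "pairwise_cls":        nn.Linear(hidden, 3),   # e.g. entailment
            "similarity":          nn.Linear(hidden, 1),   # regression score
            "relevance_ranking":   nn.Linear(hidden, 1),   # ranking score
        })

    def forward(self, token_embeddings, task):
        hidden = self.encoder(token_embeddings)
        pooled = hidden[:, 0]              # first ([CLS]-style) position
        return self.heads[task](pooled)    # shared weights, task-specific output

model = MultiTaskModel()
batch = torch.randn(4, 32, HIDDEN)             # 4 sequences of 32 token embeddings
print(model(batch, task="similarity").shape)   # torch.Size([4, 1])
```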