Researchers at Microsoft Dynamics 365 AI and Microsoft Research have introduced a new BERT-style model architecture known as DeBERTa, or Decoding-enhanced BERT with disentangled attention. The researchers report that the new model outperforms Google’s BERT and Facebook’s RoBERTa. A single DeBERTa model with 1.5 billion parameters outperformed the 11-billion-parameter T5 on the SuperGLUE benchmark and surpassed the human baseline.
The introduction of Transformer-based models such as BERT is one of the groundbreaking achievements in natural language processing. A Transformer-based language model (LM) is made up of stacked Transformer blocks. Each block has a multi-head self-attention layer followed by a fully connected, position-wise feed-forward network.
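As a rough illustration of the block structure described above, here is a minimal NumPy sketch of one Transformer block. It is not taken from BERT or DeBERTa code: the parameter names, the ReLU activation (BERT uses GELU), the residual connections and the simplified layer normalisation without learned scale and bias are all assumptions made to keep the example short.

```python
import numpy as np

def softmax(x):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Simplified layer norm (no learned scale or bias)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def multi_head_self_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """x: (n, d_model). Every token attends to every token, once per head."""
    n, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (n, d_model) -> (n_heads, n, d_head)
        return t.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))  # (n_heads, n, n)
    heads = weights @ v                                            # (n_heads, n, d_head)
    merged = heads.transpose(1, 0, 2).reshape(n, d_model)          # concatenate heads
    return merged @ Wo

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise network: the same two layers applied to each token independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2  # ReLU here; an assumption

def transformer_block(x, p, n_heads):
    """Self-attention followed by the feed-forward network, each with a residual."""
    x = layer_norm(x + multi_head_self_attention(x, p["Wq"], p["Wk"], p["Wv"], p["Wo"], n_heads))
    return layer_norm(x + feed_forward(x, p["W1"], p["b1"], p["W2"], p["b2"]))

# Toy sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
n, d_model, d_ff, n_heads = 5, 8, 32, 2
shapes = {
    "Wq": (d_model, d_model), "Wk": (d_model, d_model),
    "Wv": (d_model, d_model), "Wo": (d_model, d_model),
    "W1": (d_model, d_ff), "b1": (d_ff,),
    "W2": (d_ff, d_model), "b2": (d_model,),
}
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}
x = rng.normal(size=(n, d_model))
y = transformer_block(x, params, n_heads)  # same shape as x, so blocks can be stacked
```

Because the output has the same shape as the input, a full model can simply apply such blocks one after another.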
Transformers have become the most effective neural network architecture for neural language modelling. The last two years have seen a rise in the number of large-scale Transformer-based pre-trained language models, such as BERT, GPT, XLNet, RoBERTa, ELECTRA and StructBERT.
Tech Behind DeBERTa
DeBERTa is a new Transformer-based neural language model that proposes a disentangled self-attention mechanism.
DeBERTa includes two new techniques to improve BERT and RoBERTa:
- The first technique is the disentangled attention mechanism. In this mechanism, each word is represented using two vectors that encode its content and position, respectively. The attention weights among words are computed using disentangled matrices on their contents and relative positions. The motivation is that the attention weight of a word pair depends not only on the contents of the two words but also on their relative positions.
- The second technique is an enhanced mask decoder, which incorporates absolute positions in the decoding layer to predict the masked tokens during model pre-training. This technique is meant to address cases in which relative positions alone are not enough to predict a masked token accurately.
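The disentangled attention idea can be sketched in a few lines of NumPy. The example below is a simplified, single-head illustration and not the authors' implementation: the function and variable names are hypothetical, and the three-term score (content-to-content, content-to-position, position-to-content) with a 1/sqrt(3d) scaling follows the form described in the DeBERTa paper as we understand it.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, shifted by the max for numerical stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def relative_index(n, k):
    """delta[i, j]: the distance i - j shifted by k and clipped into [0, 2k - 1]."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.clip(i - j + k, 0, 2 * k - 1)

def disentangled_attention(H, P, Wq_c, Wk_c, Wq_r, Wk_r, k):
    """Attention weights from separate content and relative-position vectors.

    H: (n, d_model) content vectors, one per token.
    P: (2k, d_model) relative-position embeddings shared by all tokens.
    """
    n = H.shape[0]
    d = Wq_c.shape[1]
    Qc, Kc = H @ Wq_c, H @ Wk_c    # content queries and keys, (n, d)
    Qr, Kr = P @ Wq_r, P @ Wk_r    # position queries and keys, (2k, d)
    delta = relative_index(n, k)   # (n, n)

    c2c = Qc @ Kc.T                                        # content-to-content
    c2p = np.take_along_axis(Qc @ Kr.T, delta, axis=1)     # Qc[i] . Kr[delta[i, j]]
    p2c = np.take_along_axis(Kc @ Qr.T, delta, axis=1).T   # Kc[j] . Qr[delta[j, i]]
    return softmax((c2c + c2p + p2c) / np.sqrt(3 * d))

# Toy sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
n, d_model, d, k = 6, 8, 4, 3
H = rng.normal(size=(n, d_model))
P = rng.normal(size=(2 * k, d_model))
Wq_c, Wk_c, Wq_r, Wk_r = (rng.normal(size=(d_model, d)) for _ in range(4))
weights = disentangled_attention(H, P, Wq_c, Wk_c, Wq_r, Wk_r, k)
print(weights.shape)  # (6, 6)
```

Note that position enters only through the relative distance between two tokens, which is why a separate step is needed to bring absolute positions back in before the masked tokens are predicted.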
According to the researchers, the new techniques significantly improve the efficiency of model pre-training and the performance on both natural language understanding (NLU) and natural language generation (NLG) tasks.
How Is It Different
In the popular BERT model, each word in the input layer is represented by a single vector that is the sum of its word embedding and its position embedding. In DeBERTa, every word is instead represented by two vectors that encode its content and position, respectively. The attention weights among words are then computed using disentangled matrices based on their contents and relative positions.
The researchers stated, “As an extension to the disentangled attention, we improve the output layer of the BERT model for pre-training such that to address the limitation of relative positions. We observe that in some situations, it is challenging for the relative positions only mechanism to accurately predict the masking tokens.”
“DeBERTa includes absolute word position embeddings in the softmax layer where the model has the ability to decode the masked words based on the aggregated contextual embeddings of word contents and positions,” the researchers added.
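To make the idea in the quote concrete, here is a heavily simplified sketch of decoding masked tokens with absolute positions added just before the softmax layer. Everything in it is an assumption made for illustration: the function name is hypothetical, the absolute position embedding is simply added to the final hidden state, and the output projection reuses a word embedding table. The actual DeBERTa decoder may combine these signals differently.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, shifted by the max for numerical stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decode_masked_tokens(hidden, abs_pos_emb, word_emb, masked_positions):
    """Return a vocabulary distribution for each masked position.

    hidden:      (seq_len, d_model) contextual vectors from the last block.
    abs_pos_emb: (seq_len, d_model) absolute position embeddings.
    word_emb:    (vocab_size, d_model) table used as the output projection.
    """
    # Assumption: absolute positions are injected by simple addition.
    h = hidden[masked_positions] + abs_pos_emb[masked_positions]
    return softmax(h @ word_emb.T)

# Toy sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
seq_len, d_model, vocab_size = 6, 8, 20
hidden = rng.normal(size=(seq_len, d_model))
hidden[4] = hidden[1]  # two masked positions whose contextual vectors are identical
abs_pos_emb = rng.normal(size=(seq_len, d_model))
word_emb = rng.normal(size=(vocab_size, d_model))
masked = np.array([1, 4])

probs = decode_masked_tokens(hidden, abs_pos_emb, word_emb, masked)
```

In this toy setup, positions 1 and 4 carry the same contextual vector, so without the absolute position term they would receive identical predictions. Adding the position embeddings is what lets the decoder tell them apart.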
The new model can also bring additional information about a sentence, besides positions, into pre-training. DeBERTa is also presented as the first language model to propose the disentangled attention mechanism.
Read the paper here.