Microsoft’s New BERT Model Surpasses Human Performance on SuperGLUE Benchmark

Researchers at Microsoft Dynamics 365 AI and Microsoft Research have introduced a new BERT model architecture known as DeBERTa (Decoding-enhanced BERT with disentangled attention). The new model is claimed to improve on Google’s BERT and Facebook’s RoBERTa models. A single 1.5B-parameter DeBERTa model outperformed T5 with 11 billion parameters on the SuperGLUE benchmark and surpassed the human baseline.

The introduction of Transformer-based models such as BERT is one of the many groundbreaking achievements in the natural language processing field. A Transformer-based language model (LM) is made up of stacked Transformer blocks. Each block has a multi-head self-attention layer followed by a fully connected position-wise feed-forward network.
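The block structure described above can be sketched in miniature. This is a toy, single-head, pure-Python illustration with identity projections and no layer normalization; the dimensions and simplifications are assumptions for clarity, not details from the paper.

```python
# Minimal sketch of one Transformer block: a self-attention sub-layer
# followed by a position-wise feed-forward sub-layer, each with a
# residual connection. Single head, identity Q/K/V projections, and no
# layer norm -- toy simplifications, not the real architecture.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(X):
    # X: list of d-dimensional token vectors. A real block learns
    # separate Q, K, V weight matrices; here Q = K = V = X for brevity.
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        out.append([sum(wj * v[i] for wj, v in zip(w, X)) for i in range(d)])
    return out

def feed_forward(X):
    # Position-wise FFN, reduced here to a per-token ReLU.
    return [[max(0.0, v) for v in x] for x in X]

def transformer_block(X):
    # Residual connection around each sub-layer.
    a = [[xi + hi for xi, hi in zip(x, h)] for x, h in zip(X, self_attention(X))]
    return [[ai + fi for ai, fi in zip(row, f)] for row, f in zip(a, feed_forward(a))]
```

Stacking many such blocks, with learned projections and multiple heads, yields the LMs discussed below.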

Transformers have become the most effective neural network architecture for neural language modelling. The last two years have seen a rise in the number of large-scale Transformer-based pre-trained language models such as BERT, GPT, XLNet, RoBERTa, ELECTRA, StructBERT, and more.

Tech Behind DeBERTa 

DeBERTa is a new Transformer-based neural language model that proposes a disentangled self-attention mechanism.

DeBERTa includes two new techniques to improve BERT and RoBERTa:

  • The first technique is the disentangled attention mechanism, in which each word is represented by two vectors that encode its content and its position, respectively. The attention weights between words are then computed from disentangled matrices over their contents and relative positions. Because this mechanism operates on relative positions only, it does not by itself capture the absolute positions of words.
  • The second technique is an enhanced mask decoder, which incorporates absolute positions into the decoding layer when predicting masked tokens during model pre-training. This technique is also meant to enable generation tasks and a multi-task learning objective.
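The first technique decomposes the attention score between two words into content-to-content, content-to-position, and position-to-content terms. The sketch below illustrates that three-term decomposition for a single token pair; the learned projection matrices are omitted and all vector values are toy assumptions.

```python
# Toy sketch of a disentangled attention score for one token pair.
# The score is a sum of three terms over content vectors and
# relative-position vectors; projection matrices and scaling are
# omitted for brevity (assumptions, not the paper's exact form).
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def disentangled_score(q_content, k_content, q_rel_pos, k_rel_pos):
    c2c = dot(q_content, k_content)   # content-to-content
    c2p = dot(q_content, k_rel_pos)   # content-to-position
    p2c = dot(k_content, q_rel_pos)   # position-to-content
    return c2c + c2p + p2c
```

In standard BERT-style attention only the first (content-to-content) term exists, since content and position are summed into one vector before attention is computed.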

According to the researchers, the new techniques can significantly enhance the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) tasks. 

How Is It Different

In the popular BERT model, each word in the input layer is represented by a single vector, the sum of its word embedding and its position embedding. In DeBERTa, by contrast, every word is represented by two vectors that encode its content and its position, respectively, and the attention weights between words are computed from disentangled matrices based on their contents and relative positions.

The researchers stated, “As an extension to the disentangled attention, we improve the output layer of the BERT model for pre-training such that to address the limitation of relative positions. We observe that in some situations, it is challenging for the relative positions only mechanism to accurately predict the masking tokens.”

“DeBERTa includes absolute word position embeddings in the softmax layer where the model has the ability to decode the masked words based on the aggregated contextual embeddings of word contents and positions,” the researchers added.
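The enhanced-mask-decoder idea the researchers describe can be sketched as follows: the absolute position embedding is combined with the aggregated contextual embedding just before the output softmax, so the model can disambiguate masked words whose relative-position contexts look alike. All shapes, values, and the simple additive combination here are illustrative assumptions.

```python
# Hedged sketch of decoding a masked token with absolute positions
# injected before the softmax layer. Toy 2-d vectors and a 2-word
# vocabulary; the additive combination is an assumption for clarity.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def decode_masked_token(hidden, abs_pos_emb, vocab_emb):
    # Combine the contextual hidden state with the absolute position
    # embedding, then score the result against vocabulary embeddings.
    h = [hi + pi for hi, pi in zip(hidden, abs_pos_emb)]
    logits = [sum(hi * vi for hi, vi in zip(h, v)) for v in vocab_emb]
    return softmax(logits)
```

Without the `abs_pos_emb` term, two masked positions with identical contextual states would always receive identical predictions, which is the failure mode the quoted passage describes.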

Wrapping Up

The new BERT model introduces additional information about a sentence, beyond word positions, into pre-training. DeBERTa is also the first language model to propose a disentangled attention mechanism.

Read the paper here.

Ambika Choudhury