
Top 8 Baselines For NLP Models


Natural language ability in machines has, so far, proven elusive. Over the last couple of years, however, at least since the advent of Google’s BERT model, there has been tremendous innovation in this space. With NVIDIA and Microsoft releasing mega models with billions of parameters, it is safe to say that we are at the cusp of a major breakthrough.

Here we list the top NLP models, ranked by their GLUE benchmark scores:

StructBERT By Alibaba

Score: 90.3

StructBERT, with its structural pre-training, gives surprisingly good empirical results on a variety of downstream tasks, pushing the state of the art on the GLUE benchmark to 89.0 (outperforming all published models), the F1 score on SQuAD v1.1 question answering to 93.0, and the accuracy on SNLI to 91.7.

T5 By Google

Score: 90.3

This model by Google demonstrated how to achieve state-of-the-art results on multiple NLP tasks using a text-to-text transformer pre-trained on a large text corpus.

A model is first pre-trained on a data-rich task before being fine-tuned on a downstream task; this application of transfer learning has emerged as a powerful technique in natural language processing (NLP). With T5, the landscape of transfer learning techniques for NLP was explored and a unified framework was introduced that converts every language problem into a text-to-text format.
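
To make the text-to-text framing concrete, here is a minimal sketch using the Hugging Face transformers library and the public t5-small checkpoint; the library, checkpoint and task prefixes used below are illustrative assumptions rather than a description of the original T5 release. Every task, from translation to grammatical acceptability, is posed as text in, text out.

```python
# Minimal sketch: T5 casts every task as text-to-text.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Each task is expressed as plain text with a task prefix;
# the model's answer is also plain text.
examples = [
    "translate English to German: The house is wonderful.",
    "cola sentence: The books was on the table.",  # grammatical acceptability
    "stsb sentence1: A man is playing a guitar. sentence2: A person plays guitar.",
]

for text in examples:
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=40)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```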

ERNIE By Baidu

Score: 90.1

Baidu’s ERNIE 2.0 is a framework for language understanding in which pre-training tasks can be incrementally built and learned through multi-task learning. 

According to the experimental results, ERNIE 2.0 comprehensively outperforms BERT on all nine Chinese datasets tested, achieving the best performance and setting new state-of-the-art results on these Chinese NLP tasks.

MT-DNN By Microsoft

Score: 89.9

Multi-Task Deep Neural Network or MT-DNN leverages large amounts of cross-task data while benefiting from a regularisation effect that leads to more general representations. This helps the model to adapt to new tasks. MT-DNN incorporates a pre-trained BERT and has obtained new state-of-the-art results on ten NLU tasks, including SNLI, SciTail, and eight out of nine GLUE tasks, pushing the GLUE benchmark to 82.7%.
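
The multi-task pattern MT-DNN relies on, a single shared encoder feeding lightweight task-specific heads, can be sketched as follows; the class, task names and label counts are illustrative and this is not the official MT-DNN code.

```python
# Sketch of a shared encoder with per-task classification heads (MT-DNN-style).
import torch.nn as nn
from transformers import BertModel

class MultiTaskModel(nn.Module):
    def __init__(self, task_num_labels, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)  # shared across all tasks
        hidden = self.encoder.config.hidden_size
        # One small head per task, e.g. {"mnli": 3, "sst2": 2, "scitail": 2}.
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden, n) for task, n in task_num_labels.items()}
        )

    def forward(self, task, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.pooler_output        # [CLS]-based sentence representation
        return self.heads[task](pooled)   # logits for the requested task

# Training alternates mini-batches drawn from different tasks, so the shared
# encoder benefits from a regularising mix of supervision signals.
```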

ELECTRA Large

Score: 89.4

In this approach, introduced in the ELECTRA paper, instead of training a model that predicts the original identities of corrupted tokens, the researchers trained a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. This new pre-training task is more efficient than MLM because it is defined over all input tokens rather than just the small subset that was masked. The contextual representations learned by this approach outperformed those learned by BERT given the same model size, data and compute. The gains are particularly strong for small models; for example, the researchers trained a model on one GPU for four days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark.
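
The replaced-token-detection objective can be sketched roughly as below; this is a simplified illustration of the idea, not the official ELECTRA training code, and the tensor shapes and helper name are assumptions.

```python
# Sketch of ELECTRA-style replaced-token detection: for every position the
# discriminator predicts whether the token is the original or a generator sample.
import torch
import torch.nn.functional as F

def replaced_token_detection_loss(disc_logits, corrupted_ids, original_ids, attention_mask):
    # disc_logits: (batch, seq) raw scores from the discriminator.
    # Label 1 wherever the generator's sample differs from the original token.
    labels = (corrupted_ids != original_ids).float()
    per_token = F.binary_cross_entropy_with_logits(disc_logits, labels, reduction="none")
    mask = attention_mask.float()
    # Every real position contributes, which is why the signal is denser
    # than MLM's ~15% of masked positions.
    return (per_token * mask).sum() / mask.sum()
```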

FreeLB RoBERTa By Microsoft

Score: 88.4

FreeLB adds adversarial perturbations to word embeddings to promote higher invariance and minimise the resultant adversarial risk inside different regions around input samples.

Experiments on the GLUE benchmark show that it improves the overall test score of the BERT-base model from 78.3 to 79.4, and that of the RoBERTa-large model from 88.5 to 88.8.
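
A simplified sketch of the FreeLB-style inner loop, gradient ascent on a perturbation added to the word embeddings accumulated over a few steps before the optimiser update, might look like the following; the step sizes, the projection and the model interface are illustrative assumptions rather than the authors’ exact recipe.

```python
# Sketch of FreeLB-style adversarial training on word embeddings.
import torch

def freelb_inner_loop(model, embeds, attention_mask, labels, K=3, alpha=1e-2, eps=1e-2):
    delta = torch.zeros_like(embeds, requires_grad=True)  # perturbation on the embeddings
    for _ in range(K):
        loss = model(inputs_embeds=embeds + delta,
                     attention_mask=attention_mask,
                     labels=labels).loss / K
        loss.backward()                        # also accumulates gradients in the model
        with torch.no_grad():
            g = delta.grad
            delta += alpha * g / (g.norm() + 1e-12)  # ascent step on the perturbation
            delta.clamp_(-eps, eps)                  # keep the perturbation small
        delta.grad.zero_()
    # The optimiser step outside this function uses the gradients averaged over K steps.
```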

HIRE-RoBERTa By Junjie Yang and Hai Zhao

Score: 88.3

In this approach, the authors, Junjie Yang and Hai Zhao, argue that relying on the final layer’s output alone limits the effectiveness of the pre-trained representation. So, they deepen the representation the model learns by fusing hidden states from intermediate layers with an explicit HIdden Representation Extractor (HIRE), which automatically absorbs the representation that complements the final layer’s output. Utilising RoBERTa as the backbone encoder, the proposed improvement is shown to be effective on multiple natural language understanding tasks and helps the model rival the state-of-the-art models on the GLUE benchmark.
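
A toy version of the layer-fusion idea might look like the sketch below; the learned weighted sum over all encoder layers is an illustrative simplification, not the paper’s exact extractor.

```python
# Sketch: fuse hidden states from every encoder layer with the final-layer output.
import torch
import torch.nn as nn

class HiddenRepresentationFusion(nn.Module):
    def __init__(self, num_layers, hidden_size):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # one weight per layer
        self.proj = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, all_hidden_states):
        # all_hidden_states: tuple of num_layers tensors, each (batch, seq, hidden),
        # e.g. from a transformers model called with output_hidden_states=True.
        stacked = torch.stack(all_hidden_states, dim=0)             # (L, B, S, H)
        weights = torch.softmax(self.layer_weights, dim=0)
        pooled = (weights[:, None, None, None] * stacked).sum(0)    # weighted mix of layers
        final = all_hidden_states[-1]                               # last-layer output
        return self.proj(torch.cat([final, pooled], dim=-1))        # fused representation
```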

RoBERTa By Facebook AI

Score: 88.1

RoBERTa iterates on BERT’s pretraining procedure by training the model longer, with bigger batches over more data; removing the next-sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data. RoBERTa has gained great popularity since its introduction, as can be seen from the variants of it used in several of the models listed above.
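
The dynamic-masking idea can be sketched as follows; this simplified version draws a fresh mask every time a sequence is sampled and omits BERT’s 80/10/10 replacement rule, so it is meant only to illustrate the contrast with a mask fixed once at preprocessing time.

```python
# Sketch of dynamic masking: a new random 15% of tokens is masked on every call.
import torch

def dynamic_mask(input_ids, mask_token_id, mask_prob=0.15, special_ids=()):
    labels = input_ids.clone()
    probs = torch.full(input_ids.shape, mask_prob)
    for sid in special_ids:
        probs[input_ids == sid] = 0.0          # never mask [CLS], [SEP], padding, etc.
    masked = torch.bernoulli(probs).bool()     # fresh mask each time the batch is built
    labels[~masked] = -100                     # only masked positions count towards the MLM loss
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id
    return corrupted, labels
```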

The General Language Understanding Evaluation (GLUE) benchmark was introduced by researchers at NYU, the University of Washington, and DeepMind as a collection of tools for evaluating the performance of models across a variety of NLU tasks. The toolkit also includes a hand-crafted diagnostic test suite that enables detailed linguistic analysis of models.

The GLUE benchmark is designed to be model-agnostic, and its tasks are selected to favour models that share information across tasks using parameter sharing or other transfer learning techniques.
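
For readers who want to inspect the tasks themselves, the GLUE datasets can be pulled with the Hugging Face datasets library; this access path is an assumption made here for convenience, while official scores come from submitting test-set predictions to the GLUE server.

```python
# Sketch: load and inspect one of the nine GLUE tasks.
from datasets import load_dataset

mnli = load_dataset("glue", "mnli")
print(mnli["train"][0])                        # premise, hypothesis, label, idx
print(mnli["train"].features["label"].names)   # ['entailment', 'neutral', 'contradiction']
```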

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.