The last few years have witnessed wider adoption of the Transformer architecture in natural language processing (NLP) and natural language understanding (NLU). Bidirectional Encoder Representations from Transformers, or BERT, set new benchmarks for NLP when Google AI introduced it in 2018, and it has since paved the way for newer and enhanced models.
Here is a compilation of the top ten alternatives to the popular language model BERT for natural language understanding (NLU) projects.
1| GPT-2 and GPT-3 by OpenAI
In 2019, OpenAI rolled out GPT-2, a transformer-based language model with 1.5 billion parameters, trained on 8 million web pages. The model comes armed with a broad set of capabilities, including the ability to generate conditional synthetic text samples of good quality.
OpenAI launched GPT-3 as the successor to GPT-2 in 2020. GPT-3 is an autoregressive language model with 175 billion parameters, ten times more than any previous non-sparse language model. Equipped with few-shot learning capability, the model can generate human-like text and even write code from minimal text prompts.
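Few-shot learning here means conditioning the model on a handful of worked examples inside the prompt itself, with no gradient updates. A toy sketch of how such a prompt might be assembled (the sentiment task and the examples are invented for illustration):

```python
# Toy sketch of few-shot prompting: labelled examples are concatenated into
# the prompt, and the model is asked to continue the final, unlabelled entry.
def build_few_shot_prompt(examples, query):
    """Concatenate labelled examples followed by an unlabelled query."""
    lines = [f"Input: {text}\nSentiment: {label}" for text, label in examples]
    lines.append(f"Input: {query}\nSentiment:")
    return "\n\n".join(lines)

examples = [
    ("The film was a delight.", "positive"),
    ("I want my money back.", "negative"),
]
prompt = build_few_shot_prompt(examples, "An instant classic.")
print(prompt)
```

The model would then be expected to complete the final line with a label, having inferred the task purely from the in-prompt examples.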
2| XLNet by Carnegie Mellon University
XLNet is a generalised autoregressive pretraining method that learns bidirectional contexts by maximising the expected likelihood over all permutations of the factorisation order. XLNet uses Transformer-XL and performs well on language tasks involving long context. Thanks to its autoregressive formulation, the model outperforms BERT on 20 tasks, including sentiment analysis, question answering, document ranking and natural language inference.
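The permutation idea can be illustrated without any model at all: an n-token sequence has n! factorisation orders, and under each order a token is predicted from the tokens that precede it in that order. A toy sketch:

```python
from itertools import permutations

# Toy sketch of XLNet's permutation objective: each factorisation order of a
# 3-token sequence defines which tokens are visible when predicting a target.
tokens = ["New", "York", "City"]
orders = list(permutations(range(len(tokens))))  # 3! = 6 factorisation orders

# Under one sampled order, the token at position order[i] is predicted from
# the tokens at positions order[:i] (its context under that order).
order = orders[1]
for i, pos in enumerate(order):
    context = [tokens[p] for p in order[:i]]
    print(f"predict {tokens[pos]!r} given {context}")
```

Averaging the likelihood over such orders is what lets an autoregressive model see context from both directions.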
3| RoBERTa by Facebook
Developed by Facebook, RoBERTa, or a Robustly Optimised BERT Pretraining Approach, is an optimised method for pretraining self-supervised NLP systems. The model builds on BERT's masked language modelling strategy, in which RoBERTa learns to predict intentionally hidden sections of text within otherwise unannotated language examples. It also modifies key design choices in BERT, including removing BERT's next-sentence pretraining objective and training with much larger mini-batches and learning rates.
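A minimal sketch of the masking step, assuming the standard 15% masking rate (tokenisation and the 80/10/10 replacement rule are omitted for brevity):

```python
import random

MASK = "[MASK]"

def dynamic_mask(tokens, mask_prob=0.15, rng=None):
    """Return a freshly masked copy of the token list. RoBERTa regenerates
    the masking pattern on every pass over the data instead of fixing it
    once at preprocessing time."""
    rng = rng or random.Random()
    out = list(tokens)
    for i in range(len(out)):
        if rng.random() < mask_prob:
            out[i] = MASK
    return out

tokens = "the quick brown fox jumps over the lazy dog".split()
print(dynamic_mask(tokens, rng=random.Random(0)))
print(dynamic_mask(tokens, rng=random.Random(1)))  # a different pattern
```

The model's pretraining objective is then to recover the original tokens at the masked positions.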
4| ALBERT by Google
ALBERT, or A Lite BERT for Self-Supervised Learning of Language Representations, is an enhanced model of BERT introduced by Google AI researchers. The model incorporates two parameter-reduction techniques to overcome major obstacles in scaling pre-trained models, giving it significantly fewer parameters than a traditional BERT architecture. According to its developers, the success of ALBERT demonstrates the importance of identifying the aspects of a model that give rise to powerful contextual representations.
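One of the two techniques, factorised embedding parameterisation, is easy to quantify: instead of a V × H embedding table, ALBERT uses a V × E table plus an E × H projection, with E much smaller than H. A back-of-the-envelope comparison with illustrative sizes:

```python
# Factorised embedding parameterisation (one of ALBERT's two parameter-
# reduction techniques). Sizes below are illustrative, roughly BERT-base-like.
V = 30_000  # vocabulary size
H = 768     # hidden size
E = 128     # small embedding size used by ALBERT

bert_style = V * H             # single V x H embedding table
albert_style = V * E + E * H   # V x E table plus E x H projection

print(bert_style, albert_style)  # 23040000 vs 3938304 embedding parameters
```

With these sizes the embedding parameters shrink by roughly a factor of six; the second technique, cross-layer parameter sharing, reduces the rest of the network similarly.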
5| DistilBERT by Hugging Face
DistilBERT is a distilled version of BERT: a general-purpose pre-trained model that is 40% smaller and 60% faster, yet retains 97% of BERT's language understanding capabilities.
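The distillation idea behind DistilBERT is to train the small student model to match the large teacher's softened output distribution rather than only the hard labels. A toy sketch of temperature-softened softmax (the logits are invented for illustration):

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher temperature flattens the distribution."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A teacher's raw output scores over three classes (illustrative values).
teacher_logits = [4.0, 1.5, 0.5]

hard = softmax(teacher_logits, temperature=1.0)  # near one-hot
soft = softmax(teacher_logits, temperature=4.0)  # flatter "dark knowledge"
print(hard)
print(soft)
```

The softened targets expose how the teacher ranks the wrong classes too, which is part of what lets the student recover most of the teacher's capability at a fraction of the size.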
6| StructBERT by Alibaba
Developed by researchers at Alibaba, StructBERT is an extended version of the traditional BERT model. In addition to the existing masking strategy, StructBERT incorporates language structures into BERT pre-training through two linearisation strategies that leverage structural information, such as word-level and sentence-level ordering. According to its developers, StructBERT advances the state-of-the-art results on a variety of NLU tasks, including the GLUE benchmark, the SNLI dataset and the SQuAD v1.1 question answering task.
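The word-level structural objective can be sketched as shuffling a short span of tokens and asking the model to reconstruct the original order. The sketch below only produces the corrupted input and its target; the prediction model itself is omitted:

```python
import random

def shuffle_span(tokens, start, length, rng):
    """Shuffle a short span of tokens in place, as a word-ordering corruption.
    The span length is a free choice here; the training target is the
    original, uncorrupted sequence."""
    out = list(tokens)
    span = out[start:start + length]
    rng.shuffle(span)
    out[start:start + length] = span
    return out

tokens = "language structure matters for pre-training".split()
corrupted = shuffle_span(tokens, start=1, length=3, rng=random.Random(3))
print(corrupted, "-> target:", tokens)
```

The sentence-level counterpart works analogously, asking the model to recognise whether two sentences appear in their original order.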
7| DeBERTa by Microsoft
DeBERTa, or Decoding-enhanced BERT with Disentangled Attention, is a Transformer-based neural language model that improves on the BERT and RoBERTa models using two novel techniques: a disentangled attention mechanism and an enhanced mask decoder. Like BERT, DeBERTa is pre-trained using masked language modelling (MLM).
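The disentangled attention score can be sketched as a sum of three dot products between content vectors and relative-position vectors; the real model applies learned projection matrices to each term first, which this toy version omits:

```python
import random

rng = random.Random(0)
d = 4  # toy hidden size

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def vec():
    return [rng.uniform(-1, 1) for _ in range(d)]

# Disentangled attention (sketch): each token carries a content vector and a
# relative-position vector, and the score between tokens i and j sums three
# terms instead of the single content-only dot product of standard attention.
Hc_i, Hc_j = vec(), vec()    # content vectors for tokens i and j
Pr_ij, Pr_ji = vec(), vec()  # relative-position vectors for the (i, j) pair

score = (
    dot(Hc_i, Hc_j)     # content-to-content (standard attention)
    + dot(Hc_i, Pr_ij)  # content-to-position
    + dot(Pr_ji, Hc_j)  # position-to-content
)
print(score)
```

Keeping content and position representations separate is what "disentangled" refers to; the enhanced mask decoder then reintroduces absolute positions just before the MLM prediction layer.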
8| Text-to-Text Transfer Transformer (T5) by Google
Text-to-Text Transfer Transformer (T5) is a unified framework that converts all text-based language problems into a text-to-text format. In contrast to BERT-style models that can only output either a class label or a span of the input, T5 reframes all NLP tasks into a unified text-to-text format where the input and output are always text strings. The text-to-text framework allows the use of the same model, loss function, and hyperparameters on any NLP task, including machine translation, document summarisation, question answering as well as classification tasks.
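The framing is simple enough to show as data: each task becomes a plain input string with a task prefix, and the target is also a plain string (even a class label like "entailment"). The prefixes below follow the style of examples given by the T5 authors:

```python
# T5 casts every task as string -> string by prefixing the input with a short
# task description; one model and one loss then handle all of them.
tasks = {
    "translation": "translate English to German: That is good.",
    "summarisation": "summarize: state authorities dispatched emergency crews "
                     "to survey the damage after the storm.",
    "classification": "mnli premise: I hate pigeons. "
                      "hypothesis: My feelings towards pigeons are filled with animosity.",
}

for name, model_input in tasks.items():
    print(f"{name}: {model_input!r}")  # the training target is also plain text
```

Because inputs and outputs are uniformly strings, adding a new task requires only choosing a prefix, not a new output head.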
9| UniLM by Microsoft
Developed by Microsoft, UniLM, or Unified Language Model, is pre-trained using three types of language modelling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. The unified modelling is achieved by employing a shared Transformer network and utilising specific self-attention masks to control what context the prediction conditions on. The model can be fine-tuned for both natural language understanding and generation tasks, and achieved state-of-the-art results on five natural language generation datasets, including improving the ROUGE-L score for CNN/DailyMail abstractive summarisation.
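The three pre-training tasks differ only in the self-attention mask applied to the shared Transformer. A toy sketch of the three mask patterns (1 means position j is visible when encoding position i, 0 means blocked):

```python
def make_mask(kind, src_len, tgt_len=0):
    """Build a (src_len + tgt_len) square visibility mask for one of UniLM's
    three modes. The seq2seq layout (source segment followed by target
    segment) is a simplification of the paper's segment-based masking."""
    n = src_len + tgt_len
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if kind == "bidirectional":
                mask[i][j] = 1                    # every token sees every token
            elif kind == "unidirectional":
                mask[i][j] = int(j <= i)          # left-to-right (causal)
            elif kind == "seq2seq":
                # Source tokens see the whole source; target tokens see the
                # whole source plus earlier target tokens (causal on target).
                mask[i][j] = int(j < src_len) if i < src_len else int(j <= i)
    return mask

for row in make_mask("seq2seq", src_len=2, tgt_len=2):
    print(row)
```

Because only the mask changes between objectives, the same parameters serve understanding tasks (bidirectional) and generation tasks (unidirectional and seq2seq).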
10| Reformer by Google
Introduced by Google AI researchers, Reformer is a Transformer model designed to handle context windows of up to one million words on a single accelerator, using only 16GB of memory. It combines two fundamental techniques to solve the attention and memory-allocation problems that limit the application of Transformers to long context windows.
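One of the two techniques is reversible residual layers: activations do not need to be stored for backpropagation, because each layer's input can be recomputed exactly from its output. A toy sketch with arbitrary scalar functions standing in for the attention and feed-forward sublayers:

```python
# Reversible residual layer (sketch). F and G are arbitrary stand-ins for
# the attention and feed-forward sublayers; any functions work, since the
# inverse only ever evaluates them forwards.
def F(x):
    return 0.5 * x + 1.0

def G(x):
    return x * x

def rev_forward(x1, x2):
    """Forward pass: the input pair (x1, x2) maps to the output pair (y1, y2)."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2):
    """Recompute the inputs from the outputs, so activations need not be stored."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

y1, y2 = rev_forward(3.0, 4.0)
print(rev_inverse(y1, y2))  # recovers the original (3.0, 4.0)
```

The other technique, locality-sensitive-hashing attention, reduces the cost of attention itself by only comparing queries against nearby keys in hash space.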