In Part 1, we discussed how Transformers form the crux of NLP. So let’s take a look at autoregressive and sequence-to-sequence models.
Autoregressive models are trained on the language modeling task, predicting the next word based on the context. They function as the decoder part of the original transformer model, using a mask to only consider previous words during attention. While these models can be fine-tuned for various tasks, their primary use is text generation. Let’s take a look at some of them.
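The causal mask described above can be sketched in a few lines. This is an illustrative toy (the function name `build_causal_mask` is ours, not from any library): position i is allowed to attend only to positions up to and including i, so future words are invisible during training.

```python
# Sketch of the causal (look-ahead) mask used by autoregressive decoders.
# 1 = may attend, 0 = blocked. Real models add this as -inf before softmax.

def build_causal_mask(seq_len):
    """Return a seq_len x seq_len lower-triangular 0/1 matrix."""
    return [[1 if j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]

for row in build_causal_mask(4):
    print(row)
```

The first token sees only itself; each later token sees every earlier one, which is exactly what lets the model be trained to predict the next word from context.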
In June 2018, OpenAI released GPT, the first pretrained Transformer model, which was fine-tuned on various NLP tasks and obtained state-of-the-art results. By pretraining a language model on diverse, unlabeled text and applying task-aware input transformations during fine-tuning, GPT achieved significant improvements across many tasks. It outperformed task-specific models, with gains such as 8.9% on commonsense reasoning, 5.7% on question answering, and 1.5% on textual entailment. This demonstrated the potential of unsupervised pre-training for discriminative tasks and offered insights into how effectively Transformers handle data with long-range dependencies.
OpenAI then introduced GPT-2, an upgraded version of GPT pretrained on a large dataset called WebText. Given a document and questions, it performs well on the CoQA dataset without needing many training examples, and scaling the language model up further improves its performance across tasks. GPT-2 outperformed other models on several language modeling datasets and produces more coherent text samples.
In 2019, Salesforce introduced CTRL, a powerful 1.63 billion-parameter language model. CTRL used control codes that dictate style, content, and task-specific behavior. These codes are derived from natural text structure, enabling precise text generation while retaining unsupervised learning advantages. CTRL can also predict the most probable parts of training data for a sequence, offering a way to analyze vast data through model-based source attribution.
Carnegie Mellon University and Google worked on Transformer-XL for learning long-range dependencies without disrupting temporal coherence. It incorporates segment-level recurrence and a novel positional encoding scheme, enabling it to capture dependencies 80% longer than RNNs and 450% longer than traditional Transformers. This leads to improved performance on both short and long sequences and achieves remarkable speedup during evaluation, up to 1,800+ times faster than vanilla Transformers.
Also from Google Research, the Transformer-based Reformer is designed with new methods to reduce memory usage and computation time. These tricks include axial positional encoding, which handles long sequences without a huge positional encoding matrix, and LSH (locality-sensitive hashing) attention in place of full attention, which saves computation in the attention layers. Intermediate results are not stored for each layer; thanks to reversible transformer layers, they are recomputed as needed during the backward pass, saving memory. Additionally, feedforward operations are computed in smaller chunks instead of over the entire batch. As a result, this model can handle much longer sequences than traditional autoregressive transformers.
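The chunked feedforward trick is the easiest of these to illustrate. Below is a hedged toy sketch, not Reformer's actual implementation: `feed_forward` is a stand-in for the real position-wise layer, and the point is only that processing the sequence chunk by chunk yields the same result while keeping just one chunk's activations alive at a time.

```python
# Toy sketch of Reformer-style chunked feed-forward computation.
# Since the feed-forward layer acts on each position independently,
# the sequence can be processed in chunks without changing the output.

def feed_forward(vec):
    # Stand-in for the real position-wise feed-forward sublayer.
    return [max(0.0, x) * 2.0 for x in vec]

def chunked_feed_forward(sequence, chunk_size):
    out = []
    for start in range(0, len(sequence), chunk_size):
        chunk = sequence[start:start + chunk_size]
        out.extend(feed_forward(v) for v in chunk)  # one chunk in memory
    return out

seq = [[1.0, -1.0], [0.5, 2.0], [-0.5, 3.0]]
assert chunked_feed_forward(seq, 2) == [feed_forward(v) for v in seq]
```

Because the results are identical for any chunk size, the chunk size becomes a pure memory/speed knob.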
Google’s XLNet is trained with permutation language modeling: it samples a permutation of the sentence’s tokens and predicts each token from the tokens that precede it in that permutation. This is achieved through attention masks rather than actually reordering the input, so the model still sees the correct sequence positions. Moreover, XLNet employs Transformer-XL’s recurrence mechanism to establish connections between distant tokens for long-term dependencies. The library includes various versions of the model suitable for language modeling, token classification, sentence classification, multiple choice classification, and question answering tasks.
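To make the permutation idea concrete, here is a toy sketch (our own illustrative code, not XLNet's): a random factorization order is sampled, and each position is then predicted from the set of positions that come earlier in that order, regardless of left-to-right position.

```python
# Toy sketch of XLNet-style permutation language modeling.
# Each position's prediction context is the set of positions that
# precede it in a randomly sampled factorization order.
import random

def permutation_contexts(tokens, seed=0):
    """Return (position, visible_positions) pairs in prediction order."""
    order = list(range(len(tokens)))
    random.Random(seed).shuffle(order)      # sample a factorization order
    return [(pos, sorted(order[:i])) for i, pos in enumerate(order)]

for pos, visible in permutation_contexts(["The", "cat", "sat", "down"]):
    print(f"predict position {pos} from positions {visible}")
```

Averaged over many sampled orders, every token is eventually predicted from contexts on both its left and its right, which is how XLNet captures bidirectional context without BERT-style [MASK] tokens.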
Sequence-to-sequence models combine the transformer’s encoder and decoder. They can be adapted to many tasks but are most commonly used for translation, summarization, and question answering.
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
BART by Meta is a sequence-to-sequence model for language generation, translation, and comprehension. It consists of an encoder and a decoder: the encoder receives corrupted tokens, while the decoder works with the original tokens, masking future words. During pretraining, the encoder’s input is corrupted with various transformations, such as masking random tokens, deleting tokens, replacing spans of tokens with a single mask token, permuting sentences, and rotating the document to start at a specific token.
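A few of these noising transformations are simple enough to sketch directly. The functions below are illustrative stand-ins, not BART's actual preprocessing code, but they show the shape of token masking, token deletion, and text infilling (a span collapsed into one mask token).

```python
# Hedged sketches of three BART-style noising transformations.
import random

def mask_tokens(tokens, rate, rng):
    """Replace each token with <mask> independently at the given rate."""
    return [t if rng.random() > rate else "<mask>" for t in tokens]

def delete_tokens(tokens, rate, rng):
    """Drop each token independently at the given rate."""
    return [t for t in tokens if rng.random() > rate]

def infill_span(tokens, start, length):
    """Text infilling: replace a whole span with a single <mask> token."""
    return tokens[:start] + ["<mask>"] + tokens[start + length:]

toks = ["the", "cat", "sat", "on", "the", "mat"]
print(infill_span(toks, 1, 3))   # ['the', '<mask>', 'the', 'mat']
```

Infilling is the interesting one: because several tokens hide behind one mask, the decoder must also learn how many tokens are missing, not just which ones.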
Following the same encoder-decoder architecture as BART, PEGASUS employs two self-supervised objectives for pre-training: Masked Language Modeling (MLM) and Gap Sentence Generation (GSG), the latter aimed at summarization. In MLM, random tokens are masked and predicted by the encoder, akin to BERT. In GSG, entire sentences are masked in the encoder input and fed to the decoder, which has a causal mask so it cannot see future words. Unlike BART’s, PEGASUS’s pretraining task closely resembles summarization: significant sentences are masked and generated together as one output sequence from the remaining sentences, much like an extractive summary.
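GSG can be sketched in a few lines. One caveat on this toy version: PEGASUS selects the "important" sentences automatically by scoring each sentence against the rest of the document (ROUGE-based), whereas here their indices are simply passed in for illustration.

```python
# Toy sketch of Gap Sentence Generation (GSG): masked sentences become
# the generation target, the remaining sentences become the input.
# The <mask_sent> token name here is illustrative.

def gap_sentence_generation(sentences, gap_indices):
    inp = ["<mask_sent>" if i in gap_indices else s
           for i, s in enumerate(sentences)]
    target = " ".join(sentences[i] for i in sorted(gap_indices))
    return inp, target

doc = ["A storm hit the coast.", "Power failed citywide.", "Crews responded."]
inp, tgt = gap_sentence_generation(doc, {1})
print(inp)   # ['A storm hit the coast.', '<mask_sent>', 'Crews responded.']
print(tgt)   # 'Power failed citywide.'
```

Because the model must regenerate whole missing sentences from their surrounding document, the pretraining objective is already very close to abstractive summarization.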
Microsoft developed Marian, a powerful and self-contained Neural Machine Translation system. It includes an integrated automatic differentiation engine based on dynamic computation graphs. Marian is entirely written in C++. The system’s encoder-decoder framework is designed to achieve both high training efficiency and fast translation speeds, making it a research-friendly toolkit.
In T5 by Google, the traditional transformer model is modified with positional embeddings learned at each layer. It handles various NLP tasks by casting them as text-to-text problems using specific prefixes like “summarize:”, “question:”, “translate English to German:”, etc. Pretraining combines supervised and self-supervised training. Supervised training uses GLUE and SuperGLUE benchmark tasks converted into text-to-text format. Self-supervised training corrupts the input sentence by randomly dropping out 15% of its tokens and replacing each span of dropped tokens with a single sentinel token. The encoder takes the corrupted sentence, the decoder takes the original sentence, and the target consists of the dropped-out tokens delimited by their sentinel tokens.
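This span-corruption scheme is easy to sketch. In the toy version below the spans to drop are passed in explicitly, whereas T5 samples them randomly; the `<extra_id_N>` sentinel naming follows T5's convention, but the function itself is illustrative.

```python
# Illustrative sketch of T5-style span corruption: dropped spans are
# replaced by sentinel tokens in the input, and the target lists the
# dropped tokens delimited by the same sentinels.

def corrupt(tokens, spans):
    """spans: non-overlapping (start, end) index pairs to drop."""
    inp, tgt = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp += tokens[prev:start] + [sentinel]
        tgt += [sentinel] + tokens[start:end]
        prev = end
    inp += tokens[prev:]
    return inp, tgt

inp, tgt = corrupt(["Thank", "you", "for", "inviting", "me"], [(1, 2), (3, 4)])
print(inp)  # ['Thank', '<extra_id_0>', 'for', '<extra_id_1>', 'me']
print(tgt)  # ['<extra_id_0>', 'you', '<extra_id_1>', 'inviting']
```

Note that the target is much shorter than the input, which makes this objective cheaper to train on than reconstructing whole sentences.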
Meta’s mBART shares BART’s architecture and training objective but stands out by being trained on 25 different languages. Its main purpose is to excel at both supervised and unsupervised machine translation. mBART pioneered pre-training a complete sequence-to-sequence model across many languages by denoising full texts.