Recently, Google’s team introduced PaLM, a 540-billion-parameter dense decoder-only Transformer model trained with Google’s own Pathways system. The researchers demonstrated that the model achieves state-of-the-art few-shot performance across most tasks, in some cases by a significant margin.
Among the model’s many interesting features, one that stands out is its decoder-only architecture. PaLM is not alone in this: some of the most popular and widely used language models are decoder-only.
In the last few years, large neural networks have achieved impressive results across a wide range of tasks. Models like BERT and T5 are built on encoder-only and encoder-decoder architectures, respectively. These models have demonstrated near-universal state-of-the-art performance across thousands of natural language tasks. The downside is that they require a significant number of task-specific training examples for finetuning, and at least a portion of the model parameters must be updated to fit each task, which adds complexity to finetuning and deployment.
GPT-3 demonstrated that large autoregressive language models can be used for few-shot predictions. This class of models is generally trained with a decoder-only architecture and a standard left-to-right language modelling objective on a large text corpus, where the objective is to predict the next token given the previous tokens. GPT-3’s predecessors, GPT and GPT-2, also used a decoder-only architecture.
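The left-to-right objective described above can be illustrated with a minimal sketch. The helper names (next_token_examples, average_next_token_loss, uniform_model) and the four-word vocabulary are hypothetical and chosen only for illustration; a real model would replace the uniform toy predictor with a trained network.

```python
import math

def next_token_examples(tokens):
    """Split a sequence into (context, target) pairs, read left to right."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def average_next_token_loss(tokens, predict_probs):
    """Mean negative log-likelihood of each token given the tokens before it.

    predict_probs is a hypothetical stand-in for a model: it maps a context
    (a list of tokens) to a dict of {token: probability}.
    """
    examples = next_token_examples(tokens)
    total = 0.0
    for context, target in examples:
        probs = predict_probs(context)
        total += -math.log(probs[target])
    return total / len(examples)

def uniform_model(context, vocab=("the", "cat", "sat", "down")):
    # Toy predictor (assumption): ignores the context and spreads
    # probability evenly over a four-word vocabulary.
    return {tok: 1.0 / len(vocab) for tok in vocab}

sentence = ["the", "cat", "sat", "down"]
pairs = next_token_examples(sentence)
loss = average_next_token_loss(sentence, uniform_model)

for context, target in pairs:
    print(context, "->", target)
print("average loss:", loss)
```

Every position in the text supplies one training example, which is part of why this objective scales well to large unlabelled corpora.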
The original Transformer model is made of both an encoder and a decoder, each forming a separate stack. This architecture fits well with its primary application – machine translation. The authors of the 2017 paper, ‘Attention Is All You Need’, noted that most competitive neural sequence transduction models have an encoder-decoder structure, where the encoder maps the input sequence of symbol representations to a sequence of continuous representations, after which the decoder stack generates an output sequence of symbols one element at a time. The model is autoregressive at each step, meaning it consumes the previously generated symbols as additional input when generating the next. The Transformer follows this overall architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder.
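The autoregressive loop described above, in which each generated symbol is fed back as input for the next step, might be sketched as follows. The function greedy_decode, the BIGRAMS lookup table, and the special tokens are hypothetical; the toy predictor only looks at the last token, whereas a real decoder conditions on the whole sequence so far.

```python
def greedy_decode(next_token, prompt, end_token, max_new_tokens=10):
    """Generate one token at a time, feeding each output back in as input."""
    generated = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(generated)   # model sees everything produced so far
        generated.append(token)
        if token == end_token:          # stop once the end marker is emitted
            break
    return generated

# Toy stand-in for a trained decoder (assumption): a fixed lookup table.
BIGRAMS = {"<s>": "hello", "hello": "world", "world": "</s>"}

def toy_next_token(tokens):
    return BIGRAMS.get(tokens[-1], "</s>")

output = greedy_decode(toy_next_token, ["<s>"], end_token="</s>")
print(output)
```

The max_new_tokens cap is a practical safeguard, since nothing guarantees that a model will ever emit the end token.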
The attention mechanism at the heart of the Transformer means that the model no longer needs to encode the full source sentence into a fixed-length vector. Instead, the decoder ‘attends’ to different parts of the source sentence at each step of output generation. The model learns what to attend to based on the input sentence and what it has output so far.
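As a rough illustration of how a decoder step can ‘attend’ to source positions, the sketch below uses scaled dot-product attention, the form used in the 2017 Transformer paper, written in plain Python. The vectors are made-up toy values, and reusing the source states as both keys and values is a simplifying assumption; real models apply learned projections first.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Context vector: weighted average of the value vectors.
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# Toy source-sentence states, one vector per source token (assumption).
source_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Toy decoder state at the current output step (assumption).
decoder_query = [0.0, 2.0]

weights, context = attend(decoder_query, source_states, source_states)
print("attention weights:", weights)
print("context vector:", context)
```

Because the weights are recomputed at every output step, the decoder can focus on a different part of the source each time instead of relying on one fixed summary vector.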
Subsequently, researchers from Google Brain published a paper titled ‘Generating Wikipedia by Summarizing Long Sequences’. The study proposed another arrangement of Transformer blocks for language modelling, one that omitted the encoder block. For this research, the team introduced a decoder-only sequence transduction model for the abstractive stage and demonstrated that it can handle very long input-output examples. The model outperformed traditional encoder-decoder architectures on long sequences, allowing the team to condition on many reference documents and generate coherent and informative Wikipedia articles.
What performs better
As discussed above, large pretrained Transformer language models exhibit zero-shot generalisation. However, the architectures and pretraining objectives used across state-of-the-art models differ significantly. In a paper titled ‘What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?’, the authors presented a large-scale evaluation of modelling choices and their impact on zero-shot generalisation. They focused on text-to-text models and experimented with three model architectures – causal decoder-only, non-causal decoder-only, and encoder-decoder. These models were trained with two different pretraining objectives, autoregressive and masked language modelling, and evaluated with and without multitask prompted finetuning.
Their experiments showed that causal decoder-only models trained on an autoregressive language modelling objective demonstrate the best zero-shot generalisation after purely unsupervised pretraining. They also found that a pretrained non-causal decoder model can be adapted into a performant generative causal decoder model by using autoregressive language modelling as a downstream task, and that a pretrained causal decoder model can likewise be adapted into a non-causal decoder model.
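The difference between the causal and non-causal decoder variants compared above comes down to which positions each token may attend to. The sketch below assumes the usual meaning of the terms: a causal decoder lets every token see only itself and earlier tokens, while a non-causal (prefix) decoder additionally lets all tokens see the whole input prefix. The function names and the example sizes are hypothetical.

```python
def causal_mask(n):
    """mask[i][j] == 1 when position i may attend to position j."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def non_causal_mask(n, prefix_len):
    """Like causal_mask, but the first prefix_len positions are fully visible.

    Positions inside the prefix attend to each other in both directions;
    positions after the prefix remain left-to-right.
    """
    return [[1 if (j <= i or j < prefix_len) else 0 for j in range(n)]
            for i in range(n)]

# Four tokens in total; in the non-causal case the first two are the input.
for row in causal_mask(4):
    print(row)
print()
for row in non_causal_mask(4, prefix_len=2):
    print(row)
```

Since the two variants share the same parameters and differ only in this mask, it is plausible that a model pretrained with one mask can be adapted to the other with further training.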