Large Transformer language models continue to benefit from bigger architectures and growing amounts of data. Since 2018, large language models such as BERT, GPT-2 and GPT-3 have shown that a wide array of tasks can be performed using few-shot learning. Models like Microsoft and NVIDIA’s Megatron-Turing Natural Language Generation, with 530 billion parameters; the full version of the Generalist Language Model (GLaM), with 1.2 trillion parameters; LaMDA, or Language Models for Dialogue Applications, with 137 billion parameters; and Gopher, with 280 billion parameters, have marked the past few years largely because of their sheer size. Has the desire to build bigger and bigger models become a mindless race?
A new paper released by Google AI disagrees with this assumption. The study’s results reiterate that larger models are more sample-efficient than smaller ones because they apply transfer learning better. Alongside the paper, the team announced PaLM, or Pathways Language Model, a 540-billion-parameter, decoder-only Transformer model.
In October last year, the Google Research team introduced Pathways, a new AI architecture intended to function more like a human brain. Traditionally, an AI model is trained to specialise in a single task. With Pathways, a single AI model can be generalised across a million different tasks. Pathways also enables the model to learn new tasks faster. Most models handle just one modality: they can process either images, text or speech. Pathways is designed so that one AI model can perform tasks across all modalities.
Unlike “dense” models, which employ their entire neural network to complete every task, the Pathways architecture learns to route each task only through the portion of the network that is relevant to it. This makes the model more energy efficient and leaves it more capacity to learn new tasks.
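To make the contrast with dense models concrete, here is a minimal Python sketch of sparse routing. It is a generic top-1 gating illustration, not the Pathways routing mechanism, whose details are not described here. The experts, the gate scores, and the names `EXPERTS`, `route` and `dense` are all hypothetical.

```python
# Toy illustration of sparse activation: only one "expert" runs per input.
# Generic sketch; NOT the actual Pathways routing mechanism.

def double(x):
    return 2 * x

def square(x):
    return x * x

def negate(x):
    return -x

# Hypothetical experts: each one stands in for a sub-network.
EXPERTS = [double, square, negate]

def route(x, gate_scores):
    """Run only the expert with the highest gate score (top-1 routing)."""
    if len(gate_scores) != len(EXPERTS):
        raise ValueError("need one gate score per expert")
    best = max(range(len(EXPERTS)), key=lambda i: gate_scores[i])
    return best, EXPERTS[best](x)

def dense(x):
    """Dense baseline: every expert runs for every input."""
    return [expert(x) for expert in EXPERTS]
```

Calling `route(3, [0.1, 0.7, 0.2])` evaluates a single expert, while `dense(3)` evaluates all three. The saving in a real sparse model comes from skipping the unused sub-networks in the same way.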
PaLM was trained using the Pathways system and evaluated on hundreds of language understanding and generation tasks. It is also the first large-scale use of the Pathways system, scaling training to 6,144 chips, the largest TPU-based configuration used for training so far. Whereas previous large language models like GLaM and LaMDA were trained on a single TPU v3 Pod, PaLM was trained across two Cloud TPU v4 Pods using data parallelism at the Pod level.
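In its simplest form, data parallelism means each replica computes gradients on its own shard of the batch, and the gradients are averaged before a single shared update. The sketch below simulates two replicas, standing in for the two Pods, fitting a one-parameter linear model. The model, the data and the function names are illustrative assumptions, not PaLM’s training code.

```python
# Toy data parallelism: two replicas, one shared weight, averaged gradients.
# Illustrative only; real systems do this across accelerators with collective ops.

def grad_mse(w, shard):
    """Gradient of mean squared error for y = w * x over one shard of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr):
    """Each replica computes a local gradient; their average drives one update."""
    local_grads = [grad_mse(w, shard) for shard in shards]
    avg_grad = sum(local_grads) / len(local_grads)
    return w - lr * avg_grad

# Hypothetical data generated by y = 3x, split evenly across two replicas.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, lr=0.01)
```

Because the shards are the same size, the averaged gradient equals the gradient over the full batch, so the replicas stay in sync while each one only touches half of the data.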
The model was trained on a mix of English and multilingual datasets that included web documents, books, Wikipedia, GitHub code and conversations. The team also created a “lossless” vocabulary that preserves all whitespace, which is especially important for code, splits out-of-vocabulary Unicode characters into bytes, and splits numbers into individual digits.
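The three properties of such a vocabulary can be sketched with a toy pre-tokenizer. This is a character-level illustration with a hypothetical vocabulary, not PaLM’s actual tokenizer; the names `VOCAB`, `tokenize` and `detokenize`, and the `<0xNN>` byte-token format, are assumptions made for the example.

```python
# Toy "lossless" pre-tokenizer: keeps whitespace, splits numbers into digits,
# and falls back to UTF-8 bytes for out-of-vocabulary characters.
# Character-level sketch with a hypothetical vocabulary; not PaLM's tokenizer.
import string

# Hypothetical vocabulary: ASCII letters, digits, punctuation and whitespace.
VOCAB = set(string.ascii_letters + string.digits + string.punctuation + " \t\n")

def tokenize(text):
    tokens = []
    word = ""
    for ch in text:
        if ch.isalpha() and ch in VOCAB:
            word += ch  # in-vocabulary letters accumulate into a word token
            continue
        if word:
            tokens.append(word)
            word = ""
        if ch in VOCAB:
            tokens.append(ch)  # each digit, space or symbol is its own token
        else:
            # Out-of-vocabulary character: emit one token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    if word:
        tokens.append(word)
    return tokens

def detokenize(tokens):
    out = bytearray()
    for tok in tokens:
        if len(tok) == 6 and tok.startswith("<0x") and tok.endswith(">"):
            out.append(int(tok[3:5], 16))  # byte token back to a raw byte
        else:
            out.extend(tok.encode("utf-8"))
    return out.decode("utf-8")
```

The point of the design is that `detokenize(tokenize(s))` returns `s` exactly, including indentation and newlines, which is what makes such a scheme suitable for source code.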
Language understanding and generation: PaLM was tested on 29 of the most commonly used English NLP tasks and outperformed its predecessors on 28 of them. These included sentence completion, Winograd-style tasks, reading comprehension, reasoning and natural language inference tasks. PaLM also performed well on multilingual NLP benchmarks even though only 22% of its training corpus is non-English.
The study found that the model’s performance as a function of scale follows log-linear behaviour, as with previous models, which suggests that the gains from scaling have not yet plateaued. The model was compared against Gopher and Chinchilla. PaLM demonstrated impressive contextual understanding, to the extent that it could even guess the name of a film from emojis.
Reasoning: Combined with chain-of-thought prompting, the model solved reasoning problems involving common sense and multi-step arithmetic. PaLM was evaluated on three arithmetic and two commonsense reasoning datasets. In arithmetic, it solved 58% of the problems in GSM8K, a dataset of challenging grade school level maths questions, using 8-shot prompting. This improves on the previous best of 55%, which was achieved by fine-tuning GPT-3 on a training set of 7,500 problems and combining it with an external calculator and verifier.
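A few-shot chain-of-thought prompt simply shows the model worked examples whose answers spell out the intermediate steps, then asks a new question in the same format. The sketch below builds such a prompt and extracts the final answer from a completion. The exemplars, the `Q:`/`A:` layout and the "The answer is" convention are illustrative assumptions, and only two exemplars are shown where an 8-shot prompt would use eight.

```python
# Sketch of few-shot chain-of-thought prompting: each exemplar shows its
# reasoning before the final answer. Exemplars and format are hypothetical.
import re

EXEMPLARS = [
    {
        "question": "A box holds 4 pens. How many pens are in 3 boxes?",
        "reasoning": "Each box holds 4 pens. 3 boxes hold 3 * 4 = 12 pens.",
        "answer": "12",
    },
    {
        "question": "Mia had 10 apples and gave away 3. How many are left?",
        "reasoning": "Mia started with 10 apples. After giving away 3 she has 10 - 3 = 7.",
        "answer": "7",
    },
]

def build_prompt(exemplars, question):
    """Join worked examples, then leave the final answer open for the model."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Q: {ex['question']}\nA: {ex['reasoning']} The answer is {ex['answer']}."
        )
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

def extract_answer(completion):
    """Pull the number that follows 'The answer is', or None if it is absent."""
    match = re.search(r"The answer is (-?\d+(?:\.\d+)?)", completion)
    return match.group(1) if match else None
```

The string returned by `build_prompt` would be sent to the language model, and `extract_answer` applied to its completion to score the result against the dataset’s reference answer.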
PaLM could also explain an entirely original joke, a task that requires complex multi-step logical inference and deep language understanding.
Code generation: PaLM, whose pre-training data contained only 5% code, generalised well to writing code using few-shot learning. Its performance was on par with OpenAI’s Codex even though its training dataset contained 50 times less Python code.
Fine-tuning PaLM on a Python-only code dataset produced a model known as PaLM-Coder. On a code repair task called DeepFix, where the goal is to modify initially broken C programs until they compile, PaLM-Coder achieved a compile rate of 82.1%, beating the previous state of the art of 71.7%. This suggests that PaLM-Coder could eventually solve more complex coding problems.
PaLM combined its parallelism strategy with a reformulated Transformer block in which the attention and feedforward layers are computed in parallel. This enabled speedups from TPU compiler optimisations, and PaLM reached a training efficiency of 57.8% hardware FLOPs utilisation, the highest yet achieved by a large language model at this scale.
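The difference between a serial and a parallel block is structural and can be shown on a single vector. In the sketch below, `toy_attention` and `toy_mlp` are hypothetical stand-ins for the real sublayers, and the serial version follows the common pre-layer-norm layout; none of this is PaLM’s code.

```python
# Structural sketch of a serial vs. parallel Transformer block on one vector.
# `toy_attention` and `toy_mlp` are hypothetical stand-ins for real sublayers.
import math

def layer_norm(x, eps=1e-6):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_attention(x):
    return [0.5 * v for v in x]

def toy_mlp(x):
    return [max(0.0, v) for v in x]

def add(*vectors):
    """Element-wise sum of any number of equal-length vectors."""
    return [sum(vals) for vals in zip(*vectors)]

def serial_block(x):
    # Serial layout: the MLP has to wait for the attention output.
    h = add(x, toy_attention(layer_norm(x)))
    return add(h, toy_mlp(layer_norm(h)))

def parallel_block(x):
    # Parallel layout: both sublayers read the same normalised input,
    # so neither depends on the other and they can be computed together.
    n = layer_norm(x)
    return add(x, toy_attention(n), toy_mlp(n))
```

In `parallel_block` the two sublayers share one input, which is what allows their matrix multiplications to be fused or overlapped by a compiler. In `serial_block` the data dependency between them rules that out.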
PaLM’s breakthrough performance suggests that, with ethical considerations kept in mind, it could be the first step towards building more capable models that scale even further using the Pathways system.