Earlier this year, researchers from Google Brain unveiled Switch Transformer, a natural-language processing (NLP) model with 1.6 trillion parameters that achieved up to 7x faster pre-training than the T5 NLP model with comparable accuracy. The source code for Switch Transformer is available on GitHub.
In a paper titled ‘Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity,’ the researchers said their model simplified the mixture-of-experts (MoE) routing algorithm to design intuitive, improved models with reduced communication and computational costs. Their proposed training techniques mitigated instabilities and showed that large sparse models could be trained in lower-precision (bfloat16) formats. “Our large sparse models can be distilled into small dense versions while preserving 30 per cent of the sparse model quality gain,” according to the Google researchers.
MoE was first introduced in 1991 by a research group that included deep-learning pioneer Geoffrey Hinton. In 2017, the Google Brain team and Hinton used MoE to create an NLP model based on recurrent neural networks (RNN) with 137 billion parameters, which achieved state-of-the-art (SOTA) results on language modelling and machine translation benchmarks.
Some of the key highlights of Switch Transformer include:
- The Switch Transformer is based on the T5-Base and T5-Large models. Introduced by Google in 2019, T5 is a transformer-based architecture that uses a text-to-text approach.
- Besides the T5 models, Switch Transformer uses hardware initially designed for dense matrix multiplication and already used in language models, such as TPUs and GPUs.
- Switch Transformer models were pretrained utilising 32 TPUs on the Colossal Clean Crawled Corpus, a 750 GB dataset composed of text snippets from Wikipedia, Reddit and others. The models were used to predict missing words in passages where 15 per cent of the words were masked.
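The masked-prediction objective mentioned in the last point can be illustrated with a minimal Python sketch. The tokenisation, the `<mask>` placeholder and the random sampling below are simplified assumptions for illustration, not the exact preprocessing used on the Colossal Clean Crawled Corpus.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="<mask>", seed=0):
    """Hide roughly 15 per cent of the tokens; the model is trained to predict them."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)   # the model sees a placeholder here...
            targets[i] = tok            # ...and must recover the original token
        else:
            masked.append(tok)
    return masked, targets

tokens = "switch transformers route each token to a single expert".split()
masked, targets = mask_tokens(tokens)
print(masked)   # input with about 15 per cent of words replaced by <mask>
print(targets)  # positions and original words the model should predict
```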
Experiments
The researchers established a distributed training setup for the experiments, in which the models split their unique weights across different devices. Thus, while the total number of weights grows in proportion to the number of devices, the memory and computational load on each device remains manageable.
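A back-of-the-envelope Python sketch of this property, using illustrative layer sizes rather than the paper’s actual configurations: adding experts (and the devices that hold them) multiplies the parameter count, while the FLOPs each token incurs stay flat because every token only visits one expert FFN.

```python
# Illustrative numbers only; not the actual Switch Transformer configurations.
d_model, d_ff = 1024, 4096               # hypothetical hidden and FFN sizes
ffn_params = 2 * d_model * d_ff          # one expert FFN: two weight matrices
ffn_flops_per_token = 2 * ffn_params     # roughly 2 FLOPs per weight in a forward pass

for num_experts in (1, 8, 64, 512):
    total_params = num_experts * ffn_params
    # Each token is routed to exactly one expert, so per-token compute stays flat
    # even though the total parameter count (and device count) keeps growing.
    print(f"experts={num_experts:4d}  total FFN params={total_params:,}  "
          f"FLOPs per token={ffn_flops_per_token:,}")
```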
In a Switch Transformer feed-forward neural network (FFN) layer, each token passes through a router function, which sends it to a single FFN, known as an ‘expert.’ Because each token passes through only one FFN, the computation does not increase with the number of experts.
“We replace the dense feed-forward network (FFN) layer present in the Transformer with a sparse Switch FFN layer. The layer operates independently on the tokens in the sequence. We diagram two tokens (x1 = ‘More’ and x2 = ‘Parameters’ below) being routed across four FFN experts, where the router independently routes each token. The switch FFN layer returns the output of the selected FFN multiplied by the router gate value,” wrote Google researchers. The image below illustrates the Switch Transformer encoder block.
Switch Transformer encoder block (Source: arXiv)
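A minimal NumPy sketch of the top-1 (‘switch’) routing described above, with made-up dimensions and randomly initialised weights; it mirrors the quoted description rather than the Mesh-TensorFlow implementation Google released.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, num_experts, num_tokens = 8, 32, 4, 2

# Two token embeddings, standing in for x1 = "More" and x2 = "Parameters".
tokens = rng.normal(size=(num_tokens, d_model))
router_w = rng.normal(size=(d_model, num_experts))        # router weights
# One FFN ("expert") per slot: W_in (d_model x d_ff) and W_out (d_ff x d_model).
experts = [(rng.normal(size=(d_model, d_ff)) * 0.1,
            rng.normal(size=(d_ff, d_model)) * 0.1) for _ in range(num_experts)]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

outputs = np.zeros_like(tokens)
for i, x in enumerate(tokens):
    gate = softmax(x @ router_w)        # router probabilities over the experts
    e = int(np.argmax(gate))            # top-1: each token visits a single expert
    w_in, w_out = experts[e]
    h = np.maximum(x @ w_in, 0.0)       # the selected expert FFN with ReLU
    outputs[i] = gate[e] * (h @ w_out)  # scale the output by the router gate value
print(outputs.shape)                    # (2, 8): one output vector per token
```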
Switch Transformer vs Others
The transformer architecture has become the preferred deep-learning model for NLP research. Many efforts have been directed towards increasing the size of these models, primarily measured in the number of parameters. BAAI’s 1.75-trillion-parameter Wu Dao 2.0 and OpenAI’s 175-billion-parameter GPT-3, alongside Hugging Face’s DistilBERT and Google’s GShard, are other popular language models.
Compared to Google’s T5 NLP model, the baseline version of the Switch Transformer achieved a target pre-training perplexity in one-seventh of the training time. It also outperformed the T5-XXL model on the perplexity metric, with comparable or better performance on downstream NLP tasks, despite training on half the data.
To develop Switch Transformer, the Google team decided to maximise the parameter count while keeping the number of FLOPs per training example constant and training on a ‘relatively small amount of data.’
Wrapping up
Google believes that Switch Transformers are scalable and effective natural language learners. The team said that they simplified MoE to produce an architecture that is easy to understand, stable to train and more sample efficient than ‘equivalently-sized dense models.’
“We find that these models excel across a diverse set of ‘natural language tasks’ and in different training regimes, including ‘pre-training,’ ‘fine-tuning’ and multi-task training,” as per the Google Brain team.