Google has developed and benchmarked Switch Transformers, a technique for training language models with over a trillion parameters. The research team said the 1.6-trillion-parameter model is the largest of its kind and achieves better speeds than T5-XXL, the Google model that previously held the title.
Switch Transformer
According to the researchers, Mixture of Experts (MoE) models, despite being more effective than other deep learning models, face issues due to their complexity, lack of accessibility, and computational costs. As opposed to traditional models that use the same parameters for all inputs, MoE selects different parameters for each input. While MoE yields a sparsely activated model, it also leads to a massive number of parameters, resulting in the disadvantages discussed above.
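As a rough sketch of that idea, and not the authors' implementation, the NumPy snippet below routes each input to a small subset of "expert" weight matrices, so different inputs use different parameters while the total parameter count grows with the number of experts. All names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2   # illustrative sizes

# Each expert is its own weight matrix, so parameters grow with n_experts.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts))

def moe_layer(x):
    """Route a single token vector x to its top-k experts (classic MoE)."""
    logits = x @ router_w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    chosen = np.argsort(probs)[-top_k:]   # different inputs pick different experts
    return sum(probs[i] * (x @ experts[i]) for i in chosen)

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)   # (8,) -- only top_k of the 4 experts were used
```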
Google researchers developed the Switch Transformer to create a system that increases the parameter count while keeping the floating-point operations (FLOPs) per input constant. It does this by using only a portion of the model's weights, or parameters, for each input to the model.
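A minimal sketch of how that works, again with illustrative names rather than the paper's code: by routing each token to exactly one expert, the per-token compute stays roughly the same no matter how many experts (and thus parameters) the layer holds.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 8

def switch_layer(x, experts, router_w):
    """Top-1 routing: each token uses exactly one expert's weights."""
    logits = x @ router_w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    best = int(np.argmax(probs))              # single expert per token
    return probs[best] * (x @ experts[best])  # one matmul per token

# Doubling the expert count doubles the layer's parameters,
# but each token still performs a single d_model x d_model matmul.
for n_experts in (4, 8, 16):
    experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
    router_w = rng.standard_normal((d_model, n_experts))
    token = rng.standard_normal(d_model)
    _ = switch_layer(token, experts, router_w)
    print(n_experts, "experts ->", n_experts * d_model * d_model,
          "expert params, 1 matmul per token")
```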
The Experiment
The Switch Transformer is based on the T5-Base and T5-Large models. In T5 (introduced by Google in 2019), all NLP tasks are unified into a text-to-text format where both the input and output are always text strings.
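For illustration, in the text-to-text framing every task is phrased as a string-to-string mapping; the examples below follow the prompt conventions shown in the T5 paper.

```python
# Every task -- translation, acceptability judgement, summarization -- becomes "text in, text out".
examples = [
    ("translate English to German: That is good.", "Das ist gut."),
    ("cola sentence: The course is jumping well.", "not acceptable"),
    ("summarize: state authorities dispatched emergency crews tuesday ...",
     "six people hospitalized after a storm ..."),
]
for source, target in examples:
    print(f"{source!r} -> {target!r}")
```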
In addition to building on the T5 models, Switch Transformers make efficient use of hardware designed for dense matrix multiplication, such as GPUs and TPUs, which is also widely used to train language models.
The researchers established a distributed training setup for the experiment, with the models splitting unique weights across different devices. While the total weights increase in proportion to the number of devices, the memory and computational footprint of each device remain manageable.
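A toy sketch of that partitioning (device names and sizes are made up for illustration): each simulated device holds only its own experts' weights, so per-device memory stays flat while the total parameter count scales with the number of devices.

```python
import numpy as np

d_model, experts_per_device = 8, 2
params_per_expert = d_model * d_model

def shard_experts(n_devices, seed=0):
    """Give each simulated device its own unique expert weights."""
    rng = np.random.default_rng(seed)
    return {
        f"device_{d}": [rng.standard_normal((d_model, d_model))
                        for _ in range(experts_per_device)]
        for d in range(n_devices)
    }

for n_devices in (2, 8, 32):
    shards = shard_experts(n_devices)
    per_device = experts_per_device * params_per_expert   # constant per device
    total = n_devices * per_device                        # grows with device count
    print(f"{n_devices:>2} devices: {per_device} params/device, {total} params total")
```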
Switch Transformer models, using 32 TPUs, were pretrained on the Colossal Clean Crawled Corpus, a 750 GB dataset of text scraped from sources such as Reddit and Wikipedia. For the experiment, the Switch Transformer models were used to predict missing words in passages where 15% of the words were masked. Other tasks included language translation and answering a series of difficult questions.
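A minimal sketch of that pretraining objective (the token handling is simplified and only loosely mirrors T5-style span corruption): roughly 15% of the words are hidden behind placeholders and the model is trained to reconstruct them.

```python
import random

def mask_words(text, mask_rate=0.15, seed=42):
    """Replace ~15% of the words with placeholders the model must predict."""
    random.seed(seed)
    words = text.split()
    masked, targets = [], []
    for i, word in enumerate(words):
        if random.random() < mask_rate:
            masked.append(f"<mask_{len(targets)}>")
            targets.append((i, word))
        else:
            masked.append(word)
    return " ".join(masked), targets

inputs, targets = mask_words("the quick brown fox jumps over the lazy dog near the river bank")
print(inputs)    # passage with ~15% of words replaced by <mask_i> placeholders
print(targets)   # the hidden words the model is trained to recover
```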
Performance of Switch Transformer Model
The researchers claimed that the model performed better than the smaller Switch-XXL model, which has roughly 400 billion parameters. Further, the new model didn't manifest any training instability.
The Switch Transformer also showed marked improvement on downstream tasks. The model achieved up to seven times the pretraining speed of its T5 baseline while using the same amount of computational resources.
On the translation front, the Switch Transformer model, which was trained to translate between more than 100 languages, did so at over four times the speed of the baseline for 91% of those languages.
However, the model's performance was unsatisfactory compared to the baseline model on the Stanford Question Answering Dataset (SQuAD). The researchers chalked it up to “a poorly understood dependence between fine-tuning quality, FLOPS per token and number of parameters.”
Looking Forward
The current approach falls within the family of adaptive computation algorithms and uses identical, homogeneous experts. In the future, the researchers hope to support heterogeneous experts, facilitated by a more flexible infrastructure.
Further, the researchers plan to apply the Switch Transformer to different modalities as well as multi-modal networks. Like other large language models such as GPT-2, GPT-3, BERT, and RoBERTa, the Switch Transformer is also susceptible to biases. Such biases may result in the spread of misinformation, phishing, abuse of legal and governmental processes, and radical social engineering.