A Deep Dive into Switch Transformer Architecture

Earlier this year, researchers from Google Brain unveiled Switch Transformer, a natural-language processing (NLP) model with 1.6 trillion parameters that trains up to 7x faster than the T5 NLP model at comparable accuracy. The source code for Switch Transformer is available on GitHub.

In a paper titled ‘Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity,’ the researchers said they simplified the mixture-of-experts (MoE) routing algorithm and designed intuitive, improved models with reduced communication and computational costs. Their proposed training techniques mitigated training instabilities and showed that large sparse models could be trained with lower-precision (bfloat16) formats. “Our large sparse models can be distilled into small dense versions while preserving 30 per cent of the sparse model quality gain,” according to the Google researchers.

MoE was first introduced in 1991 by a research group that included deep-learning pioneer Geoff Hinton. In 2017, a Google Brain team that included Hinton used MoE to create an NLP model based on recurrent neural networks (RNN) with 137 billion parameters, which achieved state-of-the-art (SOTA) results on language modelling and machine translation benchmarks.


Some of the key highlights of Switch Transformer include: 

  • The Switch Transformer is based on the T5-Base and T5-Large models. Introduced by Google in 2019, T5 is a transformer-based architecture that uses a text-to-text approach.
  • In addition to building on T5, Switch Transformer runs on hardware originally designed for dense matrix multiplication and widely used for language models, such as TPUs and GPUs.
  • Switch Transformer models were pretrained on 32 TPUs using the Colossal Clean Crawled Corpus, a 750 GB dataset composed of text snippets from Wikipedia, Reddit and other web sources. The models were trained to predict missing words in passages where 15 per cent of the words were masked.
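
As a rough illustration of the masked-word objective in the last bullet, the sketch below hides about 15 per cent of the words in a passage and keeps the originals as prediction targets. The function name `mask_tokens`, the `<mask>` placeholder and the whitespace tokenisation are assumptions made for this example; they are not taken from the Switch Transformer code, which operates on subword tokens.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, chosen for this example


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, targets) with roughly mask_rate of tokens hidden."""
    rng = random.Random(seed)
    # Mask at least one token, but never more than the passage contains.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]  # the word the model must predict
        masked[pos] = MASK
    return masked, targets


passage = ("the quick brown fox jumps over the lazy dog "
           "near the quiet river bank today").split()
masked, targets = mask_tokens(passage)
print(" ".join(masked))
print(targets)
```

During pre-training, the model would see the masked passage as input and be scored on how well it recovers the words stored in `targets`.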


The researchers established a distributed training setup for the experiments in which the models split unique weights across different devices. Thus, although the total number of weights increases in proportion to the number of devices, each device’s memory and computational footprint remains manageable.
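
The scaling argument can be made concrete with a little arithmetic. The sketch below assumes exactly one expert per device, weights stored in bfloat16, and hypothetical layer sizes (`d_model=768`, `d_ff=3072`); none of these values or helper names come from the paper.

```python
BYTES_PER_BF16 = 2  # bfloat16 stores each weight in two bytes


def expert_weight_count(d_model, d_ff):
    """Weights in one expert FFN: two projection matrices, biases ignored."""
    return 2 * d_model * d_ff


def sharded_footprint(num_devices, d_model=768, d_ff=3072):
    """Return (total_bytes, per_device_bytes), assuming one expert per device."""
    per_device_bytes = expert_weight_count(d_model, d_ff) * BYTES_PER_BF16
    total_bytes = per_device_bytes * num_devices
    return total_bytes, per_device_bytes


for devices in (1, 8, 32):
    total, per_device = sharded_footprint(devices)
    print(f"{devices:>2} devices: total {total / 2**20:.0f} MiB, "
          f"per device {per_device / 2**20:.0f} MiB")
```

Under these assumptions the total expert weight memory grows linearly with the device count, while the share held by any single device stays fixed.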

In a Switch Transformer feed-forward network (FFN) layer, each token passes through a router function that sends it to a single FFN, known as an ‘expert.’ Because each token passes through only one FFN, the computation per token does not increase with the number of experts.

“We replace the dense feed-forward network (FFN) layer present in the Transformer with a sparse Switch FFN layer. The layer operates independently on the tokens in the sequence. We diagram two tokens (x1 = ‘More’ and x2 = ‘Parameters’ below) being routed across four FFN experts, where the router independently routes each token. The switch FFN layer returns the output of the selected FFN multiplied by the router gate value,” wrote the Google researchers. The image below illustrates the Switch Transformer encoder block.

Switch Transformer encoder block (Source: arXiv)
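
The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the researchers' implementation: the weights are random placeholders, the token embeddings are stand-ins, the helper names (`make_expert`, `switch_ffn`) are hypothetical, and practical details such as batching and distributing experts across devices are left out.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def make_expert(rng, d_model, d_ff):
    """Build a tiny two-layer ReLU FFN with random placeholder weights."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]

    def ffn(x):
        hidden = [max(0.0, h) for h in matvec(w_in, x)]
        return matvec(w_out, hidden)

    return ffn


def switch_ffn(token, router_weights, experts):
    """Route one token to its single highest-probability expert."""
    gates = softmax(matvec(router_weights, token))
    chosen = max(range(len(experts)), key=lambda i: gates[i])
    expert_out = experts[chosen](token)  # only the selected expert runs
    # Scale the expert output by the router gate value.
    return [gates[chosen] * v for v in expert_out], chosen, gates[chosen]


rng = random.Random(0)
d_model, d_ff, num_experts = 4, 8, 4
router_weights = [[rng.gauss(0, 1) for _ in range(d_model)]
                  for _ in range(num_experts)]
experts = [make_expert(rng, d_model, d_ff) for _ in range(num_experts)]

for name in ("More", "Parameters"):
    token = [rng.gauss(0, 1) for _ in range(d_model)]  # stand-in embedding
    output, chosen, gate = switch_ffn(token, router_weights, experts)
    print(f"{name!r} -> expert {chosen}, gate value {gate:.3f}")
```

Each token is routed on its own, so two tokens in the same sequence may land on different experts, and adding more experts only lengthens the router's list of logits rather than the work done inside the selected FFN.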

Switch Transformer vs Others 

The transformer architecture has become the preferred deep-learning model for NLP research. Much effort has gone into increasing the size of these models, primarily measured in the number of parameters. Other popular language models include BAAI’s Wu Dao 2.0 (1.75 trillion parameters), OpenAI’s GPT-3 (175 billion parameters), Hugging Face’s DistilBERT and Google’s GShard.

Compared to Google’s T5 NLP model, the baseline version of the Switch Transformer reached a target pre-training perplexity in one-seventh of the training time. It also outperformed T5-XXL on the perplexity metric, with comparable or better performance on downstream NLP tasks, despite training on half of the data.

In developing Switch Transformer, the Google team decided to maximise the parameter count while holding the number of floating-point operations (FLOPs) per training example constant and training on a ‘relatively small amount of data.’
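
The trade-off between parameter count and FLOPs can be estimated with a back-of-the-envelope calculation. The sketch below assumes a single routed FFN layer with one expert active per token, counts one multiply and one add per weight used, and relies on hypothetical layer sizes; the function name `switch_layer_cost` is invented for this example.

```python
def switch_layer_cost(num_experts, d_model=768, d_ff=3072):
    """Return (total_params, expert_flops, router_flops) per token for one layer."""
    weights_per_expert = 2 * d_model * d_ff    # two projection matrices
    router_weights = d_model * num_experts     # one logit per expert
    total_params = num_experts * weights_per_expert + router_weights
    expert_flops = 2 * weights_per_expert      # only one expert processes the token
    router_flops = 2 * router_weights          # small cost that grows with experts
    return total_params, expert_flops, router_flops


for n in (1, 8, 64):
    params, expert_flops, router_flops = switch_layer_cost(n)
    print(f"{n:>2} experts: params={params:,} "
          f"expert FLOPs/token={expert_flops:,} router FLOPs/token={router_flops:,}")
```

In this estimate the parameter count rises roughly in step with the number of experts, whereas the expert FLOPs per token do not change; only the comparatively small router cost grows.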

Wrapping up 

Google believes that Switch Transformers are scalable and effective natural language learners. The team said that they simplified MoE to produce an architecture that is easy to understand, stable to train and more sample efficient than ‘equivalently-sized dense models.’ 

“We find that these models excel across a diverse set of natural language tasks and in different training regimes, including pre-training, fine-tuning and multi-task training,” said the Google Brain team.

Amit Raja Naik
Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.
