A Deep Dive into Google's Switch Transformer Architecture

Earlier this year, researchers from Google Brain unveiled Switch Transformers, a natural-language processing (NLP) model with 1.6 trillion parameters that trains up to 7x faster than the T5 NLP model with comparable accuracy. The source code for Switch Transformer is available on GitHub.

In a paper titled ‘Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity,’ the researchers said they simplified the mixture-of-experts (MoE) routing algorithm and designed intuitive, improved models with reduced communication and computational costs. Their proposed training techniques mitigated instabilities and showed that large sparse models could be trained in lower-precision (bfloat16) formats. “Our large sparse models can be distilled into small dense versions while preserving 30 per cent of the sparse model quality gain,” according to the Google researchers.

MoE was first introduced in 1991 by a research group that included deep-learning pioneer Geoffrey Hinton. In 2017, the Google Brain team and Hinton used MoE to create an NLP model based on recurrent neural networks (RNN) with 137 billion parameters, which achieved state-of-the-art (SOTA) results on language-modelling and machine-translation benchmarks.


Some of the key highlights of Switch Transformer include: 

  • The Switch Transformer is based on Google's T5-Base and T5-Large models. Introduced by Google in 2019, T5 is a transformer-based architecture that uses a text-to-text approach. 
  • Like the T5 models, the Switch Transformer runs on hardware such as TPUs and GPUs, which was originally designed for dense matrix multiplication and is widely used in language models. 
  • Switch Transformer models were pretrained utilising 32 TPUs on the Colossal Clean Crawled Corpus, a 750 GB dataset composed of text snippets from Wikipedia, Reddit and others. The models were used to predict missing words in passages where 15 per cent of the words were masked. 
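The masked-word pretraining objective described above can be sketched in plain Python. The whitespace tokenization, mask sentinel, and sample sentence below are illustrative assumptions, not the exact C4 preprocessing pipeline:

```python
import random

MASK_RATE = 0.15    # the paper masks 15 per cent of tokens
MASK_TOKEN = "<M>"  # illustrative sentinel; T5 actually uses numbered sentinels

def mask_tokens(tokens, rate=MASK_RATE, seed=0):
    """Replace roughly `rate` of the tokens with a mask sentinel.

    Returns the corrupted sequence and a dict of position -> original token,
    which is what the model is trained to predict.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < rate:
            corrupted.append(MASK_TOKEN)
            targets[i] = tok  # the model must recover this token
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = "the switch transformer routes each token to one expert".split()
corrupted, targets = mask_tokens(tokens)
```

The model only receives `corrupted` and is scored on how well it recovers the tokens in `targets`.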


The researchers established a distributed training setup for the experiments, splitting the model’s unique weights across different devices. Thus, although the total parameter count grows in proportion to the number of devices, each device’s memory and computational load remains manageable. 
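That sharding idea can be illustrated with a minimal sketch, where each device holds the weights of one distinct expert. The device names, dimensions, and initialisation below are hypothetical, not the paper's actual Mesh-TensorFlow layout:

```python
import random

NUM_DEVICES = 4
HIDDEN, FFN = 8, 32  # tiny illustrative dimensions

def make_expert_weights(seed):
    """Build one HIDDEN x FFN weight matrix for a single expert."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 0.02) for _ in range(FFN)] for _ in range(HIDDEN)]

# One unique expert per device; no device ever stores all the experts,
# so per-device memory is flat while total parameters grow with devices.
shards = {f"device_{d}": make_expert_weights(d) for d in range(NUM_DEVICES)}

per_device_params = HIDDEN * FFN
total_params = per_device_params * NUM_DEVICES
```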

In a Switch Transformer feed-forward network (FFN) layer, each token passes through a router function, which sends it to a single FFN, known as an ‘expert.’ Because each token passes through only one FFN, computation does not increase with the number of experts.

“We replace the dense feed-forward network (FFN) layer present in the Transformer with a sparse Switch FFN layer. The layer operates independently on the tokens in the sequence. We diagram two tokens (x1 = ‘More’ and x2 = ‘Parameters’ below) being routed across four FFN experts, where the router independently routes each token. The switch FFN layer returns the output of the selected FFN multiplied by the router gate value,” wrote Google researchers. The image below illustrates the Switch Transformer encoder block. 

Switch Transformer encoder block (Source: arXiv) 
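The top-1 routing step quoted above can be sketched in pure Python. The toy dimensions, fixed router weights, and scalar "experts" here are illustrative assumptions; in the real model the router and experts are learned jointly:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def switch_ffn(token, router_w, experts):
    """Route one token vector to a single expert (top-1 routing).

    router_w: one weight vector per expert; experts: list of FFN callables.
    Returns (gate * expert(token), chosen_expert_index), as in the Switch
    FFN layer, where the gate is the router's probability for that expert.
    """
    logits = [sum(t * w for t, w in zip(token, wv)) for wv in router_w]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax: one expert only
    gate = probs[best]             # router gate value scales the expert output
    out = experts[best](token)     # only this one FFN is evaluated
    return [gate * o for o in out], best

# Four toy "experts", each just scaling the token by a different factor.
experts = [lambda t, k=k: [k * x for x in t] for k in (1, 2, 3, 4)]
router_w = [[1, 0], [0, 1], [-1, 0], [0, -1]]  # hypothetical router weights
out, chosen = switch_ffn([0.5, 2.0], router_w, experts)
```

Because only the selected expert runs, per-token compute stays constant no matter how many experts exist, which is the property the quote describes.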

Switch Transformer vs Others 

The transformer architecture has become the preferred deep-learning model for NLP research. Many efforts have gone into increasing the size of these models, primarily measured in the number of parameters. BAAI’s 1.75-trillion-parameter Wu Dao 2.0 and OpenAI’s 175-billion-parameter GPT-3, alongside Hugging Face’s DistilBERT and Google’s GShard, are other popular language models. 

Compared to Google’s T5 NLP model, the baseline version of the Switch Transformer achieved a target pre-training perplexity in 1/7th the training time. It also outperformed T5-XXL on the perplexity metric, with comparable or better performance on downstream NLP tasks, despite training on half the data.  

In developing the Switch Transformer, the Google team decided to maximise parameter count while keeping the number of FLOPs per training example constant, training on a ‘relatively small amount of data.’ 
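The "more parameters at constant FLOPs" trade-off can be shown with back-of-the-envelope arithmetic for a single FFN layer. The dimensions and expert count below are hypothetical, not the paper's configurations:

```python
# Toy accounting for one FFN layer, d_model -> d_ff -> d_model.
d_model, d_ff = 512, 2048

dense_params = 2 * d_model * d_ff                # one FFN shared by all tokens
dense_flops_per_token = 2 * 2 * d_model * d_ff   # two matmuls, ~2 FLOPs per multiply-add

num_experts = 64
switch_params = num_experts * dense_params       # parameters scale with experts...
switch_flops_per_token = dense_flops_per_token   # ...but each token still runs one expert
```

So a 64-expert layer holds 64x the parameters of its dense counterpart while each training example costs the same number of FLOPs, which is exactly the axis the Google team chose to scale.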

Wrapping up 

Google believes that Switch Transformers are scalable and effective natural language learners. The team said that they simplified MoE to produce an architecture that is easy to understand, stable to train and more sample efficient than ‘equivalently-sized dense models.’ 

“We find that these models excel across a diverse set of natural language tasks and in different training regimes, including pre-training, fine-tuning and multi-task training,” the Google Brain team wrote. 

Amit Raja Naik
Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.
