A Beginner’s Guide To TPUs

Google introduced Tensor Processing Units (TPUs) in 2016. Unlike GPUs, TPUs were custom-designed for the operations that dominate neural network training, such as matrix multiplication. Google TPUs can be accessed in two forms: Cloud TPU and Edge TPU. Cloud TPUs can be accessed from a Google Colab notebook, which provides users with TPUs that sit in Google’s data centres. The Edge TPU, by contrast, is available as a custom-built development kit that can be used to build specific applications. In the next section, we will look at how TPUs work and at their key components.

Key Components of TPUs

Before going into how TPUs work, here is some related vocabulary:


Tensors are multi-dimensional arrays; a matrix is the two-dimensional case. They are the fundamental units that hold data points, such as the weights of a node in a neural network, in a row and column format. Basic math operations are performed on tensors, including addition, element-wise multiplication, and matrix multiplication.
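To make the three operations concrete, here is a minimal sketch in plain Python that treats a 2-D tensor as a list of rows. The helper names (add, elementwise_mul, matmul) and the sample values are illustrative assumptions, not part of any TPU API.

```python
# A 2-D tensor sketched as a nested list: one inner list per row.
# Helper names are hypothetical and chosen only for this illustration.

def add(a, b):
    """Element-wise addition of two same-shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def elementwise_mul(a, b):
    """Element-wise multiplication of two same-shaped matrices."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def matmul(a, b):
    """Matrix product: entry (i, j) is the dot product of row i of a and column j of b."""
    cols_b = list(zip(*b))  # transpose b so its columns are easy to iterate
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

weights = [[1, 2], [3, 4]]
inputs = [[5, 6], [7, 8]]

print(add(weights, inputs))              # [[6, 8], [10, 12]]
print(elementwise_mul(weights, inputs))  # [[5, 12], [21, 32]]
print(matmul(weights, inputs))           # [[19, 22], [43, 50]]
```

Matrix multiplication is the only one of the three that mixes rows with columns, which is why it dominates the cost and is the operation TPUs are built around.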


FLOPS (floating-point operations per second) is a unit of measure of computational performance. The custom floating-point format used by Google TPUs is called “Brain Floating Point Format,” or “bfloat16” for short. bfloat16 values are used within the systolic arrays to accelerate neural network training. The higher the FLOPS, the greater the processing power.
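As a rough illustration of the format, bfloat16 keeps the sign bit, the 8 exponent bits and the top 7 fraction bits of a 32-bit float. The sketch below emulates it in plain Python by zeroing the lower 16 bits. The function name to_bfloat16 is hypothetical, and simple truncation is an assumption made for brevity; real hardware typically rounds rather than truncates.

```python
import struct

def to_bfloat16(x: float) -> float:
    """Approximate x in bfloat16 by keeping only the top 16 bits of its float32 form.

    Assumption: plain truncation (round toward zero), chosen for simplicity.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # float32 bit pattern
    bits &= 0xFFFF0000                                   # drop the low 16 fraction bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(to_bfloat16(1.0))      # 1.0
print(to_bfloat16(3.14159))  # 3.140625
print(to_bfloat16(1e38))     # still finite: the exponent range matches float32
```

The second line shows the trade-off: only two or three decimal digits survive, but very large and very small magnitudes remain representable.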

Systolic array

Figure: systolic array (via H. T. Kung, CMU)

A systolic array is a network of processors that are responsible for performing computations and passing the results across the system. It consists of a large number of processing elements (PEs) that are arranged in arrays, as illustrated above. These arrays have a high degree of parallelism and are favourable for parallel computing.

How TPUs Work

Animation: math operation (via Google Cloud docs)

The Tensor Processing Unit (TPU) is a custom ASIC built specifically for machine learning and tailored for TensorFlow. It can handle the massive multiplications and additions of neural networks at great speed while reducing power consumption and floor space.

TPUs execute 3 main steps:

  1. First, the parameters are loaded from memory into the matrix of multipliers and adders.
  2. Then, data is loaded from memory. 
  3. After every multiplication, the result is passed on to the next multiplier and summed at the same time, which builds up a dot product. This can be seen in the animation above. The output is the sum of all the products of data and parameters.
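The three steps above can be sketched for a single input vector as follows. This is a simplified model, not cycle-accurate: it ignores the timing skew of a real systolic array, and the function name systolic_matvec and the sample values are assumptions made for illustration.

```python
def systolic_matvec(params, data):
    """Sketch of the three steps for one input vector.

    params: K x N matrix of weights, one weight per multiply-add cell.
    data:   length-K input vector.
    Returns the N dot products of data with each column of params.
    """
    # Step 1: load the parameters into the grid of multipliers and adders.
    grid = [row[:] for row in params]

    # Step 2: load the data; element k is fed to every cell in row k.
    inputs = list(data)

    # Step 3: each cell multiplies its weight by its input and adds the
    # partial sum handed over by the previous cell in the same column,
    # so intermediate results are passed along rather than stored.
    n_cols = len(grid[0])
    partial = [0] * n_cols
    for k, row in enumerate(grid):
        partial = [partial[j] + inputs[k] * row[j] for j in range(n_cols)]
    return partial

params = [[1, 2],
          [3, 4],
          [5, 6]]
data = [1, 0, 2]
print(systolic_matvec(params, data))  # [11, 14]
```

In hardware every cell works at the same time on different inputs; the loop here only replays the same arithmetic in sequence.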

A typical cloud TPU has two systolic arrays of size 128 x 128, aggregating 32,768 ALUs (Arithmetic Logic Units) for 16-bit floating-point values in a single processor. Thousands of multipliers and adders are connected to each other directly to form a large physical matrix of operators, which forms a systolic array architecture as discussed above. 
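The ALU figure follows directly from the array dimensions quoted above. A quick check, with constant names chosen only for this sketch:

```python
# Figures taken from the paragraph above; the constant names are illustrative.
ARRAYS_PER_PROCESSOR = 2
ARRAY_ROWS = 128
ARRAY_COLS = 128

alus_per_array = ARRAY_ROWS * ARRAY_COLS            # 16384 cells in one array
total_alus = ARRAYS_PER_PROCESSOR * alus_per_array  # both arrays together

print(total_alus)  # 32768
```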

The TPU is designed to tolerate reduced computational precision, which means it requires fewer transistors per operation. Because of this, a single chip can handle relatively more operations per second.

Since TPUs are custom-built for operations such as matrix multiplication and for accelerating training, they might not be suitable for other kinds of workloads.

Limitations of Cloud TPUs:

  • Workloads that are not based on matrix multiplication are unlikely to perform well on TPUs
  • If a workload requires high-precision arithmetic, then TPUs are not the best choice
  • Neural network workloads that contain custom TensorFlow operations written in C++ are not suitable
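The high-precision limitation can be illustrated with a small sketch that emulates a 16-bit bfloat16-style value in plain Python. The helper to_bfloat16 is hypothetical and uses simple truncation as an assumption; it is meant only to show how nearby values collapse when few fraction bits are kept.

```python
import struct

def to_bfloat16(x: float) -> float:
    """Keep only the top 16 bits of the float32 form of x (truncation, an assumption)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

# With only 7 fraction bits, values that differ slightly become identical.
print(to_bfloat16(1.001))  # 1.0
print(to_bfloat16(257.0))  # 256.0
```

A neural network usually tolerates this kind of loss, whereas a workload that depends on the difference between 1.0 and 1.001 does not.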

Where Are They Used

TPUs were used in DeepMind’s famous AlphaGo, whose algorithms beat Lee Sedol, one of the world’s best Go players. They were also used in the AlphaZero system, which produced Chess, Shogi and Go playing programs. Google has also used TPUs for its Street View text processing services and was able to find all the text in the Street View database in less than five days. In the case of Google Photos, TPUs make it possible to process over 100 million photos a day. Most importantly, TPUs were also used for the brains behind Google’s search results: RankBrain.


Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
