
A Beginner’s Guide To TPUs


Google introduced Tensor Processing Units (TPUs) in 2016. Unlike GPUs, TPUs were custom-designed to handle operations such as the matrix multiplications at the heart of neural network training. Google TPUs come in two forms: Cloud TPU and Edge TPU. Cloud TPUs can be accessed from a Google Colab notebook, which gives users access to TPU pods that sit in Google's data centres, while the Edge TPU is a custom-built development kit for building specific applications. In the next sections, we will look at the key components of TPUs and how they work.
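Getting a handle on a Cloud TPU from Colab looks roughly like the following. This is a minimal sketch assuming a TensorFlow 2.x Colab runtime with the accelerator type set to TPU; the exact initialization API has shifted between TensorFlow versions.

```python
import tensorflow as tf

# Assumes a Colab notebook whose runtime type is set to "TPU".
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# The strategy object distributes computation across the TPU's cores.
strategy = tf.distribute.TPUStrategy(resolver)
print('TPU cores available:', strategy.num_replicas_in_sync)
```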

Key Components of TPUs

Before going into the working of TPUs, here is some vocabulary related to them:

Tensor

Tensors are multi-dimensional arrays or matrices. They are the fundamental units that hold data points, such as the weights of a node in a neural network, in a row-and-column format. Basic math operations are performed on tensors, including addition, element-wise multiplication, and matrix multiplication.
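For instance, these operations are one-liners in TensorFlow (a minimal illustration; NumPy or PyTorch would read almost identically):

```python
import tensorflow as tf

a = tf.constant([[1., 2.], [3., 4.]])  # a rank-2 tensor (2 x 2 matrix)
b = tf.constant([[5., 6.], [7., 8.]])

print(a + b)   # element-wise addition
print(a * b)   # element-wise (Hadamard) multiplication
print(a @ b)   # matrix multiplication, the operation TPUs accelerate
```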

bfloat16

FLOPS (floating point operations per second) is a unit of measure for the performance of a computational operation; the higher the FLOPS, the higher the processing power. The custom floating-point format, in the case of Google TPUs, is called the “Brain Floating Point Format,” or “bfloat16” for short. bfloat16 values are carefully placed within the systolic arrays to accelerate neural network training.
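To see what the reduced precision means in practice: bfloat16 keeps float32's 8-bit exponent (preserving dynamic range) but truncates the mantissa to 7 bits, so values are rounded more coarsely. A small sketch:

```python
import tensorflow as tf

x = tf.constant(3.1415927, dtype=tf.float32)
b = tf.cast(x, tf.bfloat16)        # same exponent range, 7-bit mantissa

print(b)                           # ~3.140625: coarser, but fine for training
print(tf.cast(b, tf.float32) - x)  # the rounding error bfloat16 introduces
```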

Systolic array

(Illustration of a systolic array via H. T. Kung, CMU)

A systolic array is a network of processors that perform computations and pass the results across the system. It consists of a large number of processing elements (PEs) arranged in a grid, as illustrated above. These arrays offer a high degree of parallelism and are therefore favourable for parallel computing.
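To make the idea concrete, here is a toy, non-pipelined simulation of a weight-stationary systolic array in NumPy. Each conceptual PE holds one weight, activations stream in from one edge, and partial sums are passed along and accumulated; a real array keeps many rows in flight at once with skewed timing, which this sketch ignores.

```python
import numpy as np

def systolic_matmul(A, B):
    # Each PE in row p, column j permanently holds the weight B[p, j].
    # A row of activations streams through; partial sums flow "down"
    # the columns, and finished dot products exit at the bottom edge.
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i in range(n):                 # stream one row of A through the array
        psum = np.zeros(m)             # partial sums entering the top row
        for p in range(k):             # PE row p: multiply and pass down
            psum += A[i, p] * B[p]     # one multiply-accumulate per column
        C[i] = psum                    # results emerge at the bottom edge
    return C

A = np.arange(6, dtype=float).reshape(2, 3)
B = np.ones((3, 2))
assert np.allclose(systolic_matmul(A, B), A @ B)
```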

How TPUs Work

(Animation of the matrix multiply operation via Google Cloud docs)

A Tensor Processing Unit (TPU) is a custom ASIC built specifically for machine learning and tailored for TensorFlow. It can handle the massive multiplications and additions that neural networks require at great speed, while reducing power consumption and floor space.

TPUs execute 3 main steps:

  1. First, the parameters are loaded from memory into the matrix of multipliers and adders.
  2. Then, the data is loaded from memory.
  3. After every multiplication, the result is passed on to the next multiplier while a summation (a dot product) is taken at the same time, as the animation referenced above shows. The output is the summation of all the multiplication results between the data and parameters (see the sketch after this list).
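In TensorFlow, all three steps hide behind a single matmul call. A minimal sketch (it runs on CPU or GPU as well): asking XLA to compile the function is what lets a TPU runtime lower the operation onto the matrix unit.

```python
import tensorflow as tf

x = tf.random.normal([128, 128])   # the "data"
w = tf.random.normal([128, 128])   # the "parameters"

@tf.function(jit_compile=True)     # XLA-compile; on a TPU this maps to the MXU
def matmul_step(x, w):
    return tf.matmul(x, w)         # multiply-and-accumulate across the array

print(matmul_step(x, w).shape)     # (128, 128)
```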

A typical Cloud TPU has two systolic arrays of size 128 x 128, aggregating 32,768 ALUs (Arithmetic Logic Units; 2 × 128 × 128 = 32,768) for 16-bit floating-point values in a single processor. Thousands of multipliers and adders are connected to each other directly to form a large physical matrix of operators, the systolic array architecture discussed above.

The TPU's tolerance for reduced computational precision means it requires fewer transistors per operation. Because of this, a single chip can handle relatively more operations per second.

Since TPUs are custom-built for handling operations such as matrix multiplication and for accelerating training, they might not be suitable for other kinds of workloads.

Limitations of Cloud TPUs:

  • Workloads that are not dominated by matrix multiplication are unlikely to perform well on TPUs
  • If a workload requires high-precision arithmetic, TPUs are not the best choice
  • Neural network workloads that contain custom TensorFlow operations written in C++ are not suitable

Where Are They Used

TPUs were used in DeepMind's famous AlphaGo, which beat the world's best Go player, Lee Sedol. They were also used in the AlphaZero system, which produced chess, shogi and Go playing programs. Google has also used TPUs for its Street View text processing and was able to find all the text in the Street View database in less than five days. In Google Photos, TPUs make it possible to process over 100 million photos a day. Most importantly, TPUs also power RankBrain, the brains behind Google's search results.

