Google introduced Tensor Processing Units (TPUs) in 2016. Unlike GPUs, TPUs were custom-designed to handle operations such as matrix multiplication in neural network training. Google TPUs can be accessed in two forms: Cloud TPU and Edge TPU. Cloud TPUs can be accessed from a Google Colab notebook, which gives users access to TPUs hosted in Google’s data centres. The Edge TPU, by contrast, is available as a custom-built development kit that can be used to build specific applications. In the next section, we will look at how TPUs work and at their key components.
Key Components of TPUs
Before going into how TPUs work, here is some related vocabulary:
Tensors are multi-dimensional arrays or matrices. They are the fundamental units that hold data points, such as the weights of a node in a neural network, in a row-and-column format. Basic math operations are performed on tensors, including addition, element-wise multiplication, and matrix multiplication.
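To make the three operations concrete, here is a minimal sketch in plain Python that treats a two-dimensional tensor as a list of rows. The function names and the sample values are illustrative assumptions, not part of any TPU or TensorFlow API.

```python
# A 2-D tensor (matrix) is represented here as a list of rows.
# All names below are hypothetical helpers for illustration only.

def add(a, b):
    """Element-wise addition of two equally shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def elementwise_mul(a, b):
    """Element-wise (Hadamard) product of two equally shaped matrices."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def matmul(a, b):
    """Matrix product: each output entry is a row of a dotted with a column of b."""
    cols_b = list(zip(*b))  # transpose b so its columns can be iterated
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

weights = [[1.0, 2.0], [3.0, 4.0]]
inputs = [[5.0, 6.0], [7.0, 8.0]]

print(add(weights, inputs))              # [[6.0, 8.0], [10.0, 12.0]]
print(elementwise_mul(weights, inputs))  # [[5.0, 12.0], [21.0, 32.0]]
print(matmul(weights, inputs))           # [[19.0, 22.0], [43.0, 50.0]]
```

Matrix multiplication is the operation that dominates neural network workloads, which is why it is the one TPUs are built around.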
FLOPS (floating-point operations per second) is a unit that measures computational performance. The custom floating-point format used by Google TPUs is called “Brain Floating Point Format,” or “bfloat16” for short. bfloat16 is used within the systolic arrays to accelerate neural network training. The higher the FLOPS, the higher the processing power.
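As a rough illustration of what bfloat16 stores, the sketch below keeps only the upper 16 bits of a 32-bit float: the sign bit, the 8 exponent bits, and the top 7 mantissa bits. It assumes simple truncation for clarity; real hardware and libraries typically round rather than truncate, and the helper names are hypothetical.

```python
import struct

def to_bfloat16_bits(x):
    """Return the 16-bit bfloat16 pattern for x by truncating its float32 encoding.

    Assumption: plain truncation, not the rounding that real implementations use.
    """
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16  # keep sign, 8 exponent bits, top 7 mantissa bits

def from_bfloat16_bits(bits16):
    """Expand a bfloat16 bit pattern back to a Python float."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value

def round_trip(x):
    """Show the value that survives after storing x as bfloat16."""
    return from_bfloat16_bits(to_bfloat16_bits(x))

print(hex(to_bfloat16_bits(1.0)))  # 0x3f80
print(round_trip(3.14159))         # 3.140625 (precision is lost)
print(round_trip(1e38))            # still finite: the float32 exponent range is kept
```

The trade-off is visible in the output: bfloat16 keeps the wide range of a 32-bit float but only about two to three decimal digits of precision.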
A systolic array is a network of processors that are responsible for performing computations and passing the results across the system. It consists of a large number of processing elements (PEs) that are arranged in arrays, as illustrated above. These arrays have a high degree of parallelism and are favourable for parallel computing.
How TPUs Work
The Tensor Processing Unit (TPU) is a custom ASIC built specifically for machine learning and tailored for TensorFlow. It can handle massive numbers of multiplications and additions for neural networks at great speed, while reducing power consumption and floor space.
TPUs execute 3 main steps:
- First, the parameters are loaded from memory into the matrix of multipliers and adders.
- Then, data is loaded from memory.
- After every multiplication operation, the results are passed on to the next multipliers while the summation (dot product) is taken at the same time, as can be seen in the above animation. The output is then given as the summation of all multiplication results between data and parameters.
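The three steps above can be sketched as a small simulation. This is a simplified model under stated assumptions, not a cycle-accurate description of TPU hardware: each processing element holds one parameter, and a partial sum is handed from one element to the next down a column. The class and function names are hypothetical.

```python
class ProcessingElement:
    """One multiplier-adder cell that holds a single parameter in place."""

    def __init__(self, weight):
        self.weight = weight

    def step(self, activation, partial_sum_in):
        # Multiply the incoming data by the stored parameter and add it to
        # the partial sum received from the previous element.
        return partial_sum_in + activation * self.weight

def load_parameters(weights):
    """Step 1: load parameters into the grid of multipliers and adders."""
    return [[ProcessingElement(w) for w in row] for row in weights]

def run_systolic(grid, data):
    """Steps 2 and 3: feed the data in and pass partial sums along each column."""
    outputs = []
    for col in range(len(grid[0])):
        partial = 0.0
        for row, activation in enumerate(data):
            partial = grid[row][col].step(activation, partial)
        outputs.append(partial)  # the dot product of data and this column
    return outputs

weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 inputs, 2 outputs
data = [1.0, 0.5, 2.0]

grid = load_parameters(weights)
print(run_systolic(grid, data))  # [12.5, 16.0]
```

In this model the intermediate results never return to memory between multiplications; each one moves directly to the neighbouring element.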
A typical cloud TPU has two systolic arrays of size 128 x 128, aggregating 32,768 ALUs (Arithmetic Logic Units) for 16-bit floating-point values in a single processor. Thousands of multipliers and adders are connected to each other directly to form a large physical matrix of operators, which forms a systolic array architecture as discussed above.
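The ALU count follows directly from the array dimensions, and it also shows how array size relates to FLOPS. The sketch below assumes each ALU completes one multiply and one add per clock cycle, and the clock rate passed in is a hypothetical placeholder, not a published specification.

```python
ARRAYS_PER_PROCESSOR = 2
ARRAY_SIDE = 128

# Two 128 x 128 arrays of multiplier-adders.
alu_count = ARRAYS_PER_PROCESSOR * ARRAY_SIDE * ARRAY_SIDE
print(alu_count)  # 32768

def peak_flops(alus, clock_hz):
    """Rough upper bound on FLOPS.

    Assumption: one multiply plus one add (2 operations) per ALU per cycle.
    """
    return 2 * alus * clock_hz

# HYPOTHETICAL clock rate of 1 GHz, chosen only to make the arithmetic concrete.
print(peak_flops(alu_count, 1_000_000_000))  # 65536000000000
```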
The TPU design makes the chip more tolerant of reduced computational precision, which means it requires fewer transistors per operation. As a result, a single chip can handle relatively more operations per second.
Since TPUs are custom-built for operations such as matrix multiplication and for accelerating training, they might not be suitable for other kinds of workloads.
Limitations of Cloud TPUs:
- Workloads that are not based on matrix multiplication are unlikely to perform well on TPUs
- If a workload requires high-precision arithmetic, then TPUs are not the best choice
- Neural network workloads that contain custom TensorFlow operations written in C++ are not suitable for TPUs
Where Are They Used
TPUs were used in DeepMind’s famous AlphaGo, which beat Lee Sedol, one of the world’s best Go players. They were also used in the AlphaZero system, which produced Chess, Shogi and Go playing programs. Google has also used TPUs for its Street View text processing services and was able to find all the text in the Street View database in less than five days. In the case of Google Photos, TPUs now make it possible to process over 100 million photos a day. Most importantly, TPUs were also used for RankBrain, the brains behind Google’s search results.