# A Beginner’s Guide To TPUs

Google introduced Tensor Processing Units, or TPUs, in 2016. Unlike GPUs, TPUs were custom-designed to handle operations such as the matrix multiplications at the heart of neural network training. Google TPUs come in two forms: Cloud TPU and Edge TPU. Cloud TPUs can be accessed from a Google Colab notebook, which gives users access to TPU pods that sit in Google’s data centres. Edge TPU, on the other hand, is a custom-built development kit for building specific applications. In the next sections, we will see how TPUs work and their key components.

### Key Components of TPUs

Before going into how TPUs work, here is some vocabulary related to them:

Tensor

Tensors are multi-dimensional arrays or matrices. They are the fundamental units that hold data points, such as the weights of a node in a neural network, in row and column format. Basic math operations are performed on tensors, including addition, element-wise multiplication, and matrix multiplication.
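The three operations just listed can be sketched with NumPy (chosen here purely for illustration; on a TPU these same operations are issued through a framework such as TensorFlow):

```python
import numpy as np

# A tensor is just a multi-dimensional array; here, 2-D (a matrix).
a = np.array([[1., 2.], [3., 4.]])
b = np.array([[5., 6.], [7., 8.]])

print(a + b)   # element-wise addition
print(a * b)   # element-wise (Hadamard) multiplication
print(a @ b)   # matrix multiplication: [[19, 22], [43, 50]]
```

The matrix multiplication on the last line is the operation the TPU’s hardware is built around.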

bfloat16

The custom floating-point format used by Google TPUs is called the “Brain Floating Point Format,” or “bfloat16” for short. bfloat16 keeps the same 8-bit exponent as a standard 32-bit float but uses a shorter 7-bit mantissa, so it covers the same dynamic range of values in half the storage, making it well suited to the systolic arrays that accelerate neural network training. The performance of such hardware is measured in FLOPS (floating-point operations per second): the higher the FLOPS, the higher the processing power.
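The trade-off bfloat16 makes can be demonstrated with a small sketch that truncates a 32-bit float to its top 16 bits (a simplification: real bfloat16 hardware rounds to nearest rather than truncating):

```python
import numpy as np

def to_bfloat16(x):
    """Simulate bfloat16 by keeping only the top 16 bits of a float32:
    1 sign bit, 8 exponent bits, 7 mantissa bits (truncation, for clarity)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

print(to_bfloat16(np.float32(3.14159265)))  # 3.140625: precision drops
print(to_bfloat16(1e38))                    # still finite: float32's range survives
```

Precision is lost (pi becomes 3.140625), but very large values do not overflow, because the exponent field is untouched. That is exactly the compromise neural network training tolerates well.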

Systolic array

A systolic array is a network of processors responsible for performing computations and passing results across the system. It consists of a large number of processing elements (PEs) arranged in a grid. These arrays have a high degree of parallelism and are well suited to parallel computing.

### How TPUs Work

The Tensor Processing Unit (TPU) is a custom ASIC built specifically for machine learning and tailored for TensorFlow. It can handle the massive multiplications and additions that neural networks require at great speed, while using far less power and floor space.

TPUs execute 3 main steps:

1. First, the parameters are loaded from memory into the matrix of multipliers and adders.
2. Then, data is loaded from memory.
3. After every multiplication, the result is passed on to the next multipliers while a running summation (dot product) is taken at the same time. The output is then the summation of all multiplication results between data and parameters.
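The three steps above can be sketched as a toy simulation (an illustrative model only: the loop structure mirrors the multiply-accumulate dataflow, but in real hardware every PE fires in parallel on each clock cycle):

```python
import numpy as np

def systolic_matmul(weights, data):
    """Toy model of a TPU-style matrix unit: weights stay resident in the
    PE grid, data streams through, and each PE performs one
    multiply-accumulate, passing its partial sum onward."""
    n, m = weights.shape          # grid of n x m processing elements
    m2, k = data.shape
    assert m == m2, "inner dimensions must match"
    out = np.zeros((n, k))
    for col in range(k):          # each column of data streams through
        for i in range(n):
            acc = 0.0
            for j in range(m):
                acc += weights[i, j] * data[j, col]  # one MAC per PE
            out[i, col] = acc     # summation of all products emerges at the edge
    return out

W = np.array([[1., 2.], [3., 4.]])   # step 1: parameters loaded into the grid
X = np.array([[5., 6.], [7., 8.]])   # step 2: data loaded from memory
print(systolic_matmul(W, X))         # step 3: matches W @ X
```

The key point the simulation captures is that no intermediate result ever goes back to memory; partial sums flow directly from one operator to the next.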

A typical cloud TPU has two systolic arrays of size 128 x 128, aggregating 32,768 ALUs (Arithmetic Logic Units) for 16-bit floating-point values in a single processor. Thousands of multipliers and adders are connected to each other directly to form a large physical matrix of operators, which forms a systolic array architecture as discussed above.
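The ALU count quoted above follows directly from the array dimensions:

```python
# Two systolic arrays, each a 128 x 128 grid of multiply-add units
arrays, rows, cols = 2, 128, 128
alus = arrays * rows * cols
print(alus)  # 32768
```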

The TPU’s tolerance for reduced computational precision means it requires fewer transistors per operation. Because of this, a single chip can handle relatively more operations per second.

Since the TPUs are custom built for handling operations such as matrix multiplications and accelerating the training, TPUs might not be suitable for handling other kinds of workloads.

Limitations of Cloud TPUs:

• Workloads that are not dominated by matrix multiplication are unlikely to perform well on TPUs
• If a workload requires high-precision arithmetic, TPUs are not the best choice
• Neural network workloads that contain custom TensorFlow operations written in C++ are not suitable

### Where Are They Used

TPUs were used in DeepMind’s famous AlphaGo, whose algorithms beat the world’s best Go player, Lee Sedol. They were also used in the AlphaZero system, which produced Chess, Shogi and Go playing programs. Google has also used TPUs for its Street View text processing, finding all the text in the Street View database in less than five days. In Google Photos, TPUs make it possible to process over 100 million photos a day. Most importantly, TPUs also power the brains behind Google’s search results: RankBrain.

To know what makes TPUs successful, read this.

I have a master's degree in Robotics and I write about machine learning advancements.
