CUDA, NVIDIA’s parallel computing platform for general-purpose computing, is the leading proprietary framework for GPGPU. An acronym for Compute Unified Device Architecture, CUDA is a software layer that gives direct access to the GPU’s virtual instruction set and parallel computational elements for executing compute kernels. The platform is widely used in bioinformatics, life sciences, computer vision, electrodynamics, computational chemistry, finance, medical imaging and more. Stephen Jones, CUDA Architect at NVIDIA, spoke about CUDA at the GTC 2022 conference. Analytics India Magazine has highlighted the key points from the talk, covering language and toolkit additions, the latest developments and the future of CUDA.
The Three Eras of Computing:
First era: Single core programming
During the first era, programming meant taking a problem, breaking it down into a series of logical steps and writing straight-line code that would run on a processor as fast as possible. The era was defined by Dennard scaling: Robert Dennard observed that as transistors get smaller, they can be run faster without increasing power. But quantum effects eventually broke this scaling, leaving designers with only Moore’s law. As a result, chips kept gaining transistors but stopped getting faster.
Second era: Parallel programming
The second era of programming realised that all programs have to target multiple threads, breaking data down into separate elements so they can be processed independently and at the same time: data parallelism. The solution moved from straight-line code to task parallelism based on asynchronous execution. Eventually, this led to computing at data centre scale.
Third era: Locality
The third era, which CUDA is now entering, focuses on locality. Locality-aware computing cares about where data and computation are placed. Parallel computing is built on a hierarchy of frameworks, libraries and runtimes, where each level of the hierarchy zooms further into the system. This lets programmers choose the level at which they scale. Scaling comes from combining data parallelism with locality of data.
At the backend: Hopper architecture
The NVIDIA Hopper architecture is at the backend of the CUDA platform. It extends MIG capabilities by up to 7x over the previous generation by offering secure multi-tenant configurations in cloud environments across each GPU instance. In addition, it introduces features to improve asynchronous execution, allowing memory copies to overlap with computation while minimising synchronisation points. Hopper also mitigates the long training times of giant models while sustaining GPU performance.
CUDA works on a grid system; image processing is a good illustration. As Stephen demonstrated, an image is fragmented into a grid of blocks, and each block runs on the GPU as a completely separate program. Depending on the parallelism available, a GPU can run thousands of such blocks at once. Each block consists of threads and works independently on its fragment of the problem.
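A minimal sketch of this grid-of-blocks model, assuming a simple greyscale image inversion as the workload (the kernel, image size and tile size here are illustrative, not from the talk):

```cuda
#include <cuda_runtime.h>

// Each thread handles one pixel; the image is tiled into 16x16-thread
// blocks, and the GPU schedules every block as an independent program.
__global__ void invert(unsigned char *img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];
}

int main() {
    const int w = 1024, h = 768;
    unsigned char *img;
    cudaMallocManaged(&img, w * h);           // image visible to CPU and GPU
    dim3 block(16, 16);                       // 256 threads per block
    dim3 grid((w + 15) / 16, (h + 15) / 16);  // one block per 16x16 tile
    invert<<<grid, block>>>(img, w, h);
    cudaDeviceSynchronize();
    cudaFree(img);
    return 0;
}
```

Each of the `grid.x * grid.y` blocks works only on its own tile, which is exactly the independence the scheduler exploits.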
Thread block cluster
Building on this architecture, the team has leveraged CUDA’s massive scale and added a new tier of hierarchy called the thread block cluster. Described as Block.2, it is essentially a block of blocks. Stephen called it a new way of thinking about blocks: in addition to individual blocks, programmers can now target locality for enhanced performance.
Here, the blocks within a thread block cluster are co-resident on a GPU processing cluster (GPC), so a cluster captures a larger slice of the parallelism with correspondingly more performance. Adding the cluster to the execution hierarchy allows an application to take advantage of faster local synchronisation and faster memory sharing. Blocks in a cluster work together while still running concurrently. This also lets researchers move towards annotating kernels with the cluster size that makes sense for their application.
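The kernel annotation might look like the following sketch, which assumes CUDA 12+ and a Hopper (sm_90) GPU; the 2x1x1 cluster size and the kernel body are illustrative:

```cuda
#include <cuda_runtime.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// __cluster_dims__ asks the scheduler to place every 2x1x1 group of
// blocks on the same GPC, so those blocks can synchronise locally.
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(float *data) {
    cg::cluster_group cluster = cg::this_cluster();

    // ... each block works on its own fragment of data ...

    // Barrier across the blocks of this cluster only, which is far
    // cheaper than any grid-wide synchronisation.
    cluster.sync();
}
```

The annotation is the "target the size that makes sense" idea: the rest of the grid is unaffected, and only co-resident blocks pay for (and benefit from) the local synchronisation.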
Distributed shared memory
The cluster has all the properties of a CUDA thread block but is bigger. Additionally, every block in a cluster can read and write the shared memory of every other block.
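A hedged sketch of that cross-block access, using the cooperative groups cluster API from CUDA 12+ on sm_90; the buffer contents and the two-block cluster are made up for illustration:

```cuda
#include <cuda_runtime.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two blocks per cluster; each block reads the other block's
// shared memory directly (distributed shared memory).
__global__ void __cluster_dims__(2, 1, 1) exchange(int *out) {
    __shared__ int buf[32];
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x < 32) buf[threadIdx.x] = blockIdx.x;  // fill our buffer
    cluster.sync();  // make the writes visible cluster-wide

    // Map a pointer into the peer block's shared memory and read it.
    unsigned peer = cluster.block_rank() ^ 1;
    int *remote = cluster.map_shared_rank(buf, peer);
    if (threadIdx.x == 0) out[cluster.block_rank()] = remote[0];
    cluster.sync();  // keep buf alive until the peer has finished reading
}
```

`map_shared_rank` is what turns "every block can read and write the shared memory of every other block" into an ordinary pointer dereference.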
Asynchronous one-sided memory copy
The NVIDIA team has invested heavily in asynchronous data movement, which allows programmers to start a copy and pick up the result later – a huge step towards keeping all dependencies local. It also frees the threads to do other work in the meantime. This is enabled by the split barrier, which separates arriving at the barrier from waiting on it, so threads can sleep purely waiting for data to arrive. These transactions, called ‘self-synchronising writes’, let the sender and recipient know that the data is ready without needing any handshake. This one-sided memory copy is seven times faster than normal communication, since it is just a single write operation. It can also be used within local shared memory to make data movement faster and easier.
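The split-barrier pattern is exposed in libcu++ as `cuda::barrier` together with `cuda::memcpy_async`. A minimal sketch, assuming CUDA 11.7+ (the tile size and the doubling computation are illustrative):

```cuda
#include <cuda/barrier>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Stage a tile of global memory into shared memory asynchronously.
// The barrier tracks the copy, so threads can keep working and only
// block (or sleep) when the data is actually needed.
__global__ void stage(const float *global_in, float *out, int n) {
    __shared__ float tile[256];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    auto block = cg::this_thread_block();

    if (block.thread_rank() == 0)
        init(&bar, block.size());   // one arrival expected per thread
    block.sync();

    // Kick off the async copy; completion is signalled on the barrier.
    cuda::memcpy_async(block, tile, global_in, sizeof(tile), bar);

    // ... threads are free to do unrelated work here ...

    bar.arrive_and_wait();          // wait purely for the data to land
    int i = block.thread_rank();
    if (i < n) out[i] = tile[i] * 2.0f;
}
```

Splitting "arrive" from "wait" is what lets a thread declare its dependency early and defer the actual stall to the last possible moment.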
TMA Unit in Hopper
Stephen explained that all of this is made possible by the Tensor Memory Accelerator (TMA) unit inside Hopper. It takes the original async memory copy, makes it bi-directional and lets it work between clusters, and it enables one-sided data transfers as well. For instance, in neighbour-affecting algorithms the problem is non-local; self-synchronising transactions enable 7x faster halo exchange between blocks in the same cluster. The TMA is a self-contained data-movement engine, a separate hardware unit inside the SM that runs independently of the threads. It takes over all the address calculations, allowing even a single thread to initiate a copy of the entire shared memory. The transaction barrier waits for the data to arrive without the threads needing to synchronise with each other. The accelerator can work on data with up to five dimensions.