“GPUs can be incredibly challenging to optimise for locality and parallelism, especially for computations that cannot be efficiently implemented using a combination of pre-existing optimized primitives.”
OpenAI
At bare minimum, a deep neural network is a bunch of mathematical operations (like addition and multiplication) performed thousands of times every millisecond to find patterns in the input data. The power of deep neural networks (DNNs) comes from their hierarchical structure and the sequential composition of parametric (e.g., convolutional) and non-parametric layers. The highly parallelisable nature of these models was exploited by graphics processing units (GPUs), which were originally designed for PC games to run realistic physics simulations (think: the movement of leaves in the wind).
NVIDIA launched its first GPU in 1999. Gradually, researchers began to recognise the superior floating-point performance of these GPUs for general-purpose computing and started to exploit it aggressively. In 2003, a team of researchers unveiled Brook, the first widely adopted programming model to extend C with data-parallel constructs. NVIDIA followed with CUDA in 2006, billed as the world’s first solution for general-purpose computing on GPUs.
GPUs quickly became popular as DNNs set new benchmarks almost every year, especially after the 2012 ImageNet breakthrough. GPUs owe much of this popularity to frameworks for general-purpose GPU computing, such as CUDA and OpenCL, which have made the development of high-performance programs easier. NVIDIA’s CUDA is a parallel computing platform and programming model for general computing on GPUs. With CUDA, developers can dramatically accelerate computing applications.
GPUs are well suited for DNNs because of the way workloads are distributed: the sequential part of an application runs on the CPU (which is optimised for single-threaded performance), while the compute-intensive portion runs across thousands of GPU cores in parallel. CUDA developers write code in popular languages such as C, C++, Fortran, Python and MATLAB and express parallelism through language extensions.
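As a rough illustration of how that parallelism can be expressed from Python, the sketch below uses Numba’s CUDA extension to add two vectors element by element on the GPU. Numba is not mentioned in the article; it is simply one convenient way to write a CUDA-style kernel in Python, and the kernel and variable names here are illustrative.

import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
    # Each GPU thread handles exactly one output element.
    i = cuda.grid(1)
    if i < out.size:
        out[i] = x[i] + y[i]

x = np.arange(1_000_000, dtype=np.float32)
y = 2 * x
out = np.zeros_like(x)

threads_per_block = 256
blocks = (x.size + threads_per_block - 1) // threads_per_block
# Numba copies the host arrays to the GPU, launches the kernel and copies the result back.
add_kernel[blocks, threads_per_block](x, y, out)
print(out[:4])

Even in this tiny example, the programmer has to choose a thread-block size, compute a grid size and guard against out-of-bounds threads, which hints at why hand-written CUDA kernels become hard to get right at scale.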

Since its inception, the CUDA ecosystem has grown rapidly to include software development tools, services and partner-based solutions. However, writing GPU kernels remains challenging. According to the team at OpenAI, it is difficult to optimise GPUs for locality and parallelism, and GPU architectures are rapidly evolving and specialising (e.g., tensor cores in NVIDIA’s recent microarchitectures and matrix cores in AMD’s). To address this complexity, OpenAI has open-sourced Triton, a Python-like language and compiler for parallel programming. Triton provides a Python-based programming environment for productively writing custom DNN compute kernels capable of running at maximal throughput on modern GPU hardware.
Why Triton
CUDA implementations of such parallelisation strategies can be challenging to write. Popular libraries such as cuBLAS and cuDNN support only a restricted set of tensor operations, leaving the implementation of novel primitives to experts. The practical difficulty of GPU programming and the rising demand for DNN-based applications pushed developers towards Domain-Specific Languages (DSLs) and compilers. However, DSLs based on polyhedral machinery or scheduling languages remain less flexible and, according to OpenAI, slower than the best handwritten compute kernels available in libraries like cuBLAS, cuDNN or TensorRT. According to the original authors of Triton, these systems generally perform well for certain classes of problems, such as depthwise-separable convolutions, but are often much slower than vendor libraries in practice and lack the expressivity needed to implement the structured sparsity patterns required for linear speedups and efficient GPU usage.
The advantages of Triton come at the expense of increased programming effort. According to the researchers, Triton relies on adding tile-level operations and optimisations to traditional compilation pipelines, and it provides more flexibility along with features such as automatic inference. The purpose of Triton is to provide a stable frontend for DNN transcompilers as well as for programmers familiar with low-level GPU programming. It has CUDA-like syntax, NumPy-like semantics and is built on a “Single-Program, Multiple-Data” (SPMD) programming model. CUDA code also follows an SPMD model, in which each kernel is associated with an identifiable thread block. The Triton programming model is similar, but each kernel is single-threaded, automatically parallelised and associated with a set of global ranges that varies from instance to instance. This approach leads to simpler kernels in which CUDA-like concurrency primitives are non-existent, and it offers programmers more flexibility than current DSLs while allowing the compiler to aggressively optimise programs for data locality and parallelism. For instance, for element-wise operations in neural networks, Triton achieves peak performance with just around 25 lines of Python code, far less than an equivalent CUDA implementation would require.
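To make the programming model concrete, here is a minimal sketch of an element-wise addition kernel in the spirit of Triton’s introductory tutorial. The helper names (add_kernel, add) and the block size of 1024 are illustrative choices rather than anything prescribed by the article, and the snippet assumes PyTorch and a CUDA-capable GPU are available.

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each kernel instance works on one whole block of elements, identified by its program id.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n_elements = out.numel()
    # Launch a 1D grid of kernel instances; Triton parallelises each instance internally.
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out

x = torch.rand(98432, device="cuda")
y = torch.rand(98432, device="cuda")
print(torch.allclose(add(x, y), x + y))

Notice that the kernel body contains no per-thread indexing or synchronisation; each instance handles a block of elements selected by tl.program_id, and the compiler decides how that block is mapped onto hardware threads.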
Try Triton here.
You can install the latest stable release of Triton from pip:
pip install triton
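Assuming the installation succeeded, a quick sanity check is to import the package and print its version number (recent releases expose a __version__ attribute):

import triton
print(triton.__version__)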
Key takeaways
- The superior performance of Triton comes from a modular system architecture.
- Triton makes non-trivial modifications of matrix multiplication kernels accessible to developers with minimal GPU programming expertise.
- Triton simplifies the development of specialised kernels.
- Triton programs can be efficiently and automatically parallelised.