In modern data science projects, where we are heading more and more towards neural networks, the memory consumption of networks and their layers is much higher than that of ordinary machine learning programs. We need to build our models so that they perform accurately and quickly without overloading the computer system. Normally, when we are modelling neural networks, we require a GPU-enabled system to speed up training. To work efficiently with GPUs, we also require a deep knowledge of GPU programming. GPU programming, or general-purpose computing on a GPU, is the practice of using the GPU together with the CPU to speed up computation in applications.
Several GPU programming platforms are available, like Nvidia’s CUDA, OpenCL and OpenACC; these platforms allow users to focus on high-level computing concepts by hiding the language barrier between the GPU and CPU. But with today’s pace of model development, one should know these technologies, and they are becoming harder to learn. If we are not familiar with this kind of advanced computing and need to use it, learning it will take time, because the concepts run deep and require rigorous programming. This is where Triton comes in to save our time and effort.
Triton is a Python-like programming language that lets us write GPU code without much GPU programming experience. We can write efficient GPU kernels in just a few lines of code, where a hand-written CUDA program would take a lot of effort. These kernels can also be up to 2x more efficient than equivalent torch implementations.
The modern architecture of GPUs can be divided into three parts.
- DRAM (Dynamic Random Access Memory) – DRAM is a type of semiconductor random access memory that stores each bit of data in its own memory cell. It holds the program code and data needed by the processor and is cleared once the process is done.
- SRAM (Static Random Access Memory) – SRAM is a random access memory that holds data bits and programs in its memory cells for as long as power is supplied to the computer.
- ALU (Arithmetic Logic Unit) – The ALU is the part of the processor that carries out arithmetic and logical operations on operands according to the instructions it receives.

[Image: basic architecture of a GPU, showing DRAM, SRAM and the cores (SMs).]
Here, a core (SM) is a streaming multiprocessor. Each SM contains multiple cores, and work gets distributed between them. Writing any GPU program involves many challenges, which we can roughly divide into three cases.
- Memory transfers from DRAM must be coalesced into large transactions to make full use of the wide bus of the memory interface.
- Data must be stored in SRAM in a way that allows it to be reused and shared repeatedly without making it difficult or expensive to retrieve.
- Computations must be partitioned and scheduled carefully, both across SMs and within each SM, so that the ALUs can exploit thread-level parallelism.
All of these challenges are active subjects of research and are difficult even for experienced GPU programmers. The main motive of Triton is to automate these optimizations so that data science practitioners can focus on the performance of the model instead of on distributing and scheduling work on the GPU. Triton provides us with three types of automation.
- Memory coalescing
- Shared memory management
- Scheduling (within SMs)
Next in this article, we will build a fused softmax operation using Triton, torch native and torch jit, and compare their performance. Before building the operation, let’s understand the fused softmax kernel. Here, a kernel means a function that is compiled for and executed on the GPU, not the kernel functions used in machine learning to implicitly map data points into high-dimensional spaces and learn complex decision boundaries.
The softmax function turns a vector of K real values into a vector of K real values that sum to 1. “Fusing” means combining computation steps: it is an implementation trick to run code more efficiently by executing similar operations together on a single piece of hardware (GPU, CPU or TPU) instead of as separate passes. Therefore, a “fused” layer is a layer whose operations benefit from such a fused implementation. The goal of the following code is to write a custom GPU kernel that performs this simple softmax operation in a single fused pass.
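For intuition, here is a minimal sketch in plain PyTorch of what the softmax computes on an arbitrary three-element vector; note that this unfused version materializes every intermediate result in memory, which is exactly the traffic a fused kernel avoids.
import torch

x = torch.tensor([1.0, 2.0, 3.0])
# subtract the maximum for numerical stability, exponentiate, then normalize
z = x - x.max()
probs = torch.exp(z) / torch.exp(z).sum()
print(probs)        # roughly [0.09, 0.24, 0.67]
print(probs.sum())  # 1.0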
Implementation in Triton
For GPU programming, there is a chance that our computer system does not have a GPU or TPU, so we implement this code in Google Colab. To enable the GPU in Google Colab, navigate to Edit, then to Notebook settings, and select GPU from the Hardware accelerator drop-down menu.
After enabling the GPU, we can start with the code.
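Before writing any kernels, it is worth quickly confirming that the runtime actually sees the GPU; a minimal optional check (the printed device name depends on whichever GPU Colab assigns):
import torch

# should print True if the hardware accelerator was enabled correctly
print(torch.cuda.is_available())
if torch.cuda.is_available():
    # name of the GPU assigned by Colab, e.g. a Tesla T4 or similar
    print(torch.cuda.get_device_name(0))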
Computing the kernel using the torch jit.
Input:
import torch

@torch.jit.script
def Pytorch_naive(x):
    # row-wise maximum, subtracted for numerical stability
    x_max = x.max(dim=1)[0]
    z = x - x_max[:, None]
    # exponentiate the shifted values and normalize row by row
    nume = torch.exp(z)
    denomi = nume.sum(dim=1)
    ret = nume / denomi[:, None]
    return ret
Here, we are implementing the softmax naively in PyTorch and computing the row-wise softmax of an M x N matrix x. This requires reading 5MN + 2M elements from DRAM and writing back 3MN + 2M elements, whereas a fused kernel would only need to read the MN input elements and write the MN output elements.
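To make these numbers concrete, here is a small back-of-the-envelope calculation of the DRAM traffic for a hypothetical 4096 x 1024 float32 matrix (the sizes are only illustrative):
M, N = 4096, 1024        # illustrative matrix shape
bytes_per_element = 4    # float32

# unfused version: 5MN + 2M elements read, 3MN + 2M elements written
naive_elements = (5 * M * N + 2 * M) + (3 * M * N + 2 * M)
# fused kernel: x is read once and y is written once
fused_elements = 2 * M * N

print(naive_elements * bytes_per_element / 1e6)   # ~134 MB moved
print(fused_elements * bytes_per_element / 1e6)   # ~34 MB moved
print(naive_elements / fused_elements)            # ~4x less traffic when fused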
Installing Triton in the notebook.
Input:
!pip install triton
Output:

Computing the kernel using Triton.
Input:
import triton
import triton.language as tl

@triton.jit
def Triton_softmax(Y, X, stride_xm, stride_ym, M, N, **meta):
    # each program instance handles one row of the input matrix
    m = tl.program_id(0)
    # column offsets within the block (BLOCK is a power of two)
    n = tl.arange(0, meta['BLOCK'])
    # pointers to the elements of row m
    X = X + m * stride_xm + n
    # out-of-bounds columns are loaded as -inf so they do not affect the result
    x = tl.load(X, mask=n < N, other=-float('inf'))
    # numerically stable softmax: subtract the row maximum, exponentiate, normalize
    z = x - tl.max(x, axis=0)
    numerator = tl.exp(z)
    denominator = tl.sum(numerator, axis=0)
    y = numerator / denominator
    # write the result back to row m of Y
    Y = Y + m * stride_ym + n
    tl.store(Y, y, mask=n < N)
In the above code, each program instance loads one row of the input matrix X, normalizes it and writes the result to Y. A limitation of Triton is that each block must contain a power-of-two number of elements, so we need to pad each row internally and guard the memory operations properly with a mask.
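To see why padding with -inf is harmless (this is what the mask with other=-float('inf') in the kernel above relies on), here is a small PyTorch-only sketch with arbitrary sizes: the padded positions become exp(-inf) = 0, so they contribute nothing to the normalizing sum and the real entries stay unchanged.
import torch

row = torch.randn(781)                       # a row whose length is not a power of two
padded = torch.full((1024,), -float('inf'))  # padded up to the next power of two
padded[:781] = row

# the first 781 entries of the padded softmax match the softmax of the original row
print(torch.allclose(torch.softmax(row, dim=0),
                     torch.softmax(padded, dim=0)[:781]))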
Input:
def next_power_of_2(n):
    # returns the smallest power of two greater than or equal to n
    n -= 1
    n |= n >> 1
    n |= n >> 2
    n |= n >> 4
    n |= n >> 8
    n |= n >> 16
    n += 1
    return n
Next, a helper function that pads each row to the next power of two and enqueues the kernel.
Input:
def Helper_softmax(x):
    M, N = x.shape
    # block size: the smallest power of two greater than or equal to N
    BLOCK = next_power_of_2(N)
    # use more warps for larger block sizes
    num_warps = 4
    if BLOCK >= 2048: num_warps = 8
    if BLOCK >= 4096: num_warps = 16
    # allocate the output and launch one program instance per row
    y = torch.empty_like(x)
    Triton_softmax[(M, )](y, x, x.stride(0), y.stride(0), M, N, num_warps=num_warps, BLOCK=BLOCK)
    return y
We defined the helper function, which will enqueue the kernel. The block size is chosen as the smallest power of two that is greater than or equal to the number of columns of the x matrix.
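As a quick illustration of what the helper chooses for the test matrix used below (1823 rows, 781 columns), we can print the launch parameters; this simply re-runs the logic above, reusing the next_power_of_2 helper we just defined.
N = 781
BLOCK = next_power_of_2(N)
num_warps = 4
if BLOCK >= 2048: num_warps = 8
if BLOCK >= 4096: num_warps = 16

print(BLOCK)      # 1024, the smallest power of two >= 781
print(num_warps)  # 4
# the launch grid is (M, ), i.e. one Triton program instance per row of x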
To make sure the kernel works, we test it on a matrix with an irregular number of rows and columns; this verifies that the padding and masking guard the memory operations properly.
Input:
torch.manual_seed(0)
x = torch.randn(1823, 781, device='cuda')
y_tri = Helper_softmax(x)
y_ref = torch.softmax(x, axis=1)
# should print True if the Triton kernel matches torch.softmax
print(torch.allclose(y_tri, y_ref))
Output:

Next, we can compare the performance of Triton with torch.softmax and our naive PyTorch implementation (Pytorch_naive). To do this, we benchmark the operation across different input sizes.
Input:
@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=['N'],                                   # argument to vary along the x-axis
        x_vals=[128 * i for i in range(2, 100)],         # different numbers of columns
        line_arg='provider',                             # one line in the plot per implementation
        line_vals=['triton', 'torch-native', 'torch-jit'],
        line_names=['Triton', 'Torch_native', 'Torch_jit'],
        styles=[('blue', '-'), ('green', '-'), ('yellow', '-')],
        ylabel="GB/s",                                   # effective memory bandwidth
        plot_name="softmax-performance",
        args={'M': 4096}                                 # fixed number of rows
    )
)
def benchmark(M, N, provider):
    x = torch.randn(M, N, device='cuda', dtype=torch.float32)
    if provider == 'torch-native':
        ms, min_ms, max_ms = triton.testing.do_bench(lambda: torch.softmax(x, axis=-1))
    if provider == 'triton':
        ms, min_ms, max_ms = triton.testing.do_bench(lambda: Helper_softmax(x))
    if provider == 'torch-jit':
        ms, min_ms, max_ms = triton.testing.do_bench(lambda: Pytorch_naive(x))
    # effective bandwidth: each element is read once and written once, hence the factor of 2
    gbps = lambda ms: 2 * x.nelement() * x.element_size() * 1e-9 / (ms * 1e-3)
    return gbps(ms), gbps(max_ms), gbps(min_ms)
benchmark.run(show_plots=True, print_data=True)
Output:
[Plot: softmax-performance – GB/s against N for Triton, Torch_native and Torch_jit]
We can also print the generated data without the plot.
Input:
benchmark.run(print_data = True)
Output:

In the above output, we can see how much faster Triton is compared to the torch jit and torch native softmax. When the data is too large to fit in cache, the non-fused implementations transfer almost double the necessary amount of memory or more, while the fused Triton kernel reads and writes each element only once, which is why it comes out faster. From the code, we can also see that Triton is easier to write, understand and maintain than PyTorch’s hand-written CUDA kernels.
References:
- Introducing Triton
- Fused Softmax
- Google Colab for the codes