Recently, NVIDIA announced the newest CUDA Toolkit software release, 11.8, which focuses on enhancing the programming model and speeding up CUDA applications through new hardware capabilities. CUDA is the proprietary platform NVIDIA provides for its graphics cards to perform highly concurrent math calculations.
The parallel computing war has always been intense. In 2021, AMD released its open-source GPUFORT as a competitor to NVIDIA’s CUDA, but CUDA still has a firm grip on the industry. Beyond NVIDIA’s products getting more powerful and cheaper every year, there are several other reasons for AMD’s failure to leave its mark on the market.
Despite each brand having its own set of strengths and weaknesses, AMD is outperformed by NVIDIA.
No functional alternative(s)
Most of the progress in AI in the past decade has been made using CUDA libraries, largely because AMD didn’t have a functional alternative. The closest alternative is OpenCL (Open Computing Language). But in comparison, CUDA is more stable and modern and has better compatibility. Despite claims that OpenCL has an API that can match CUDA, it isn’t easy to use. A study comparing CUDA programmes with OpenCL on NVIDIA GPUs showed that CUDA was 30% faster than OpenCL.
Furthermore, NVIDIA cards now have tensor cores that accelerate training and inference on AI models. AMD’s FidelityFX Super Resolution offers comparable upscaling features and works on almost any GPU, but AMD has no solid answer to tensor cores.
Not self-sustaining
Another reason AMD is so far behind is its lack of support for its own platforms. Users can write and run CUDA code if they buy an NVIDIA GPU, and they can distribute that code to other users. On the other hand, ROCm (Radeon Open Compute), AMD’s open software compute stack for system deployments, doesn’t work on consumer Radeon (RDNA) cards or on Windows, and GUI-based software applications are currently not supported.
Late in the game
AMD has evidently been behind in this race for almost a decade, which has led to much wider adoption of the CUDA ecosystem. So now, not only does AMD need to invest in R&D to build better (or at least on-par) products, it also needs to drive adoption of its own ecosystem. And since switching costs for researchers and developers are not insignificant, that is an additional barrier it needs to break through.
At the moment, much of the work on large-scale ML deployments goes into handcrafting hardware-specific optimisations. However, the real battleground is compilers, and NVIDIA has devoted significant attention to them from the very beginning.
The bottom line
Considering the above discussion, the conclusion isn’t all too surprising. NVIDIA clearly takes a notable lead in the current AI landscape, as it primarily focuses on GPGPU programming whereas AMD focuses on gaming. Therefore, most GPU programming is done on CUDA. AMD’s ROCm now has PyTorch support, so we might see more tools built around AMD backends. Coupled with AMD’s new accelerators for data centres, this could all change in the near future.
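One practical consequence of the ROCm builds of PyTorch is that AMD GPUs surface through the same `torch.cuda` API that NVIDIA devices use, so much existing CUDA-backend code can run unchanged. A minimal sketch of backend-agnostic device selection (the `pick_device` helper is illustrative, not a PyTorch API; it falls back to CPU when PyTorch or a GPU is unavailable):

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when a GPU backend (CUDA or ROCm) is usable, else "cpu".

    On ROCm builds of PyTorch, AMD GPUs report availability through
    torch.cuda.is_available(), just like NVIDIA GPUs on CUDA builds.
    """
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch not installed; run on CPU
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"
```

Typical usage would be `model.to(pick_device())`, leaving the same training script portable across NVIDIA and AMD hardware.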