CUDA is a parallel computing platform and programming model created by NVIDIA and designed specifically for NVIDIA GPUs. It is used across science and research, in applications such as medical imaging, financial modelling and energy exploration. As GPUs have made computing more efficient and their usage has grown, CUDA adoption has increased rapidly and caught the interest of many developers.
In this article, Analytics India Magazine compiles a list of the top open-source resources for NVIDIA’s parallel programming platform:
Libraries
- CUDPP
CUDPP (CUDA Data Parallel Primitives) is a low-level library that many recommend for getting the best possible performance from CUDA. It provides 15 parallel primitives and, compared with Thrust, is more performance-oriented at some cost to programmer productivity.
- Thrust
Thrust is a parallel algorithms library modelled on the C++ Standard Template Library (STL). It offers a high-level interface, high performance and interoperability with technologies such as CUDA, TBB and OpenMP.
- Hemi
Hemi simplifies writing CUDA C/C++ code. Parallel kernels can be written much like inline loops in CPU code, and C++ lambda functions can be launched as GPU kernels. Details such as thread block size and grid size become optimisation choices rather than requirements.
- CUB
CUB stands for CUDA Unbound. It is a performance-oriented library and, because it was written explicitly for CUDA applications, it is slightly more flexible than Thrust.
- chag::pp (parallel primitives library)
The authors of chag::pp report the fastest implementations of stream compaction and prefix sum. The library provides implementations of reduction, prefix operations (scan), radix sort and compaction.
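The primitives these libraries offer (reduction, scan, sort and compaction) can be sketched with Thrust's STL-style interface. This is a minimal illustration, not a benchmark or a comparison between the libraries; the names `IsEven`, `PrimitiveResults` and `run_primitives` are hypothetical helpers introduced for the example, while the `thrust::` calls are part of Thrust itself.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/sort.h>
#include <vector>

// Hypothetical predicate for stream compaction: keep only even values.
struct IsEven {
    __host__ __device__ bool operator()(int x) const { return x % 2 == 0; }
};

// Hypothetical container for the results, copied back to the host.
struct PrimitiveResults {
    int sum;
    std::vector<int> prefix_sums;
    std::vector<int> sorted;
    std::vector<int> evens;
};

PrimitiveResults run_primitives(const std::vector<int>& input) {
    // Constructing a device_vector from host iterators copies the data to the GPU.
    thrust::device_vector<int> d_in(input.begin(), input.end());
    PrimitiveResults r;

    // Reduction: sum of all elements.
    r.sum = thrust::reduce(d_in.begin(), d_in.end(), 0);

    // Scan: inclusive prefix sum.
    thrust::device_vector<int> d_scan(d_in.size());
    thrust::inclusive_scan(d_in.begin(), d_in.end(), d_scan.begin());

    // Compaction: copy the elements that satisfy the predicate, preserving order.
    thrust::device_vector<int> d_evens(d_in.size());
    auto evens_end = thrust::copy_if(d_in.begin(), d_in.end(), d_evens.begin(), IsEven());
    d_evens.resize(evens_end - d_evens.begin());

    // Sort: ascending order, on a copy so the input stays untouched.
    thrust::device_vector<int> d_sorted = d_in;
    thrust::sort(d_sorted.begin(), d_sorted.end());

    // Copy every result back to host memory.
    r.prefix_sums.resize(d_scan.size());
    thrust::copy(d_scan.begin(), d_scan.end(), r.prefix_sums.begin());
    r.evens.resize(d_evens.size());
    thrust::copy(d_evens.begin(), d_evens.end(), r.evens.begin());
    r.sorted.resize(d_sorted.size());
    thrust::copy(d_sorted.begin(), d_sorted.end(), r.sorted.begin());
    return r;
}
```

Compiled with `nvcc`, calling `run_primitives({5, 2, 8, 1})` would run all four primitives on the GPU without any explicit kernel or launch configuration, which is the productivity trade-off the descriptions above refer to.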
Best papers on CUDA
- Efficient Parallel Scan Algorithms for Many-core GPUs: The paper shows how scan and segmented scan algorithms can be implemented using a divide-and-conquer approach, which builds all scan primitives on top of a small set of intra-warp scan routines.
- Multireduce and Multiscan on Modern GPUs: As the name suggests, the paper gives details about how one can implement Multireduce and Multiscan on the GPU. This is a master’s thesis by Marco Eilers.
- Modern GPU: Modern GPU describes algorithms and strategies for writing CUDA code that runs as fast as possible. It also includes a library in which all the explained concepts are implemented.
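The intra-warp scan that the first paper builds on can be sketched as follows. This is not the paper's code: it assumes CUDA 9 or later and swaps in the warp shuffle intrinsic `__shfl_up_sync` in place of a shared-memory formulation. The names `warp_inclusive_scan`, `scan_one_warp` and `scan32` are hypothetical, and the host helper handles exactly one warp of 32 values with error checking omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kWarpSize = 32;
constexpr unsigned kFullMask = 0xffffffffu;  // all 32 lanes take part

// Inclusive prefix sum across the 32 lanes of one warp.
// After the step with a given offset, each lane holds the sum of up to
// 2 * offset consecutive inputs ending at its own position.
__device__ int warp_inclusive_scan(int value) {
    const int lane = threadIdx.x % kWarpSize;
    for (int offset = 1; offset < kWarpSize; offset *= 2) {
        // Every lane executes the shuffle; only lanes with a valid
        // neighbour `offset` positions below use the result.
        const int neighbour = __shfl_up_sync(kFullMask, value, offset);
        if (lane >= offset) value += neighbour;
    }
    return value;
}

__global__ void scan_one_warp(const int* in, int* out) {
    out[threadIdx.x] = warp_inclusive_scan(in[threadIdx.x]);
}

// Hypothetical host helper: scans exactly 32 values, returns {} otherwise.
std::vector<int> scan32(const std::vector<int>& input) {
    if (input.size() != static_cast<size_t>(kWarpSize)) return {};
    const size_t bytes = kWarpSize * sizeof(int);
    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, input.data(), bytes, cudaMemcpyHostToDevice);
    scan_one_warp<<<1, kWarpSize>>>(d_in, d_out);
    std::vector<int> output(kWarpSize);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(output.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return output;
}
```

A block-level or grid-level scan would then be layered on top, in the divide-and-conquer spirit described above: scan within each warp, scan the per-warp totals, and add those totals back to each warp's results.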
NVIDIA’s CUDA programming guide and best practices guide are also worth consulting.
Online Courses for CUDA
- CUDA programming Masterclass – Udemy: This course covers parallel programming on GPUs, from basic concepts to advanced algorithm implementations. It explains how threads are organised using built-in variables such as blockDim, blockIdx and gridDim. It also teaches unique index calculation for a 2D grid, sum array implementation, memory transfer between host and device, and querying device properties.
- [Coursera] Heterogeneous Parallel Programming by Wen-mei W. Hwu (University of Illinois): This course introduces concepts, languages, techniques and patterns for programming heterogeneous, massively parallel processors. It covers heterogeneous computing architectures, data-parallel programming models, techniques for memory bandwidth management and parallel algorithm patterns.
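Several of the topics named in the first course (unique index calculation for a 2D grid, sum array, host-device memory transfer and device properties) fit into one short sketch. This is an illustrative example rather than material from the course; `sum_arrays`, `sum_on_device` and `max_threads_per_block` are hypothetical names, the 16 x 8 block and 4-wide grid are arbitrary choices, both inputs are assumed to have the same length, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread computes one element. The unique global index is built from
// the block's position in the 2D grid and the thread's position in the 2D block.
__global__ void sum_arrays(const float* a, const float* b, float* c, int n) {
    const int threads_per_block = blockDim.x * blockDim.y;
    const int thread_in_block = threadIdx.y * blockDim.x + threadIdx.x;
    const int block_in_grid = blockIdx.y * gridDim.x + blockIdx.x;
    const int gid = block_in_grid * threads_per_block + thread_in_block;
    if (gid < n) c[gid] = a[gid] + b[gid];  // guard the padded tail
}

// Device property query: the largest block size device 0 supports.
int max_threads_per_block() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    return prop.maxThreadsPerBlock;
}

// Assumes a.size() == b.size().
std::vector<float> sum_on_device(const std::vector<float>& a,
                                 const std::vector<float>& b) {
    const int n = static_cast<int>(a.size());
    if (n == 0) return {};
    const size_t bytes = n * sizeof(float);

    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Host -> device transfer of both inputs.
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    // 128 threads per block, blocks laid out in a grid 4 wide.
    const dim3 block(16, 8);
    const int blocks_needed = (n + 127) / 128;
    const dim3 grid(4, (blocks_needed + 3) / 4);
    sum_arrays<<<grid, block>>>(d_a, d_b, d_c, n);

    // Device -> host transfer of the result (waits for the kernel).
    std::vector<float> c(n);
    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return c;
}
```

The grid is rounded up to whole blocks, so some threads fall past the end of the arrays; the `gid < n` check keeps them from touching memory they do not own.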
Best books:
- CUDA by Example by Jason Sanders and Edward Kandrot
- Programming Massively Parallel Processors: A Hands-on Approach by David Kirk and Wen-mei Hwu
- GPU Computing Gems, edited by Wen-mei Hwu
- CUDA Application Design and Development by Rob Farber
- Parallel Calculus in CUDA by Nicolas Hecquet
- CUDA Cookbook by Bharatkumar Sharma, Jack Han
Presentations:
- CUDA C/C++ BASICS: This presentation from NVIDIA explains the concepts of CUDA kernels, threads, thread blocks, thread synchronisation, memory management and shared memory.
- Optimising Parallel Reduction in CUDA: This presentation shows that the reduction algorithm is relatively simple to implement and walks through how to make it fast.
- Advanced CUDA – Optimising to Get 20x Performance: This presentation covers Tesla 10-series architecture details and optimisation case studies, with demos, for particle simulation, finite difference and molecular dynamics.
- Better Performance at Lower Occupancy: This presentation explains how better performance can be achieved by assigning more parallel work to each thread and by using instruction-level parallelism.
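Two ideas from these presentations, a shared-memory tree reduction and giving each thread more than one element to process, can be combined in a short sketch. This is not code from any of the slides: `block_sums` and `reduce_on_device` are hypothetical names, the block size of 256 and the cap of 64 blocks are arbitrary, and the per-block `int` sums are assumed not to overflow. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 256;  // must be a power of two for the tree step
constexpr int kMaxBlocks = 64;   // small grid, so each thread gets more work

__global__ void block_sums(const int* in, int* partial, int n) {
    __shared__ int scratch[kBlockSize];

    // Step 1: grid-stride loop. Each thread accumulates several elements
    // in a register before any synchronisation is needed.
    int sum = 0;
    const int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        sum += in[i];
    }
    scratch[threadIdx.x] = sum;
    __syncthreads();

    // Step 2: tree reduction in shared memory with sequential addressing.
    for (int s = blockDim.x / 2; s > 0; s /= 2) {
        if (threadIdx.x < s) scratch[threadIdx.x] += scratch[threadIdx.x + s];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (threadIdx.x == 0) partial[blockIdx.x] = scratch[0];
}

long long reduce_on_device(const std::vector<int>& input) {
    const int n = static_cast<int>(input.size());
    if (n == 0) return 0;

    int blocks = (n + kBlockSize - 1) / kBlockSize;
    if (blocks > kMaxBlocks) blocks = kMaxBlocks;

    int* d_in = nullptr;
    int* d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_partial, blocks * sizeof(int));
    cudaMemcpy(d_in, input.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    block_sums<<<blocks, kBlockSize>>>(d_in, d_partial, n);

    // The handful of per-block sums is finished on the host.
    std::vector<int> partial(blocks);
    cudaMemcpy(partial.data(), d_partial, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    long long total = 0;
    for (int p : partial) total += p;
    return total;
}
```

Capping the grid means large inputs are covered by looping rather than by launching more threads, which is one simple way to trade occupancy for more work per thread. Using several independent accumulators per thread would be a further step towards instruction-level parallelism.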