
Top 22 Resources You Need To Run Successful Projects Using CUDA

CUDA is a parallel computing platform and programming model created by NVIDIA and designed specifically for NVIDIA GPUs. It is used across scientific and research applications such as medical imaging, financial modelling and energy exploration. As GPUs have made many computing workloads more efficient, CUDA adoption has grown rapidly and has caught the interest of many developers.

In this article, Analytics India Magazine compiles a list of the top open-source resources for NVIDIA’s parallel programming platform:





  • CUDPP

CUDPP is a low-level library that many recommend for getting the best possible performance out of CUDA. It provides 15 parallel primitives and, compared to Thrust, is more performance-oriented at some cost to programmer productivity.

  • Thrust

Thrust is a parallel algorithms library modelled on the C++ Standard Template Library (STL). It offers a high-level interface, high performance and interoperability.
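To give a feel for the STL-like interface, here is a minimal sketch that assumes the Thrust headers shipped with the CUDA Toolkit. The functor `Square` and the helpers `sum_of_squares` and `sorted_on_gpu` are hypothetical names invented for this illustration; the Thrust calls themselves (`device_vector`, `transform_reduce`, `sort`, `copy`) are part of the library.

```cuda
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/sort.h>
#include <thrust/transform_reduce.h>

// Hypothetical functor for this example: squares its argument.
struct Square {
    __host__ __device__ int operator()(int x) const { return x * x; }
};

// Copies the input to the GPU and returns the sum of its squares.
int sum_of_squares(const std::vector<int>& host) {
    thrust::device_vector<int> dev(host.begin(), host.end());
    return thrust::transform_reduce(dev.begin(), dev.end(), Square(), 0,
                                    thrust::plus<int>());
}

// Sorts a copy of the input on the GPU and returns it to the host.
std::vector<int> sorted_on_gpu(const std::vector<int>& host) {
    thrust::device_vector<int> dev(host.begin(), host.end());
    thrust::sort(dev.begin(), dev.end());
    std::vector<int> out(dev.size());
    thrust::copy(dev.begin(), dev.end(), out.begin());
    return out;
}
```

The file is meant to be compiled with nvcc as a `.cu` source. Note that no kernel, grid size or block size appears anywhere; Thrust picks those internally.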

  • Hemi

Hemi simplifies writing CUDA C/C++ code. Parallel kernels can be written much like inline loops in CPU code, and C++ lambda functions can be launched as GPU kernels. Details such as thread block size and grid size become optimisation choices rather than requirements.
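The idea that launch configuration can be an optimisation detail can be sketched in plain CUDA with a grid-stride loop. This is not Hemi's API: `parallel_for_kernel`, `Saxpy` and `saxpy_on_gpu` are hypothetical names for this illustration, and a functor stands in for a lambda so that no extra compiler flags are needed. Error checking is omitted for brevity.

```cuda
#include <cassert>
#include <vector>

// Generic kernel: applies body(i) for every i in [first, last).
// The grid-stride loop makes the result independent of grid and block size.
template <typename Body>
__global__ void parallel_for_kernel(int first, int last, Body body) {
    int stride = blockDim.x * gridDim.x;
    for (int i = first + blockIdx.x * blockDim.x + threadIdx.x; i < last; i += stride) {
        body(i);
    }
}

// Loop body written as a functor: y[i] = a * x[i] + y[i].
struct Saxpy {
    float a;
    const float* x;
    float* y;
    __device__ void operator()(int i) const { y[i] = a * x[i] + y[i]; }
};

// Runs SAXPY on the GPU; blocks and threads are tuning knobs only.
std::vector<float> saxpy_on_gpu(float a, const std::vector<float>& x,
                                std::vector<float> y,
                                int blocks = 4, int threads = 128) {
    assert(x.size() == y.size());
    int n = static_cast<int>(x.size());
    size_t bytes = n * sizeof(float);
    float* d_x = nullptr;
    float* d_y = nullptr;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), bytes, cudaMemcpyHostToDevice);
    parallel_for_kernel<<<blocks, threads>>>(0, n, Saxpy{a, d_x, d_y});
    cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    return y;
}
```

Calling `saxpy_on_gpu` with different `blocks` and `threads` values should return the same vector; only the running time changes.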

  • CUB

CUB stands for CUDA Unbound. It is a performance-oriented library designed specifically for CUDA applications, and because it targets CUDA exclusively, it is slightly more flexible than Thrust.
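As a rough illustration of the lower-level style, the sketch below sums an array with `cub::DeviceReduce::Sum`, assuming the CUB headers bundled with the CUDA Toolkit. The wrapper name `cub_sum` is hypothetical, and error checking is omitted for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cub/cub.cuh>

// Sums a non-empty host array on the GPU using CUB's device-wide reduction.
int cub_sum(const std::vector<int>& host) {
    assert(!host.empty());
    int n = static_cast<int>(host.size());
    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // First call with a null buffer only reports the scratch space needed.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);

    // Second call, with the buffer allocated, performs the reduction.
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);

    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_temp);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

The caller owns the temporary storage here, which is one place where CUB exposes more control than a Thrust call would.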

  • chag::pp: Parallel primitives library

The authors of chag::pp report the fastest implementations of stream compaction and prefix sum (demonstration). The library provides implementations of reduction, prefix operations (scan), radix sort and compaction.

Best papers on CUDA

  • Efficient Parallel Scan Algorithms for Many-core GPUs: The paper shows how scan and segmented scan algorithms can be implemented using a divide-and-conquer approach that builds all scan primitives on top of a set of primitive intra-warp scan routines.
  • Multireduce and Multiscan on Modern GPUs: As the name suggests, this master’s thesis by Marco Eilers details how to implement multireduce and multiscan on the GPU.
  • Modern GPU: Modern GPU describes algorithms and strategies for writing CUDA code that runs as fast as possible. It also includes a library in which all the explained concepts are implemented.
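The idea of building a larger scan from intra-warp scans can be sketched as follows. This is an illustrative single-block version and not the code from any of the papers above: it assumes warp shuffle intrinsics (`__shfl_up_sync`) for the intra-warp step, an input of at most 1024 elements, and hypothetical names (`warp_inclusive_scan`, `block_inclusive_scan`, `scan_on_gpu`). Error checking is omitted for brevity.

```cuda
#include <cassert>
#include <vector>

// Inclusive prefix sum across the 32 lanes of one warp.
__device__ int warp_inclusive_scan(int v) {
    int lane = threadIdx.x & 31;
    for (int offset = 1; offset < 32; offset <<= 1) {
        int up = __shfl_up_sync(0xffffffffu, v, offset);
        if (lane >= offset) v += up;
    }
    return v;
}

// Single-block inclusive scan; blockDim.x must be a multiple of 32.
__global__ void block_inclusive_scan(const int* in, int* out, int n) {
    __shared__ int warp_totals[32];
    int tid = threadIdx.x;
    int lane = tid & 31;
    int warp = tid >> 5;

    int v = (tid < n) ? in[tid] : 0;
    v = warp_inclusive_scan(v);                 // step 1: scan inside each warp
    if (lane == 31) warp_totals[warp] = v;      // last lane holds the warp total
    __syncthreads();

    if (warp == 0) {                            // step 2: scan the warp totals
        int num_warps = blockDim.x >> 5;
        int t = (lane < num_warps) ? warp_totals[lane] : 0;
        warp_totals[lane] = warp_inclusive_scan(t);
    }
    __syncthreads();

    if (warp > 0) v += warp_totals[warp - 1];   // step 3: add preceding warps
    if (tid < n) out[tid] = v;
}

// Host wrapper for inputs of at most 1024 elements.
std::vector<int> scan_on_gpu(const std::vector<int>& host) {
    int n = static_cast<int>(host.size());
    assert(n <= 1024);
    if (n == 0) return {};
    int threads = ((n + 31) / 32) * 32;         // round up to whole warps
    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    block_inclusive_scan<<<1, threads>>>(d_in, d_out, n);
    std::vector<int> result(n);
    cudaMemcpy(result.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Larger inputs would need a further level that scans per-block totals in the same way, which this sketch leaves out.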

NVIDIA’s CUDA programming guide and best practices:

Online Courses for CUDA

  • CUDA programming Masterclass – Udemy: This course covers parallel programming on GPUs from basic concepts to advanced algorithm implementations. It explains thread organisation using blockDim, blockIdx and gridDim, as well as unique index calculation for a 2D grid, a sum array implementation, memory transfer between host and device, and device properties.
  • [Coursera] Heterogeneous Parallel Programming by Wen-mei W. Hwu (University of Illinois): This course introduces concepts, languages, techniques and patterns for programming heterogeneous and massively parallel processors. It covers heterogeneous computing architectures, data-parallel programming models, techniques for memory bandwidth management and parallel algorithm patterns.
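Several of the topics named in these course outlines fit in one short program. The sketch below is not taken from either course; it assumes a 2D grid of 16x16 blocks and hypothetical names (`sum_arrays`, `sum_on_gpu`), and shows a unique global index computed from `blockIdx`, `blockDim` and `gridDim`, an array sum, host/device transfers and a device property query. Error checking is omitted for brevity.

```cuda
#include <cassert>
#include <vector>

// Element-wise c = a + b, launched on a 2D grid of 2D blocks.
__global__ void sum_arrays(const int* a, const int* b, int* c, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global x coordinate
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global y coordinate
    int width = gridDim.x * blockDim.x;                // threads per grid row
    int gid = row * width + col;                       // unique linear index
    if (gid < n) c[gid] = a[gid] + b[gid];
}

std::vector<int> sum_on_gpu(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    int n = static_cast<int>(a.size());
    if (n == 0) return {};
    size_t bytes = n * sizeof(int);

    // Device properties: check that the chosen block size is supported.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    dim3 block(16, 16);
    int per_block = static_cast<int>(block.x * block.y);
    assert(per_block <= prop.maxThreadsPerBlock);

    // Enough blocks to cover n elements, arranged 8 blocks wide.
    int blocks = (n + per_block - 1) / per_block;
    dim3 grid(8, (blocks + 7) / 8);

    int* d_a = nullptr;
    int* d_b = nullptr;
    int* d_c = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);   // host -> device
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);
    sum_arrays<<<grid, block>>>(d_a, d_b, d_c, n);
    std::vector<int> c(n);
    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return c;
}
```

The `gid < n` guard matters because the grid is rounded up and usually contains more threads than elements.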

Best books:

  • CUDA by Example by Jason Sanders and Edward Kandrot
  • Programming Massively Parallel Processors: A Hands-on Approach by David Kirk and Wen-mei Hwu
  • GPU Computing Gems, edited by Wen-mei Hwu
  • CUDA Application Design and Development by Rob Farber
  • Parallel Calculus in CUDA by Nicolas Hecquet
  • CUDA Cookbook by Bharatkumar Sharma, Jack Han


Best presentations:

  • CUDA C/C++ BASICS: This presentation from NVIDIA explains the concepts of CUDA kernels, threads, thread blocks, thread synchronisation, memory management and shared memory.
  • Optimising Parallel Reduction in CUDA: This presentation shows that a parallel reduction is relatively simple to implement and walks through optimising it for speed.
  • Advanced CUDA – Optimising to Get 20x Performance: This presentation covers Tesla 10-series architecture details and optimisation case studies, with demos for particle simulation, finite difference and molecular dynamics.
  • Better Performance at Lower Occupancy: This presentation explains how better performance can be achieved by assigning more parallel work to each thread and by using instruction-level parallelism.
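Two of the ideas from these presentations, a shared-memory tree reduction and giving each thread more independent work, can be combined in one small kernel. The sketch below is an illustration under stated assumptions, not code from the slides: the names `reduce_sum` and `reduce_on_gpu`, the fixed launch of 8 blocks of 256 threads and the final `atomicAdd` are choices made for this example. Error checking is omitted for brevity.

```cuda
#include <vector>

// Sums in[0..n) into *out. blockDim.x must be a power of two.
__global__ void reduce_sum(const int* in, int* out, int n) {
    extern __shared__ int partial[];
    int tid = threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    // Each thread sums several elements with two independent accumulators,
    // so consecutive additions do not depend on each other.
    int acc0 = 0;
    int acc1 = 0;
    int i = blockIdx.x * blockDim.x + tid;
    for (; i + stride < n; i += 2 * stride) {
        acc0 += in[i];
        acc1 += in[i + stride];
    }
    if (i < n) acc0 += in[i];
    partial[tid] = acc0 + acc1;
    __syncthreads();

    // Tree reduction in shared memory with sequential addressing.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(out, partial[0]);   // one atomic per block
}

int reduce_on_gpu(const std::vector<int>& host) {
    int n = static_cast<int>(host.size());
    if (n == 0) return 0;
    const int threads = 256;   // power of two, required by the tree step
    const int blocks = 8;      // deliberately few, so threads loop over elements
    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));
    reduce_sum<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out, n);
    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Whether fewer threads doing more work is actually faster depends on the GPU and the problem size, so the block and grid counts here are starting points to measure rather than recommendations.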


Sameer Balaganur
Sameer is an aspiring Content Writer. Occasionally writes poems, loves food and is head over heels with Basketball.
