
Top 22 Resources You Need To Run Successful Projects Using CUDA


CUDA is a parallel computing platform and programming model created by NVIDIA, designed specifically for NVIDIA GPUs. It is used across science and industry, with applications in medical imaging, financial modelling and energy exploration. With the growing adoption of GPUs, which have made computing far more efficient, CUDA is spreading rapidly and has caught the interest of many developers.

In this article, Analytics India Magazine compiles a list of the top open-source resources for NVIDIA’s parallel programming platform:

Libraries

  • CUDPP

CUDPP is a low-level library recommended by many for squeezing the best possible performance out of CUDA. The library provides 15 parallel primitives; compared to Thrust, it is more performance-oriented but compromises a little on programmer productivity.

  • Thrust

Thrust is a library of parallel algorithms modelled on the C++ Standard Template Library (STL). Thrust offers a high-level interface, high performance and interoperability with CUDA C.

  • Hemi

Hemi simplifies writing CUDA C/C++ code. One can write parallel kernels that read like in-line loops in CPU code, and launch C++ lambda functions as GPU kernels; details like thread block size and grid size become optimisation details rather than requirements.

  • CUB

CUB stands for CUDA UnBound. CUB was designed explicitly for CUDA applications: it is a performance-oriented library, and because it targets CUDA directly, it offers somewhat more low-level control and flexibility than Thrust.

  • chag::pp – parallel primitives library

The authors of chag::pp claim the fastest implementations of stream compaction and prefix sum. The library provides implementations of reduction, prefix operations (scan), radix sort and compaction.

Best papers on CUDA

  • Efficient Parallel Scan Algorithms for Many-core GPUs: The paper shows how scan and segmented-scan algorithms can be implemented using a divide-and-conquer approach, which builds all scan primitives on top of a set of primitive intra-warp scan routines.
  • Multireduce and Multiscan on Modern GPUs: As the name suggests, the paper gives details about how one can implement Multireduce and Multiscan on the GPU. This is a master’s thesis by Marco Eilers.
  • Modern GPU: Modern GPU describes algorithms and strategies for making CUDA code as fast as possible; it also includes a library in which all the explained concepts are implemented.

NVIDIA’s own CUDA Programming Guide and Best Practices Guide are also essential references.

Online Courses for CUDA

  • CUDA programming Masterclass – Udemy: This course covers parallel programming on GPUs, from basic concepts to advanced algorithm implementations. It includes thread organisation constructs such as blockDim, blockIdx and gridDim, and also teaches unique-index calculation for a 2D grid, sum-array implementation, memory transfer between device and host, and querying device properties.
  • [Coursera] Heterogeneous Parallel Programming by Wen-mei W. Hwu (University of Illinois): This course introduces concepts, languages, techniques and patterns for programming heterogeneous and massively parallel processors. It covers heterogeneous computing architectures, data-parallel programming models, techniques for memory bandwidth management and parallel algorithm patterns.

Best books:

  • CUDA by Example by Jason Sanders and Edward Kandrot
  • Programming Massively Parallel Processors: A Hands-on Approach by David Kirk and Wen-mei Hwu
  • GPU Computing Gems, edited by Wen-mei W. Hwu
  • CUDA Application Design and Development by Rob Farber
  • Parallel Calculus in CUDA by Nicolas Hecquet
  • CUDA Cookbook by Bharatkumar Sharma and Jack Han

Presentations:

  • CUDA C/C++ Basics: This presentation is from NVIDIA Corporation. It explains the concepts of CUDA kernels, threads, thread blocks, thread synchronisation, memory management and shared memory.
  • Optimising Parallel Reduction in CUDA: This presentation shows, step by step, how a relatively simple reduction kernel can be rewritten for large performance gains.
  • Advanced CUDA – Optimising to Get 20x Performance: This presentation covers Tesla 10-series architecture details, with optimisation case studies for particle simulation, finite difference and molecular dynamics.
  • Better Performance at Lower Occupancy: A presentation showing how better performance can be achieved by assigning more parallel work to each thread and by exploiting instruction-level parallelism.
Sameer Balaganur

Sameer is an aspiring content writer. He occasionally writes poems, loves food and is head over heels for basketball.