Cornell Virtual Workshop

Quickstart: Run a CUDA Program

In this quickstart, we walk through the steps to convert a serial C program into a parallel CUDA program that runs on a GPU. We cover how to write a kernel function, allocate and transfer GPU memory, compile with NVIDIA's nvcc compiler, and submit a GPU job on Vista. Basic familiarity with C and Linux is enough to follow along; no prior experience with CUDA programming or GPUs is required. For a thorough understanding of the concepts introduced here, however, we highly recommend completing the Introduction to CUDA roadmap. This quickstart is an introduction to CUDA and is not meant to demonstrate the most effective way to offload computation to the GPU.

CUDA is a parallel computing platform and API created by NVIDIA that enables NVIDIA GPUs to serve as a platform for general-purpose computing. Workloads that can be split among many independent, parallel threads are the most suitable candidates for GPU execution, as described in "Is My Code Suitable for GPUs?"

Before you begin this quickstart, make sure you can log into Vista and navigate the file system (see Vista Quickstart: Log In and Submit a Slurm Job).

Vista provides CUDA through environment modules. Load the CUDA module to gain access to the nvcc compiler, typically with module load cuda (run module spider cuda to list the versions installed).

Verify that nvcc is available by running nvcc --version.

You should see the version of the CUDA compilation tools after running the command. We will use the NVIDIA nvcc compiler to compile CUDA code.

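Create a directory for this quickstart and change into it, for example with mkdir cuda_quickstart followed by cd cuda_quickstart (the directory name is arbitrary).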

CUDA programs are intended for data that can be processed in parallel. In properly implemented CUDA programs, data are partitioned into small chunks, each of which is processed by a unique thread on the device (GPU). If your code has a loop that constitutes the bulk of the computation and each iteration is independent, it is likely a good candidate for CUDA.

Alternatively, you can use our example: the following serial C program adds two arrays element by element. Its add function is the part worth parallelizing, because each loop iteration is completely independent of the others.

Copy this code into a file called add_serial.c and verify it works by compiling and running it on the login node, for example with gcc add_serial.c -o add_serial followed by ./add_serial.

You should see: array sum is correct. We will now convert this program to run on a GPU.

To run this computation on the GPU, we must convert the add function into a kernel function.

A kernel function is called from the CPU and executes on the GPU across many threads simultaneously. It must be declared with the __global__ specifier and have the return type void. Instead of looping over the array, each GPU thread computes its own index and processes one element.

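Each thread uses the built-in variables blockIdx.x, blockDim.x, and threadIdx.x to compute a unique global index, tid = blockIdx.x * blockDim.x + threadIdx.x, which tells the thread which element of the array it is responsible for. Because threads are launched in whole blocks, the total thread count can exceed the array length; the if statement prevents those extra threads from accessing elements beyond the end of the array.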
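To make the index calculation and the bounds check concrete, here is a small sketch that launches a deliberately tiny grid and has each thread report what it would do. The kernel name show_index and the sizes (6 elements, blocks of 4 threads) are hypothetical and chosen only so the output is short; the program needs a GPU, so on Vista it would have to run on a compute node.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical demo kernel: each thread prints the element it would process.
__global__ void show_index(int n) {
    // Global index: threads in earlier blocks come first.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid < n) {
        printf("block %u, thread %u -> element %d\n",
               blockIdx.x, threadIdx.x, tid);
    } else {
        // These threads exist only because blocks are launched whole.
        printf("block %u, thread %u -> idle (tid %d is out of range)\n",
               blockIdx.x, threadIdx.x, tid);
    }
}

int main(void) {
    const int n = 6;           // assumed tiny array length for the demo
    const int block_size = 4;  // assumed tiny block size for the demo
    const int grid_size = (n + block_size - 1) / block_size;  // 2 blocks

    show_index<<<grid_size, block_size>>>(n);

    // Wait for the kernel so that device-side printf output is flushed.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

With these sizes, 8 threads are launched for 6 elements, so the last two threads of the second block take the idle branch. The order in which blocks print is not guaranteed.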

Before the kernel can run, we need to allocate memory on both the CPU and the GPU. We use malloc() for CPU memory and cudaMalloc() for GPU memory. For clarity, we use the prefixes h_ and d_ to distinguish between CPU and GPU pointers:

Once the CPU arrays are initialized, we use cudaMemcpy() to copy the input data to the GPU (direction cudaMemcpyHostToDevice), launch the kernel, and then copy the result back to the CPU (direction cudaMemcpyDeviceToHost).

The <<<grid_size, block_size>>> syntax tells CUDA how many threads to launch on the GPU: grid_size blocks of block_size threads each. In this case, we use enough blocks of 256 threads each to cover all one million elements, and every thread runs the function device_add with the GPU pointers and the array size as arguments.

Once the kernel has completed and the result has been copied back, we release the CPU memory with free() and the GPU memory with cudaFree().

Save the complete program into a file called add_cuda.cu:
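The original listing is not reproduced here, so the program below is a sketch assembled from the steps described above, not the authors' exact code. The names device_add, grid_size, and block_size and the h_/d_ prefixes come from the text. The argument order, the array length N, the test values, the printed message, and the CUDA_CHECK error-handling macro are assumptions added for illustration.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 1000000  // assumed array length (one million elements)

// Hypothetical helper: stop with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Kernel: one thread per element, guarded against surplus threads.
__global__ void device_add(float *out, const float *a, const float *b, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        out[tid] = a[tid] + b[tid];
    }
}

int main(void) {
    size_t bytes = sizeof(float) * N;

    // Host (CPU) memory, prefixed h_.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    if (h_a == NULL || h_b == NULL || h_out == NULL) {
        fprintf(stderr, "host allocation failed\n");
        return 1;
    }
    for (int i = 0; i < N; i++) {
        h_a[i] = 1.0f;
        h_b[i] = 2.0f;
    }

    // Device (GPU) memory, prefixed d_.
    float *d_a, *d_b, *d_out;
    CUDA_CHECK(cudaMalloc((void **)&d_a, bytes));
    CUDA_CHECK(cudaMalloc((void **)&d_b, bytes));
    CUDA_CHECK(cudaMalloc((void **)&d_out, bytes));

    // Copy inputs to the GPU.
    CUDA_CHECK(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice));

    // Launch enough 256-thread blocks to cover all N elements.
    int block_size = 256;
    int grid_size = (N + block_size - 1) / block_size;
    device_add<<<grid_size, block_size>>>(d_out, d_a, d_b, N);
    CUDA_CHECK(cudaGetLastError());  // catches launch configuration errors

    // This blocking copy waits for the kernel to finish first.
    CUDA_CHECK(cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost));

    // Verify on the CPU: 1.0 + 2.0 is exactly 3.0 in single precision.
    int ok = 1;
    for (int i = 0; i < N; i++) {
        if (h_out[i] != 3.0f) {
            ok = 0;
            break;
        }
    }
    puts(ok ? "array sum is correct" : "array sum is NOT correct");

    // Release GPU memory, then CPU memory.
    CUDA_CHECK(cudaFree(d_a));
    CUDA_CHECK(cudaFree(d_b));
    CUDA_CHECK(cudaFree(d_out));
    free(h_a);
    free(h_b);
    free(h_out);
    return ok ? 0 : 1;
}
```

Wrapping every runtime call in a check costs little, and it turns a silent failure (for example, running where no GPU is present) into a clear message. The check after the launch is needed because a kernel launch itself returns no status.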

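Compile the program on the login node with nvcc, for example with nvcc add_cuda.cu -o add_cuda (the executable name add_cuda is an arbitrary choice).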

Since no GPU device is available on the login nodes, you must submit the program as a job to run on the GPU compute nodes. Create a batch script called add_cuda.sh with the following content:

Replace Your Allocation Name

Replace myproject in the #SBATCH -A line with your actual project allocation name. If you only have one allocation on Vista, this line may be omitted.

Submit the job with sbatch add_cuda.sh.

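You can check the status of your job with squeue -u $USER. Once the job completes, view the output file named by the #SBATCH -o line of the batch script (if no such line is present, Slurm writes to slurm-<jobid>.out in the submission directory).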

You should see the following text in the output:
