Quickstart: GPU Offloading with OpenMP

In this quickstart, we will take a familiar serial C program, parallelize it across CPU cores using OpenMP, then offload the same loop to a GPU using OpenMP's target directives. We will cover the parallel for directive and the target teams distribute parallel for directive, compiling with nvc, and submitting a GPU job on Vista.

Basic familiarity with C and Linux is sufficient, but for a better understanding of this quickstart, we recommend completing the OpenMP roadmap first. Note that this quickstart serves as an introduction to OpenMP offloading, and the code shown should be treated as an example, not as an optimal implementation.

OpenMP is a directive-based Application Programming Interface (API) for developing parallel programs on shared memory architectures. When #pragma omp directives are inserted into C, C++, or Fortran code, the compiler inserts calls that spawn threads to run code regions in parallel, or possibly offload execution to an accelerator such as a GPU. Loops with independent iterations are the most suitable candidates for OpenMP, as described in "Is My Code Suitable for GPUs?"

Before you begin, make sure you can log into Vista, navigate the file system, and submit Slurm jobs (see Vista Quickstart: Log In and Submit a Slurm Job).

Vista includes the NVIDIA HPC SDK as a default module, providing access to the NVIDIA compilers nvc, nvc++, and nvfortran, all of which support OpenMP and GPU offload. If you need a newer version of the NVIDIA HPC SDK, you can check which versions are available:

For this quickstart, create a directory:

OpenMP is most commonly used in code with large loops whose iterations are independent of one another, so the iterations can be divided among threads that run at the same time. The following program is a classic example: a serial C program that computes the dot product of two arrays.

You might notice that each iteration of the loop adds to the same variable sum, so the iterations are not completely independent. This pattern is called a reduction. OpenMP provides built-in support for it, which we use in the next section.

Copy this code into dot_serial.c and verify it works by compiling and running it on the login node:

You should see: dot product result is correct. Next, we will parallelize this program, first across CPU cores, then on the GPU.

When we add the OpenMP parallel for directive on the line above the loop, the OpenMP runtime spawns a team of threads and divides the loop iterations among them, so that they can run on different CPU cores. We also add the reduction clause, because otherwise every thread would update the shared variable sum at the same time, which creates a race condition. The reduction clause gives each thread a private copy of sum initialized to 0 and combines the private copies at the end of the loop.

Copy the modified program into dot_omp.c and compile with the -mp flag, which enables OpenMP in nvc:

Run this program on the login node, setting the environment variable OMP_NUM_THREADS to control how many threads are spawned:

Again, you should see dot product result is correct. Next, we will offload the loop to the GPU.

To run the loop on a GPU, we must replace the parallel for directive with the target teams distribute parallel for directive. This combined directive moves code execution to the device, creates teams of threads on the GPU, and distributes the loop iterations across them.

We also tell the compiler how to move data between the host and the device. The map clause specifies which arrays are inputs (copied to the device before the loop), outputs (copied back after the loop), or both. Here, x and y are only read inside the loop, so we use map(to: ...) to copy them to the device without copying them back.

This piece of code is highly portable. If the compiler does not support OpenMP, or if OpenMP is not enabled at compile time, the directives are ignored and the code compiles as plain serial C.

Save the complete offload program into dot_offload.c:

Compile with -mp=gpu to enable OpenMP GPU offload, adding -gpu=cc90 to specialize compilation for the H200 GPU present on the compute node.

Login nodes do not have GPUs, so we must either start an interactive session on a GPU node or submit a batch job that runs on a GPU node. Create a batch script called dot_offload.sh:

With export OMP_TARGET_OFFLOAD=MANDATORY, the program fails if the GPU is unavailable instead of silently falling back to the CPU. This is a simple way to check whether offloading to the GPU actually happened.

Replace the Allocation Name

Replace myproject in the #SBATCH -A line with your project allocation name. If you only have one allocation on Vista, this line may be omitted.

Submit the job:

Check the status of your job with squeue -u $USER. Once the job completes, view the output file:

You should see:

© Cornell University, Center for Advanced Computing
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)