Exercise: CUDA and OpenMP
In this exercise, the goal is to compile and run the CUDA and OpenMP codes presented earlier, in which the loop that contains most of the work is offloaded to the GPU. The instructions below should be valid for any system that is equipped with at least one NVIDIA GPU, plus appropriate installations of the NVIDIA HPC SDK and GCC. The steps have been verified to work on TACC's Vista. On a system like TACC's Frontera—which has only the NVIDIA CUDA Toolkit and lacks a special build of GCC—only Part I will work.
Part I: Compile and Run the CUDA Code
- Download the C++ code (or copy the program text from the linked page, or from the previous page):
- Compile the CUDA code with nvcc. Using this compiler requires the presence of g++ (or another compatible compiler) on the host in order to compile the host-side portion of the code. On Vista, both of these compilers are available in the default environment. On Frontera, it is necessary to load a module to make nvcc and its associated libraries available:
  Invoke the compiler to generate the executable:
- Start an interactive session in a GPU-equipped queue and run the executable:
Part II: Compile and Run the OpenMP Offloading Code
- Download the C++ code (or copy the program text from the linked page, or from the previous page):
- Compile the OpenMP code with nvc++. This distinct NVIDIA compiler is also present in the default environment on Vista. Then, compile the code a second time, omitting the -fopenmp flag so that the OpenMP directives are ignored. This produces a single-threaded code for the CPU:
- Start an interactive session in a GPU-equipped queue and run both of the executables. Confirm that they produce identical results:
- The OpenMP offloading code can also be compiled with g++. In this case, the default compiler on Vista will be inadequate, as it is not enabled to generate code for NVIDIA GPUs. Thus, an error will occur unless a module is loaded prior to compiling:
  Note that an extra option is required to specify that the offload target is an NVIDIA GPU. If the -foffload flag is missing, then by default, the target is the CPU. And if both -foffload and -fopenmp are absent, then the result is a single-threaded code for the CPU.
- Run the executable in your interactive session:
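The effect of dropping the -fopenmp flag can be seen in a small stand-alone sketch. This is not the exercise's program: the loop (a simple y = a*x + y update) and all function names below are hypothetical stand-ins, chosen only to show that one source file yields either an offloaded loop or a plain serial loop, depending on whether the compiler honors the directive. The standard _OPENMP macro reports which of the two builds is running.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for the exercise's main loop: y = a*x + y.
// When OpenMP is enabled, the directive asks the runtime to run the loop
// on the default target device (which may fall back to the host).
// When OpenMP is disabled, the pragma is ignored and the loop is serial.
void axpy_maybe_offloaded(float a, const float* x, float* y, int n) {
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Reference version with no directives at all.
void axpy_serial(float a, const float* x, float* y, int n) {
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Reports how this translation unit was compiled.
const char* build_mode() {
#ifdef _OPENMP
    return "OpenMP enabled";
#else
    return "OpenMP disabled (directives ignored)";
#endif
}

// Runs both versions on the same input and returns the largest
// absolute difference between their outputs.
float max_abs_diff(int n) {
    assert(n >= 0);
    std::vector<float> x(n), y1(n), y2(n);
    for (int i = 0; i < n; ++i) {
        x[i] = 0.5f * i;      // exactly representable for small i
        y1[i] = y2[i] = 1.0f;
    }
    axpy_maybe_offloaded(2.0f, x.data(), y1.data(), n);
    axpy_serial(2.0f, x.data(), y2.data(), n);
    float worst = 0.0f;
    for (int i = 0; i < n; ++i) {
        worst = std::max(worst, std::fabs(y1[i] - y2[i]));
    }
    return worst;
}

// Example driver (hypothetical), e.g. inside main():
//   std::printf("%s, max diff = %g\n", build_mode(), max_abs_diff(1 << 20));
```

The input values are chosen so that every intermediate result is exactly representable, which is why the two versions are expected to agree exactly here. For loops that involve reductions or less benign arithmetic, small floating-point differences between builds are possible, so a tolerance-based comparison may be more appropriate.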
Extra credit: recompile the OpenMP code without the -foffload option and confirm that it works on a machine that has no GPUs at all, e.g., on one of Vista's "gg" nodes.
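When testing on a node without GPUs, it can be useful to check where a target region actually ran. The following is a minimal sketch, assuming only the standard OpenMP runtime routines omp_get_num_devices and omp_is_initial_device; the wrapper function names are hypothetical and are not part of the exercise's code. The _OPENMP guards let the same file compile when OpenMP is disabled.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Number of non-host devices the OpenMP runtime can offload to.
// Returns 0 when the code is built without OpenMP.
int visible_offload_devices() {
#ifdef _OPENMP
    return omp_get_num_devices();
#else
    return 0;
#endif
}

// Returns true if a target region executes on the host, which is what
// happens on a GPU-less node or in a build without offloading support.
bool target_region_runs_on_host() {
    int on_host = 1;  // default answer when OpenMP is disabled
#ifdef _OPENMP
    #pragma omp target map(tofrom: on_host)
    {
        on_host = omp_is_initial_device();
    }
#endif
    return on_host != 0;
}

// One-line summary suitable for printing at program start.
const char* describe_run() {
    assert(visible_offload_devices() >= 0);  // sanity check on the runtime's report
    if (!target_region_runs_on_host()) {
        return "target regions run on an offload device";
    }
    return "target regions run on the host CPU";
}

// Example driver (hypothetical), e.g. inside main():
//   std::printf("%d device(s): %s\n", visible_offload_devices(), describe_run());
```

On a node with no GPUs, the expectation is that the device count is zero and the summary reports host execution, whichever set of flags was used to build the code.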
© Cornell University | Center for Advanced Computing
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)