How to Send Work to the GPU
In GPU computing, all operations begin on the CPU. It acts as the host and orchestrates the overall execution of a program. The CPU is responsible for preparing data, allocating memory, and launching tasks on the GPU, which serves as a specialized co-processor. Although the GPU performs the heavy lifting for parallel computations, it cannot operate independently—it must be directed by the CPU.
The general term for designating the tasks that are to be performed by the GPU (or other device) is offloading. This is a powerful technique for accelerating data-parallel tasks on GPUs in high-performance computing. Unlike CPUs, which are more oriented toward sequential processing, GPUs consist of thousands of smaller cores designed for handling multiple operations simultaneously. This makes them good candidates for subtasks such as matrix operations, image processing, and deep learning.
Developers can offload computations to the GPU using technologies like CUDA for NVIDIA hardware, or the OpenMP target construct for broader hardware support. By leveraging the GPU, applications can achieve significant speedups, especially when processing large datasets or performing complex numerical simulations.
Here are some C++ code snippets illustrating how to offload work to the GPU using CUDA and OpenMP, two of the major techniques for defining the tasks to be performed by the GPU. These are not by any means the only methods that are available; a sampling of other techniques is covered in the companion topic, GPU Portability Solutions.
1. Using CUDA (NVIDIA GPUs)
CUDA is the most direct and powerful way to offload work to NVIDIA GPUs. Any task to be executed on the GPU is defined in a separate function decorated with the __global__ specifier (in most cases). Such a function is termed a kernel. Any call to this function must include an "execution configuration", written between triple angle brackets (<<<...>>>), giving the number of blocks and threads that will execute the kernel on the GPU. Further details about CUDA syntax are provided in the Introduction to CUDA roadmap.
In the example below, note the add() kernel and the call to it in main(). Within add(), each thread uses its unique indices to determine which element of the overall array it is responsible for computing. The kernel call splits the threads into blocks of 256; add() expects this, so it calculates the array index from the block index and the thread index within the block. (Note: since CUDA enforces a cap on the number of threads in a block, it is this indexing across multiple blocks that allows the kernel to handle very large arrays.)
Example: Vector Addition with CUDA
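A minimal sketch of such a program appears below, assuming a kernel named add() launched in blocks of 256 threads, as in the discussion above; the array size, the use of unified (managed) memory, and the omission of error checking are illustrative simplifications, not the only way to write it.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Compile with: nvcc vector_add.cu -o vector_add

// Kernel: each thread computes one element of the sum y = x + y.
__global__ void add(int n, const float *x, float *y)
{
    // Global index built from the block index and the thread index within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard against overrun in the final block
        y[i] = x[i] + y[i];
}

int main()
{
    int N = 1 << 20;            // 1M elements (illustrative size)
    float *x, *y;

    // Unified (managed) memory is accessible from both CPU and GPU
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    // Execution configuration: enough blocks of 256 threads to cover all N elements
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, x, y);
    cudaDeviceSynchronize();    // wait for the GPU to finish before using results

    printf("y[0] = %f\n", y[0]);    // expect 3.000000
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```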
CUDA syntax is an extension to standard C++. Therefore, CUDA code must be compiled with NVIDIA's proprietary compiler, nvcc.
2. Using OpenMP Target Offloading (CPU/GPU)
OpenMP 4.5+ supports GPU offloading with the target directive. It modifies a standard parallel for directive so that the subsequent loop executes on a target device. The precise target is specified at compile time; if no target is specified, the code executes on the CPU. This stands in sharp contrast to CUDA code, which executes only on NVIDIA GPUs.
Normally, a parallel for loop would be divided into several chunks to be executed by different threads on a CPU. But on a GPU target, where groups of threads must operate in SIMT fashion, every iteration is sent to a different thread. The OpenMP target directive also accepts various clauses that can be used to guide execution and data movement.
The OpenMP 4.5+ offloading standard is supported by most major compiler families, including GCC (10+), Clang, and Intel oneAPI. You need to make sure that your compiler was built with support for OpenMP offloading to your particular GPU. For example, the nvc++ compiler supports OpenMP offloading, but only for NVIDIA GPUs. (Note: nvc and nvc++ are not the same as nvcc, the CUDA compiler for C/C++; they should not be confused!)
The example below illustrates the straightforward process of using OpenMP to offload a GPU-friendly portion of a C++ code (the loop annotated with the target construct) to a GPU target.
Example: Vector Addition with OpenMP Offloading
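A minimal sketch of this example follows. The target teams distribute parallel for combination and the map clauses shown are one common way to express the offload; the compile lines in the comments are typical invocations, though the exact flags depend on how the compiler was built.

```cpp
#include <cstdio>
#include <vector>

// Typical compile lines (assuming an offload-capable compiler build):
//   g++ -fopenmp -foffload=nvptx-none vecadd.cpp -o vecadd
//   clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda vecadd.cpp -o vecadd
// If no target is specified, the loop simply runs on the host CPU.

int main()
{
    const int N = 1 << 20;                      // illustrative array size
    std::vector<float> x(N, 1.0f), y(N, 2.0f);
    float *px = x.data();
    float *py = y.data();

    // The target construct sends the loop to the device; the map clauses
    // copy x to the device and copy y both to and from it.
    #pragma omp target teams distribute parallel for \
                map(to: px[0:N]) map(tofrom: py[0:N])
    for (int i = 0; i < N; i++)
        py[i] += px[i];                         // one iteration per device thread

    printf("y[0] = %f\n", py[0]);               // expect 3.000000
    return 0;
}
```

The teams distribute clauses are commonly added for GPU targets so that iterations are spread across the device's thread teams as well as the threads within each team.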
The OpenMP map clauses help to move data to and from the device efficiently. CUDA can move data more efficiently, but as we will see, it requires some extra syntax to control the data movement; this syntax is addressed on the next page.
Summary of Features: CUDA and OpenMP
The following table summarizes the main points from the above discussion.
| Feature | CUDA | OpenMP Target Offloading |
|---|---|---|
| Hardware | NVIDIA GPUs | NVIDIA, AMD, Intel GPUs (limited) |
| Control | Fine-grained | High-level, easier to write |
| Compiler | nvcc | g++, clang++, icpx, nvc++ |
| Portability | NVIDIA only | More portable (if supported) |