How to Move Data to the GPU
The host-device model of GPU computing means that data must somehow be transferred from the CPU to the GPU before computations can begin, and the results must be transferred back once computations are complete. Efficient GPU programming, therefore, involves minimizing the data transfer overhead and carefully managing the interaction between CPU and GPU to maximize performance.
Here are two complete, but simple, code examples showing how data might be transferred between the CPU and GPU using CUDA and OpenMP. The transfers are essential to enable the computations from the previous page, after the input arrays are initialized.
1. Using CUDA (NVIDIA GPUs)
In CUDA, the host initiates memory allocations on both the host and the device for any data that must be made available to both. The example below shows how the memory allocations are done, along with the necessary data transfers between the host and device. In the example, the host performs each allocation and transfer separately and explicitly.
CUDA supplies various functions that the host will use to affect the device: functions for allocating memory on the device (lines 15–17), as well as functions for copying the input data from host to device (lines 19–20) and the output data from device to host (line 24). The host is even responsible for freeing the device memory at the end of the program (line 28). Prototypes for these functions are contained in cuda_runtime.h (line 2).
Interestingly, the CUDA kernel only learns about the device-side data that it will operate on when it receives the device-side pointers from the host in the kernel call to add().
Full Example: Vector Addition with CUDA
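The full listing is not reproduced here; the following is a minimal sketch of what it might look like, assuming a fixed array length N, integer elements, and placeholder initialization values. The sketch is laid out so that the include of cuda_runtime.h, the device allocations, the host-to-device and device-to-host copies, and the cleanup fall on the lines cited above.

#include <stdio.h>
#include <cuda_runtime.h>

#define N 1024

// Kernel: each thread adds one pair of elements
__global__ void add(int *a, int *b, int *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) c[i] = a[i] + b[i];
}

int main() {
    int a[N], b[N], c[N], *d_a, *d_b, *d_c;
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }          // initialize inputs on the host
    cudaMalloc(&d_a, N * sizeof(int));                               // allocate device memory
    cudaMalloc(&d_b, N * sizeof(int));
    cudaMalloc(&d_c, N * sizeof(int));

    cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);     // copy inputs to the device
    cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c);                    // launch kernel with device pointers

    cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);     // copy result back to the host

    printf("c[0] = %d, c[N-1] = %d\n", c[0], c[N - 1]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);                     // free device memory
    return 0;
}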
CUDA is also able to manage data transfers automatically through the use of CUDA Unified Memory. With this method, equivalent memory areas are allocated on the CPU and the GPU, and a single address space is assigned to both. As a consequence, a page fault on one side triggers a data transfer from the other side, and explicit data transfers become unnecessary. The API for this technique is not shown; if it were, the above example would feature calls such as cudaMallocManaged(&a, N * sizeof(int)).
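As a rough illustration of that variant (a sketch, not part of the original example, under the same assumed problem size and initialization as above), vector addition with Unified Memory might look like this:

#include <stdio.h>
#include <cuda_runtime.h>

#define N 1024

__global__ void add(int *a, int *b, int *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) c[i] = a[i] + b[i];
}

int main() {
    int *a, *b, *c;
    cudaMallocManaged(&a, N * sizeof(int));    // one allocation per array, visible to host and device
    cudaMallocManaged(&b, N * sizeof(int));
    cudaMallocManaged(&c, N * sizeof(int));

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }   // host writes; no explicit copies needed

    add<<<(N + 255) / 256, 256>>>(a, b, c);    // pages migrate to the device on demand
    cudaDeviceSynchronize();                   // wait for the kernel before the host reads c

    printf("c[0] = %d, c[N-1] = %d\n", c[0], c[N - 1]);

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}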
2. Using OpenMP Target Offloading (CPU/GPU)
As mentioned earlier, one or more map clauses can be appended to the OpenMP target directive to inform the compiler about the data movement that is required to offload a parallel code region to the GPU. The use of map is not mandatory, but the code that is produced is likely to be more efficient if the compiler is made aware of the intended directionality and duration of data transfers.
Line 9 below already contains everything that the compiler needs to know about the data movement associated with offloading the parallel for loop to the target device.
Full Example: Vector Addition with OpenMP Offloading
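Again, the full listing is not reproduced here; the following minimal sketch, under the same assumptions as the CUDA example (fixed length N, integer arrays, placeholder initialization), places the combined target directive and its map clauses on line 9, as described above.

#include <stdio.h>

#define N 1024

int main() {
    int a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }   // initialize inputs on the host

    // map the inputs to the device, map the result back, and offload the loop
    #pragma omp target teams distribute parallel for map(to: a, b) map(from: c)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[0] = %d, c[N-1] = %d\n", c[0], c[N - 1]);
    return 0;
}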
Data can also be preloaded to the GPU using the more flexible #pragma omp target data construct. This directive, which applies to a structured block of code, allows the transferred data to persist over a region larger than just a single loop, as sketched below. Other, related constructs in OpenMP, such as target enter data, target exit data, and target update, allow you to refine how and when data will move between the host and the GPU.
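For illustration (a sketch under the same assumptions as the examples above, not part of the original material), a target data region might keep the arrays resident on the device across two offloaded loops:

#include <stdio.h>

#define N 1024

int main() {
    int a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    // a and b stay on the device for the duration of the structured block;
    // c is copied back to the host only when the region ends.
    #pragma omp target data map(to: a, b) map(from: c)
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        #pragma omp target teams distribute parallel for
        for (int i = 0; i < N; i++)
            c[i] = 2 * c[i];            // reuses the device copy of c; no intermediate transfer
    }

    printf("c[0] = %d\n", c[0]);
    return 0;
}

Because a and b are mapped once at the start of the region and c is copied back only when the region ends, no data moves between the host and the device between the two loops.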
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)