Exercise: Memory
In this exercise, several CUDA memory allocation methods and concepts are demonstrated. It should be noted that none of these demonstrations should be considered an optimal implementation of the example program, and the performance of these methods on your program may vary. If performance is a concern, please refer to the CUDA performance topics introduced later in the roadmap.
For simplicity, we will use the example of adding two arrays. The serial C code below will be parallelized with CUDA to run on a GPU:
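A minimal serial version consistent with this description might look like the following (the array length, the initialization values, and the final check are illustrative assumptions):

    #include <stdio.h>
    #include <stdlib.h>
    #include <assert.h>

    #define N 10000    /* array length; an arbitrary choice for this sketch */

    /* Add two arrays element by element */
    void add(int *a, int *b, int *c, int n)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        int *a = (int *)malloc(N * sizeof(int));
        int *b = (int *)malloc(N * sizeof(int));
        int *c = (int *)malloc(N * sizeof(int));

        /* Initialize the inputs with simple placeholder values */
        for (int i = 0; i < N; i++) {
            a[i] = i;
            b[i] = 2 * i;
        }

        add(a, b, c, N);

        /* Verify the result */
        for (int i = 0; i < N; i++)
            assert(c[i] == (a[i] + b[i]));
        printf("Success!\n");

        free(a); free(b); free(c);
        return 0;
    }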
Instructions
The above code can be run on a CPU after compiling it with any C compiler, including NVIDIA's nvcc, which simply treats it as "host code". General instructions for how to run codes in batch on an HPC system were presented previously in Submit a Job to Slurm. But the first step is to copy the entire source into your clipboard and paste it into a file on the system you intend to use. For example, to run this code interactively on TACC Frontera (or a similar HPC system), you would enter these commands:
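(The commands below are a plausible sketch only; the file name is an assumption, and the details of obtaining an interactive session vary by site.)

    nano serial_add.c                  # paste the serial code into this file and save it
    idev                               # request an interactive compute node (TACC's idev utility)
    nvcc -o serial_add serial_add.c    # nvcc compiles the .c file as ordinary host code
    ./serial_add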
The goal of this exercise is to convert the add() function into a kernel function that runs on the GPU, then try out different ways of transferring the data into device memory prior to calling the function, and transferring the results back to the host afterward. Each step assumes that you are using the code from the previous step (or a copy of it) as the starting point.
1. Kernel function
First, we modify the function add() to be a kernel function. Instead of a for loop, we let each thread calculate its own index in the array from blockIdx.x, blockDim.x, and threadIdx.x. Note that blockDim.x is the size of a block, i.e., the number of threads it contains.
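A minimal sketch of the converted kernel (keeping the same argument list as the serial version; the names chosen here are assumptions):

    /* Kernel: each thread computes one element of the sum */
    __global__ void add(int *a, int *b, int *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)              /* guard against threads past the end of the arrays */
            c[i] = a[i] + b[i];
    }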
2. Allocate memory on the device
Next, we allocate memory on the device with the API function cudaMalloc(). It takes the address of an array pointer and a size in bytes as its arguments. For the sake of clarity, we will rename some variables to distinguish between host pointers and device pointers.
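Using the d_ prefix for device pointers (and N for the array length, as in the kernel launch later), the allocations might look like this:

    int *d_a, *d_b, *d_c;                         /* device pointers */
    cudaMalloc((void **)&d_a, N * sizeof(int));
    cudaMalloc((void **)&d_b, N * sizeof(int));
    cudaMalloc((void **)&d_c, N * sizeof(int));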
3. Copy data from host to device
After initialization, we copy data from host to device with cudaMemcpy(). The parameters of cudaMemcpy() are the following: the destination pointer, the source pointer, the size of the data, and the direction of transfer. In this case, the direction of transfer from h_X (host) to d_X (device) is set by the predefined value cudaMemcpyHostToDevice.
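For the two input arrays, the copies would look something like this:

    cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, N * sizeof(int), cudaMemcpyHostToDevice);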
4. Call the kernel function
Next, we call the kernel function to perform addition on the arrays. To accommodate arrays of any length, we set the number of threads per block to be 256 and the number of blocks to be N/256 + 1. When the kernel function is called, the arguments include pointers to the allocated memory on the device. This is how the kernel function knows where to find the arrays on the device.
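With the execution configuration described above, the launch might read:

    add<<<N/256 + 1, 256>>>(d_a, d_b, d_c, N);   /* N/256 + 1 blocks of 256 threads each */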
5. Copy data from device to host
After d_c is calculated by the kernel function, we copy the output back to the host. The function cudaMemcpy() is used again, but with the direction reversed, as indicated by the predefined value cudaMemcpyDeviceToHost.
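For example:

    cudaMemcpy(h_c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);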
6. Free memory on both host and device
The final step is to free the memory on both the host and the device. The function for the device is cudaFree(), which takes as its argument the pointer to the memory to be freed.
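For example:

    free(h_a); free(h_b); free(h_c);               /* host memory */
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);   /* device memory */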
The full code is below.
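One way to assemble the preceding steps into a complete program is sketched below; the array length, the initialization values, and the final correctness check are illustrative assumptions.

    #include <stdio.h>
    #include <stdlib.h>
    #include <assert.h>

    #define N 10000    /* array length; an arbitrary choice for this sketch */

    /* Kernel: each thread computes one element of the sum */
    __global__ void add(int *a, int *b, int *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        /* Allocate and initialize host memory */
        int *h_a = (int *)malloc(N * sizeof(int));
        int *h_b = (int *)malloc(N * sizeof(int));
        int *h_c = (int *)malloc(N * sizeof(int));
        for (int i = 0; i < N; i++) {
            h_a[i] = i;
            h_b[i] = 2 * i;
        }

        /* Allocate device memory */
        int *d_a, *d_b, *d_c;
        cudaMalloc((void **)&d_a, N * sizeof(int));
        cudaMalloc((void **)&d_b, N * sizeof(int));
        cudaMalloc((void **)&d_c, N * sizeof(int));

        /* Copy the inputs from host to device */
        cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, N * sizeof(int), cudaMemcpyHostToDevice);

        /* Launch the kernel: 256 threads per block, N/256 + 1 blocks */
        add<<<N/256 + 1, 256>>>(d_a, d_b, d_c, N);

        /* Copy the result from device to host (this call also waits for the kernel) */
        cudaMemcpy(h_c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);

        /* Verify the result */
        for (int i = 0; i < N; i++)
            assert(h_c[i] == (h_a[i] + h_b[i]));
        printf("Success!\n");

        /* Free memory on both host and device */
        free(h_a); free(h_b); free(h_c);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }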
Call this code "mem_manage.cu", compile it with nvcc, and run it on a GPU node to verify that it works.
Pinned Memory
To use pinned memory, host memory needs to be allocated with cudaMallocHost() instead of malloc() and freed with cudaFreeHost() instead of free(). The modifications to the code are shown below. Try compiling the modified code with nvcc and running it on a GPU node.
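A sketch of the affected lines in main() (the rest of the program is unchanged):

    /* Pinned (page-locked) host memory in place of malloc() */
    int *h_a, *h_b, *h_c;
    cudaMallocHost((void **)&h_a, N * sizeof(int));
    cudaMallocHost((void **)&h_b, N * sizeof(int));
    cudaMallocHost((void **)&h_c, N * sizeof(int));

    /* ... initialization, copies, kernel launch, and result check as before ... */

    /* Pinned host memory is released with cudaFreeHost() */
    cudaFreeHost(h_a); cudaFreeHost(h_b); cudaFreeHost(h_c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);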
Shared Memory
The idea of shared memory is to have a shared workspace among the threads in a block. Threads may communicate with each other through shared memory. In this example, though, we simply make every thread in a block store the output of d_a + d_b temporarily in shared memory before assigning it to d_c. Since the number of threads in a block is fixed at 256, we can statically allocate an integer array of length 256 in shared memory. Each block executes independently and gets its own private allocation of shared memory, so there is no risk of one block overwriting the shared memory of another block.
The most straightforward approach would be to put __shared__ int shared[256] in the kernel function. However, to ensure that a change in block size does not break the code, we use a macro to define BLOCK_SIZE, which is then used to declare the shared memory array. Note that __shared__ int shared[blockDim.x] does not work, because the NVIDIA compiler nvcc regards blockDim.x as a variable, and a statically declared shared array must have a size that is known at compile time.
In this next example, a new kernel function is added to the code, and changes are made to the way the kernel gets called, as shown below. Try making these modifications, compiling the code with nvcc, and running it on a GPU node.
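A sketch of such a kernel, together with the matching launch, might look like the following (the kernel name add_shared is an assumption; the other names are carried over from the earlier sketches):

    #define BLOCK_SIZE 256

    /* Kernel that stages each sum in shared memory before writing it to global memory */
    __global__ void add_shared(int *a, int *b, int *c, int n)
    {
        __shared__ int shared[BLOCK_SIZE];       /* one slot per thread in the block */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            shared[threadIdx.x] = a[i] + b[i];   /* store the sum in shared memory */
            c[i] = shared[threadIdx.x];          /* then copy it out to global memory */
        }
    }

    /* ... in main(), launch the new kernel ... */
    add_shared<<<N/BLOCK_SIZE + 1, BLOCK_SIZE>>>(d_a, d_b, d_c, N);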
Unified Memory
To use unified memory, memory needs to be allocated with cudaMallocManaged() and freed with cudaFree(). The host and device use the same pointer to access a given array, but all the array pointers must still be passed to the device as arguments to the kernel function. Note that there is no need to call anything to transfer data between the host and the device; data are transferred from one to the other on demand, as needed. But as we will see, you may not get the data you expect: explicit synchronization between host and device may be required!
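A sketch of main() with managed memory, deliberately without any synchronization yet (the initialization values and the final check are assumptions carried over from the earlier sketches):

    int main(void)
    {
        /* One set of pointers, usable on both the host and the device */
        int *a, *b, *c;
        cudaMallocManaged((void **)&a, N * sizeof(int));
        cudaMallocManaged((void **)&b, N * sizeof(int));
        cudaMallocManaged((void **)&c, N * sizeof(int));

        for (int i = 0; i < N; i++) {
            a[i] = i;
            b[i] = 2 * i;
        }

        /* No cudaMemcpy() calls: data migrate between host and device on demand */
        add<<<N/256 + 1, 256>>>(a, b, c, N);

        /* The host checks the result immediately, with no synchronization yet! */
        for (int i = 0; i < N; i++)
            assert(c[i] == (a[i] + b[i]));
        printf("Success!\n");

        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }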
Make the modifications to main() as indicated, then try compiling the code with nvcc and running it on a GPU node. However, don't expect it to work correctly, just yet...
If this code compiled successfully, the result at runtime should be that the assertion of c[i] == (a[i] + b[i]) fails. This is because the host function proceeded to assert the correctness of c[i] before the device finished calculating. Hence, the correctness of the program depends on the host synchronizing with the device, or in other words, the host should wait for the device to finish.
We will use the function cudaDeviceSynchronize() to ensure that the host waits for the device before checking the result. (This function will be introduced in detail in the next topic.) We could instead use the sleep() function for this purpose, but there would be no way of knowing for certain whether we chose the delay to be long enough.
For the final step: add a synchronizing function call to main() as shown, then try compiling the code with nvcc and running it on a GPU node.
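The relevant portion of the managed-memory main() would then look something like this:

    add<<<N/256 + 1, 256>>>(a, b, c, N);

    cudaDeviceSynchronize();    /* wait for the kernel to finish before the host reads c */

    for (int i = 0; i < N; i++)
        assert(c[i] == (a[i] + b[i]));
    printf("Success!\n");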
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)