Memory management refers to how memory space is allocated and how data is transferred between the host and the device. Managing memory on a CUDA device is similar to how it is done in CPU programming. CUDA makes the host code responsible for managing the data that must be made available to both the host and the device. The host code will need to:

  • allocate memory space on the GPU device and the host,
  • initiate the transfer of the data from the host to the device,
  • call the kernel functions to compute on the data,
  • initiate the transfer of the output from device memory back to the host once the kernel finishes, and
  • finally free the allocated memory.

In what follows, we present some of the basic functions in the CUDA API that pertain to these steps of memory management.

1. Memory Declaration and Allocation

In memory allocation, the host needs to allocate memory on both the host and the device. Typically, malloc() calls are used to allocate host memory space, followed by any necessary data initialization of that memory. Then, using the CUDA API function cudaMalloc(), memory space is allocated on the GPU device. The parameters of cudaMalloc() are just a pointer and the size, in bytes, of the memory space to be allocated on the device.

Below, h_t1 is the host memory array and d_t1 is the device memory array.
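The snippet below is a minimal sketch: the array length N, the integer datatype, and the initialization loop are illustrative choices, not requirements of the API.

    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define N 1024    /* illustrative array length */

    int main(void)
    {
        int *h_t1;    /* host array   */
        int *d_t1;    /* device array */

        /* malloc() returns the starting address of the host allocation */
        h_t1 = (int *) malloc(N * sizeof(int));
        for (int i = 0; i < N; i++)
            h_t1[i] = i;              /* initialize the host data */

        /* cudaMalloc() returns an error code (ignored here); the device
           address is passed back through the (void **) argument         */
        cudaMalloc((void **) &d_t1, N * sizeof(int));

        /* ... transfers, kernel launch, and cleanup follow ... */

        return 0;
    }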

Notice that malloc() returns the starting address of the allocated memory, which the caller then assigns to a pointer variable. In contrast, cudaMalloc() returns an error code (ignored here), and the starting address is passed back via an argument. (Specifically, it is passed by dereferencing the pointer to the array pointer in the caller. The type of the pointer in this example is cast as void** instead of int**; while doing such a cast is not strictly necessary, it is a common practice to ensure that cudaMalloc() works for any datatype on the device.)

Pinned Memory Allocation

Normally, any allocated memory is pageable: it can be swapped between physical memory and the disk. When available memory is low, pageable memory is swapped out to disk, giving programs the illusion of having more memory than what is physically available. Accordingly, when a GPU device attempts to transfer data stored in pageable host memory to the device, the system first checks whether the memory has been paged out and, if necessary, signals the CPU to swap it back in. This process incurs extra CPU overhead.

To improve data transfer speed between host and device, one can pin the host memory, which prevents it from being swapped out to disk. This allows the system to use Direct Memory Access (DMA) transfers between the host and device. Pinning memory is encouraged, but it should not be overused, because pinning very large allocations reduces the memory available for paging and can slow down the overall system. Here is an example of using cudaMallocHost(), which not only allocates memory on the host but also pins it:
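A minimal sketch, again assuming an integer array of N elements; only the allocation call differs from the pageable case above.

    #include <cuda_runtime.h>

    #define N 1024    /* illustrative array length */

    int main(void)
    {
        int *h_t1;    /* host array, to be placed in pinned memory */

        /* cudaMallocHost() allocates page-locked (pinned) host memory,
           which the system can transfer to the device via DMA          */
        cudaMallocHost((void **) &h_t1, N * sizeof(int));

        for (int i = 0; i < N; i++)
            h_t1[i] = i;              /* initialize the host data */

        /* ... transfers, kernel launch, and cleanup follow ... */

        return 0;
    }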

If ordinary, non-pinned memory is allocated on the host using just malloc(), the system will typically copy data from that memory into a pinned buffer prior to transferring it to the device. This incurs additional CPU overhead, compared to storing the original data in pinned memory.

2. Memory Transfers Before and After a Kernel Call

Once memory space is allocated and populated with data, the data is transferred from the host memory to the global memory of the device using the API function cudaMemcpy(). The parameters of cudaMemcpy() are the following: the destination pointer, the source pointer, the size of the copy in bytes, and the direction of transfer. CUDA uses the following keywords to indicate the direction:

  • cudaMemcpyHostToHost
  • cudaMemcpyHostToDevice
  • cudaMemcpyDeviceToHost
  • cudaMemcpyDeviceToDevice

Here is an example showing memory transfers from the host to the device and back again. Notice that between the transfers, the kernel function thread_multi() receives the address of the data on the device through the argument d_t1:
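The sketch below puts the whole sequence together. The body of thread_multi() (each thread doubling one element) and the launch configuration are illustrative assumptions, not taken from the text.

    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define N 1024    /* illustrative array length */

    /* Illustrative kernel: each thread multiplies one element by 2 */
    __global__ void thread_multi(int *t1)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            t1[i] *= 2;
    }

    int main(void)
    {
        int *h_t1, *d_t1;

        h_t1 = (int *) malloc(N * sizeof(int));
        for (int i = 0; i < N; i++)
            h_t1[i] = i;
        cudaMalloc((void **) &d_t1, N * sizeof(int));

        /* copy the input from host memory to device global memory */
        cudaMemcpy(d_t1, h_t1, N * sizeof(int), cudaMemcpyHostToDevice);

        /* launch the kernel on the device copy of the data */
        thread_multi<<<(N + 255) / 256, 256>>>(d_t1);

        /* copy the result back; cudaMemcpy blocks the host until
           the preceding kernel and the copy have completed        */
        cudaMemcpy(h_t1, d_t1, N * sizeof(int), cudaMemcpyDeviceToHost);

        /* ... deallocation follows ... */

        return 0;
    }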

An important detail about cudaMemcpy() is that it blocks the host from continuing code execution until the device finishes making its copy. We will see how this property is crucial in the exercise.

3. Memory Deallocation

Once the kernel completes its execution and the data is copied back from the device, we need to deallocate memory using the API functions free(), cudaFree(), and cudaFreeHost(). Notice that device memory allocated with cudaMalloc() must be freed using cudaFree(), while pinned host memory allocated with cudaMallocHost() must be freed using cudaFreeHost(), as follows:
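A minimal sketch; h_t1 and d_t1 are the arrays from the earlier examples, and h_pin is an illustrative name for a pinned host array.

    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define N 1024    /* illustrative array length */

    int main(void)
    {
        int *h_t1, *d_t1, *h_pin;

        h_t1 = (int *) malloc(N * sizeof(int));            /* pageable host memory */
        cudaMalloc((void **) &d_t1, N * sizeof(int));      /* device memory        */
        cudaMallocHost((void **) &h_pin, N * sizeof(int)); /* pinned host memory   */

        /* ... transfers and kernel launches ... */

        free(h_t1);            /* malloc()         -> free()         */
        cudaFree(d_t1);        /* cudaMalloc()     -> cudaFree()     */
        cudaFreeHost(h_pin);   /* cudaMallocHost() -> cudaFreeHost() */

        return 0;
    }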


The above aspects of memory management will be highlighted in the exercise.

 
© Cornell University | Center for Advanced Computing
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)