Unified Memory
Unified Memory, also called Unified Virtual Memory or UVM, was introduced in CUDA 6.0, and devices of compute capability 6.0 (Pascal) and later support its full on-demand paging mechanism. Unified Memory allows both the host and the device to access the same data without explicit transfers between host and device memory. However, this does not mean the data resides in both the host and the device memory at the same time: pages of Unified Memory are migrated on demand, in either direction.
In the Unified Memory model, data may reside in either host or device memory. When the host or the device attempts to access data that is not in its own memory space, a page fault occurs. This page fault triggers the automatic migration of the page of memory containing that data to the memory of the faulting processor. The page size is set by the host operating system and is typically 4 KB, so nearby data is included in the migration too, which may reduce the number of subsequent page faults and the associated overhead.
From the programmer's perspective, it looks as if both the host and device "share" the memory space. The main advantage of Unified Memory is that it makes memory management extremely easy. It is also more flexible, in that pages of memory that are not needed on the device are not transferred. However, there are situations in which better overall performance may be achieved through explicit memory management.
Here is an example that allocates an array of N integers with Unified Memory:
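A minimal sketch using cudaMallocManaged, the standard allocation call for Unified Memory (the increment kernel is an illustrative placeholder, and error checking is omitted for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: adds 1 to each element.
__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int N = 1 << 20;
    int *data;  // one pointer, valid on both host and device
    cudaMallocManaged(&data, N * sizeof(int));

    for (int i = 0; i < N; i++) data[i] = i;       // host writes; pages reside on host

    increment<<<(N + 255) / 256, 256>>>(data, N);  // pages migrate to device on fault
    cudaDeviceSynchronize();                       // wait before host touches data again

    printf("data[0] = %d\n", data[0]);             // pages migrate back to host on fault
    cudaFree(data);
    return 0;
}
```

Note that no cudaMemcpy calls appear anywhere: the migrations happen automatically when the host loop, the kernel, and the final printf each touch the data.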
Notice that the same pointer works on both the host and the device. In older versions of CUDA, host code and device code used different pointers for the same data, and the host code had to retrieve the device pointer through a separate function call before passing it to a kernel. With Unified Virtual Addressing (UVA), this mapping is handled automatically, so a single pointer suffices. Unified Memory (UVM) takes UVA a step further by also automating the memory transfers via paging.
Technical notes for TACC systems
On some TACC systems, the page size is larger than 4 KB. If you execute getconf PAGESIZE on one of Vista's GH200 nodes, you will find that they use 64 KB pages, while Frontera's RTX 5000 nodes use the standard 4 KB pages. The larger page size on Vista implies fewer page faults, but at the cost of bigger page transfers. However, the GH200 Superchip favors the bigger page size due to the fast NVLink-C2C interconnect between its CPU and GPU. The GH200 gains a further advantage from NVIDIA's Unified Memory driver, which combines page transfers through "density prefetching"; as a result, the host's page size is effectively the minimum transfer size. You can also trigger large, multi-page prefetches manually by calling cudaMemPrefetchAsync().
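As a sketch of how such a manual prefetch might look (the scale kernel is an illustrative placeholder; error checking is omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: doubles each element in place.
__global__ void scale(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;
    float *a;
    cudaMallocManaged(&a, N * sizeof(float));
    for (int i = 0; i < N; i++) a[i] = 1.0f;

    int device = 0;
    cudaGetDevice(&device);

    // Prefetch the whole array to the GPU up front, instead of letting
    // the kernel migrate it one page fault at a time.
    cudaMemPrefetchAsync(a, N * sizeof(float), device, 0);
    scale<<<(N + 255) / 256, 256>>>(a, N);

    // Prefetch the results back to host memory before the CPU reads them.
    cudaMemPrefetchAsync(a, N * sizeof(float), cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();

    printf("a[0] = %f\n", a[0]);
    cudaFree(a);
    return 0;
}
```

The special device ID cudaCpuDeviceId designates host memory as the prefetch destination; both prefetch calls are asynchronous, which is why the host waits on cudaDeviceSynchronize() before reading the results.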
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)