The previous topic introduced basic CUDA concepts, covering the fundamental idea of GPGPU: the parallelization of computational tasks on GPUs. Indeed, when processing large volumes of data in parallel, a GPU can speed up execution time by orders of magnitude compared to a CPU. However, simple parallelization of tasks is just the tip of the iceberg in CUDA programming; using memory well matters just as much. That is why popular CUDA libraries are heavily optimized to use all the resources the GPU offers efficiently, especially the GPU memory.

In a typical CPU architecture, the L1 cache is faster than the L2 cache, and L3 is the slowest cache, though still much faster than main memory. Similarly, a GPU has several levels of cache memory to reduce read/write latency. An access that hits the L1 cache, especially when the data are laid out sequentially, takes only a few clock cycles. An access that must go to the GPU's L2 cache, or all the way out to global memory, can take hundreds of clock cycles. On the other hand, global memory is vastly larger than the caches, so the sizes of these different memory components need to be considered along with their speeds.
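
This hierarchy is visible directly in CUDA source code, because where a variable is declared determines which level of memory holds it. Below is a minimal sketch; the kernel name, block size, and block-reversal task are illustrative assumptions, not material from this topic. Ordinary local variables occupy registers, a __shared__ array occupies fast on-chip memory, and the pointer arguments reference off-chip global memory.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256   // threads per block; also the size of the shared tile

// Each declaration lands in a different level of the hierarchy:
// the index i lives in a register, the tile lives in fast on-chip
// shared memory visible to the whole block, and the in/out pointers
// reference large, off-chip global memory.
__global__ void reverse_within_block(const float *in, float *out)
{
    __shared__ float tile[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];                    // one global read per thread
    __syncthreads();                              // fill the tile before reuse
    out[i] = tile[blockDim.x - 1 - threadIdx.x];  // fast shared-memory read
}

int main(void)
{
    const int n = 4 * BLOCK;                      // assumed a multiple of BLOCK
    float h[4 * BLOCK];
    for (int i = 0; i < n; i++) h[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h, n * sizeof(float), cudaMemcpyHostToDevice);
    reverse_within_block<<<n / BLOCK, BLOCK>>>(d_in, d_out);
    cudaMemcpy(h, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %.0f (in[255] landed in out[0])\n", h[0]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Each block pays for a single pass of slow global-memory reads, then reads its reversed values out of fast shared memory.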

To cover the major memory-related features of CUDA, this topic is divided into three parts.

  • The first part covers CUDA's view of the basic memory architecture, elucidating the differences among the various memory components.
  • The second part offers basic step-by-step instructions for memory management, focusing on how to transfer data between host and device. It also introduces CUDA's unified memory, an alternative in which on-demand data transfers are initiated automatically (see the sketch after this list). Using one of these techniques is essential for CUDA programming, since you need to get your data onto the device in the first place!
  • The final part is an exercise that provides examples demonstrating how to manage the various types of GPU memory, including the on-chip shared memory. This cache-like memory is significantly faster than the device's off-chip global memory (its DRAM), and it offers far more capacity than the registers available to any single thread.
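
As a preview of the second part, here is a minimal unified-memory sketch; the kernel, array size, and launch configuration are illustrative assumptions rather than code from this topic. A single cudaMallocManaged allocation is visible to both host and device, so the explicit cudaMemcpy calls of the previous sketch are unnecessary; the CUDA runtime migrates the data on demand.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Double each element of x in place on the device.
__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));  // one pointer for host AND device

    for (int i = 0; i < n; i++) x[i] = 1.0f;   // written by the host...
    scale<<<(n + 255) / 256, 256>>>(x, n);     // ...migrated to the device on demand
    cudaDeviceSynchronize();                   // wait before the host reads results

    printf("x[0] = %.1f\n", x[0]);             // prints 2.0; no cudaMemcpy needed
    cudaFree(x);
    return 0;
}
```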
 