The organization of memory in a GPU largely resembles a CPU's—but there are significant differences as well. This is particularly true of the total capacities available at each level of the memory hierarchy. As mentioned previously, the GPU is characterized by very large register files, while the CPU is much more cache-heavy and has generally wider data paths.

The table below compares the memory hierarchies of the NVIDIA GPUs in TACC's Longhorn system and the Intel Xeon CPUs in TACC's Frontera and Stampede2 systems. In each row, the bold entry marks the type of device that has the larger capacity per computational unit at that level.

Available memory per SM or core at each level in the memory hierarchy of the NVIDIA Tesla V100 vs. Intel Xeon SP processors.

| Memory Type | NVIDIA Tesla V100 (per SM) | Intel Cascade Lake SP, Skylake SP (per core) |
|---|---|---|
| Register file | **256 kB** | 10.5 kB |
| L1 cache | **128 kB (max)** | 32 kB |
| Constant caches | **64 kB** | N/A |
| L2 cache | 0.075 MB | **1 MB** |
| L3 cache | N/A | **1.375 MB** |
| Global or RAM | 0.4 GB | **>3.4 GB** |

Since the table compares what is available to a single GPU SM to the equivalent for a single CPU core, the processor-wide memory levels are treated as though they are distributed evenly among all the SMs or cores. A more precise comparison might break down these computational units even further and look at the sizes of the memory slices that pertain to a GPU CUDA core vs. a CPU vector lane. For those interested, this approach is pursued in the Appendix.
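As a concrete illustration of how the shared levels are divided (using figures implied by the table rather than stated in the text): the V100's 6 MB of shared L2 cache, split among its 80 SMs, comes to 6 MB / 80 ≈ 0.075 MB per SM, and a 32 GB HBM2 card split the same way gives 32 GB / 80 = 0.4 GB per SM. On the CPU side, the private caches are already per-core quantities, while the RAM entry divides the memory attached to a socket among its cores; for example, if a Cascade Lake socket has 96 GB of DDR4, then 96 GB / 28 cores is roughly 3.4 GB per core.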

The memory levels do not match up exactly because in the Xeon processors, every core has a private L2 cache plus a slice of the shared L3, while in the V100 GPU, every SM has special constant caches but no cache levels beyond its slice of the shared L2. Generally, though, the table shows that the GPU concentrates more capacity per computational unit at the levels closest to the compute, while the CPU has an evident size advantage as one moves further out in the hierarchy.

In considering memory speed, the roles of GPUs and CPUs become reversed. Latency and bandwidth are two relevant measures of memory speed. The next table shows that the Intel Xeon CPUs feature superior latencies for the caches that are closest to the cores, while the Tesla V100 has an edge in bandwidth at the more distant, global memory level.

Latency and available bandwidth per SM or core at each level in the memory hierarchy of the NVIDIA Tesla V100 vs. Intel Xeon SP processors.

| Memory Type | V100¹ latency (per SM) | V100¹ bandwidth (per SM) | Xeon SP² latency (per core) | Xeon SP² bandwidth (per core) |
|---|---|---|---|---|
| L1 cache | 28 cycles | 128 B/cycle | 4 cycles | 192 B/cycle |
| Private L2 cache | N/A | N/A | 14 cycles | 64 B/cycle |
| Shared L2 or L3 | 193 cycles | 17.6 B/cycle | 50–70 cycles | 14.3 B/cycle |
| Global or RAM | 220–350 cycles | 7.4 B/cycle | 190–220 cycles | 1.9–2.5 B/cycle |

Latencies are not stated for registers because latency is dependent on the particular instruction being executed, not on data movement. Likewise, bandwidth is not an appropriate metric for registers; instead, one speaks of instruction throughput, which may reach or exceed one per cycle (if the pipelines of operands are kept full).

Despite the impressive bandwidth of the GPU's global memory, reads and writes issued by individual threads still incur high latency. The SM's shared memory and L1 cache can be used to avoid direct interactions with DRAM, to an extent. But in GPU programming, the best way to avoid the high latency penalty associated with global memory is to launch very large numbers of threads. That way, whenever one warp is stalled waiting for data, at least one other warp is able to grab its next instruction from the instruction buffer and go. This technique is known as latency hiding.
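Below is a minimal sketch of what latency hiding looks like in practice: a simple, hypothetical SAXPY-style kernel (not taken from the text above) launched with far more threads than the GPU has CUDA cores, so that each SM holds many resident warps and the warp schedulers always have work ready to issue.

```cuda
// Illustrative sketch: oversubscribe the SMs so that stalled warps can be hidden.
// Kernel and variable names are hypothetical; error checking is omitted.
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    // Grid-stride loop: correct for any n and any launch configuration.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 24;                       // 16M elements
    float *x, *y;
    cudaMalloc((void **)&x, n * sizeof(float));
    cudaMalloc((void **)&y, n * sizeof(float));
    // ... initialize x and y on the device ...

    // 65,536 blocks of 256 threads: far more warps than a V100's 80 SMs can
    // run at once, so whenever a warp stalls on a global-memory load, the
    // scheduler can switch to another warp that is ready to issue.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```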

Effectively, there is one additional layer of memory that ought to be considered for a GPU: the memory of the host. The need to transfer data between the host and the GPU can place a heavy constraint on the GPU's overall performance—particularly if the GPU is attached to the host via PCIe. Ideally, once the necessary data are transferred to the GPU, they remain there for computations as much as possible.
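As a rough illustration of that principle (a sketch with hypothetical kernel names, not code from this text), the pattern below pays the PCIe transfer cost once in each direction and reuses the device-resident data across many kernel launches in between.

```cuda
// Sketch: copy data to the GPU once, compute on it repeatedly, copy back once.
// Kernel names (scale, offset) and sizes are hypothetical; error checking omitted.
#include <cuda_runtime.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 0.5f;
}

__global__ void offset(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

void run_pipeline(const float *h_in, float *h_out, int n)
{
    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    // One host-to-device transfer up front...
    cudaMemcpy(d_buf, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // ...then keep the data resident on the GPU for as many kernels as possible.
    int threads = 256, blocks = (n + threads - 1) / threads;
    for (int iter = 0; iter < 100; ++iter) {
        scale<<<blocks, threads>>>(d_buf, n);
        offset<<<blocks, threads>>>(d_buf, n);
    }

    // One device-to-host transfer at the end.
    cudaMemcpy(h_out, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
}
```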

Since the latency cost for communicating data to and from the GPU is relatively high, data should be sent in big chunks or batches. Furthermore, different batches may be sent over separate streams, so that computation can begin on some batches while others are in transit. CUDA provides techniques for using streams to overlap computation with communication.
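The sketch below shows one common pattern, under the assumption of a 1D array split into equal batches (the stream count, kernel name, and sizes are illustrative): each CUDA stream asynchronously copies its batch to the device, processes it, and copies the result back, so transfers in one stream can overlap with kernels running in the others. Pinned (page-locked) host memory is needed for cudaMemcpyAsync to overlap with computation.

```cuda
// Sketch: overlapping host-device transfers with computation using CUDA streams.
// Names (NSTREAMS, process) are illustrative; error checking is omitted.
#include <cuda_runtime.h>

__global__ void process(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 24;
    const int NSTREAMS = 4;
    const int chunk = n / NSTREAMS;              // assume n divides evenly

    float *h_data, *d_data;
    cudaMallocHost((void **)&h_data, n * sizeof(float));   // pinned host buffer
    cudaMalloc((void **)&d_data, n * sizeof(float));
    // ... fill h_data on the host ...

    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamCreate(&streams[s]);

    // Each stream copies its batch in, processes it, and copies it back.
    // Work queued in different streams may execute concurrently, so the
    // copy engines and the SMs can stay busy at the same time.
    for (int s = 0; s < NSTREAMS; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d_data + off, h_data + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + off, chunk);
        cudaMemcpyAsync(h_data + off, d_data + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < NSTREAMS; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```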

In the next section, we will continue to dig into the unique properties of the specific NVIDIA GPU devices that are incorporated into Frontera and contemporary HPC systems.


1. Reference for Tesla V100 data: Z. Jia et al., Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking, Tables 3.1, 3.2, and 3.4, and Fig. 3.12. Theoretical peak rates are shown above; if the peak rate is unknown, the measured value is shown instead. Device-wide bandwidths for the V100, such as 2155 GB/s for the L2 cache, are converted to B/cycle/SM by dividing by 1.53 GHz and 80 SMs. (Note: the L1 "upper bound" rate in Table 3.2 of the reference is incorrect and should be computed by the formula in the accompanying text, i.e., LSUs*bytes.)

2. The memory hierarchies of Cascade Lake and Skylake are identical up to RAM, according to WikiChip. Published data puts the minimum RAM latency of Cascade Lake at 81 nsec and Skylake at 90 nsec; these values are converted to cycles by multiplying by 2.7 GHz and 2.1 GHz, respectively. The L3 bandwidth is taken to be the 30 GB/s asymptote seen in Fig. 3 of Hammond, Vaughn, and Hughes (2018), normalized by 2.1 GHz for their Xeon 8160 (same model as in Stampede2); the result should apply to Cascade Lake as well. The RAM bandwidths assume that all 6 channels are populated with the maximum 2933 MT/s DDR4 for Cascade Lake or 2666 MT/s DDR4 for Skylake; the global results are then divided by the number of cores per chip, either 28 for Cascade Lake or 24 for Skylake, and normalized by the same frequencies as above (to match Frontera and Stampede2, respectively).
