Appendix: Finer Memory Slices
The table in the main text lists the per-SM (GPU) and per-core (CPU) capacities of the different memory levels. It is perhaps fairer, though, to compare the slice of each memory type available to a single CUDA core in a GPU with the slice available to a single vector lane in a CPU. We again take the NVIDIA Tesla V100 and a couple of contemporary Intel Xeon server-grade processors as examples.
- Register file. In the NVIDIA Tesla V100, the register file of an SM stores (65536)/(2x32) = 1024 floats per CUDA core. In the Intel Cascade Lake and Skylake chips, a CPU core has a physical register file that stores 168 vector registers of 64 B each, or 10.5 kB in total. This works out to (16x168)/(2x16) = 84 floats per vector lane, an order of magnitude less than what is available to a CUDA core. (The factors of 2 appear because a CPU core can handle 2x16-float vectors/cycle, and a GPU SM can handle 2x32-float warps/cycle.)
- Cache sizes. In the NVIDIA Tesla V100, an SM has 128 kB (max) in its L1 data cache, and 64 kB in its constant cache. Adding these to its share of the 6 MB shared L2 (6/80 = 0.075 MB) yields 0.27 MB per SM, or 0.0042 MB per CUDA core. In the Intel Cascade Lake and Skylake chips, a CPU core has 32 kB in its L1 plus 1 MB in its L2 data cache. Adding these to its share of the shared L3 cache (1.375 MB) yields 2.4 MB, or 0.075 MB per vector lane, which is well over an order of magnitude more than what is available to a CUDA core.
- Cache lines. In NVIDIA GPUs, the 128-byte cache lines consist of four 32-byte sectors. In the event of a cache miss, not all 4 sectors in the cache line have to be filled, just the ones that need updating. This means that 4 sectors from 4 different cache lines can be fetched just as readily as 1 full cache line. The 32-byte sector size persists today, even though Volta's HBM2 memory has a 64-byte interface. By contrast, in an Intel Cascade Lake or Skylake processor, data always travel all the way from RAM to registers in full cache lines of 64 bytes.
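The per-lane arithmetic in the bullets above can be collected into a small runnable sketch. This is only back-of-the-envelope bookkeeping using the V100 and Skylake/Cascade Lake figures quoted in the text; the variable names are illustrative, not from any vendor API.

```python
# Per-lane memory slices, using the device figures quoted above.

# --- GPU: NVIDIA Tesla V100 ---
gpu_regs_per_sm = 65536            # 32-bit registers in one SM's register file
gpu_lanes_per_sm = 2 * 32          # an SM handles 2 x 32-float warps per cycle
gpu_regs_per_core = gpu_regs_per_sm / gpu_lanes_per_sm       # floats per CUDA core

gpu_cache_per_sm_mb = 0.128 + 0.064 + 6 / 80   # L1 (max) + constant + share of 6 MB L2
gpu_cache_per_core_mb = gpu_cache_per_sm_mb / gpu_lanes_per_sm

# --- CPU: Intel Skylake / Cascade Lake server core ---
cpu_vec_regs = 168                 # physical 64-byte vector registers
cpu_lanes_per_core = 2 * 16        # a core handles 2 x 16-float vectors per cycle
cpu_regs_per_lane = (16 * cpu_vec_regs) / cpu_lanes_per_core  # floats per lane

cpu_cache_per_core_mb = 0.032 + 1.0 + 1.375    # L1d + L2 + share of shared L3
cpu_cache_per_lane_mb = cpu_cache_per_core_mb / cpu_lanes_per_core

print(f"registers: {gpu_regs_per_core:.0f} floats/CUDA core vs "
      f"{cpu_regs_per_lane:.0f} floats/vector lane")
print(f"cache:     {gpu_cache_per_core_mb:.4f} MB/CUDA core vs "
      f"{cpu_cache_per_lane_mb:.4f} MB/vector lane")
```

Running this reproduces the ratios in the text: the GPU gives each lane far more registers, while the CPU gives each lane far more cache.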