Memory and Cache
How fast can Xeon Scalable Processor chips go? To achieve the nominal peak flop/s rate, every VPU has to execute a fused multiply-add (FMA) instruction on vectors of 16 floats (= 512 bits) on every cycle. The FMA is a pipelined instruction that can effectively double the flop/s rate, but only for operations that can be expressed as a multiplication followed by an addition. Under the big assumption that this can be arranged, we make the following calculation, using the pair of Skylake SKX processors in a Stampede3 node as an example:
\[{48\:\text{cores}} \times 2\:{\text{VPUs}\over \text{core}} \times 16\:{\text{floats}\over \text{VPU}} \times 2\:{\text{flops}\over \text{float}\:\cdot\:\text{cycle}} \times 2.1\:{\text{Gcycle}\over \text{s}} ≈ 6450\:{\text{Gflop} \over \text{s}}\]
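To make the FMA concrete, here is a minimal C sketch of the kind of loop a vectorizing compiler can map onto 512-bit FMA instructions. The array names, the size N, and the compiler flags are illustrative assumptions, not a prescription:

```c
/* A loop shaped for 512-bit FMAs. Compiling with, e.g.,
 *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512
 * and inspecting the assembly should show vfmadd instructions
 * operating on 16-float zmm registers. */
#define N 1048576   /* illustrative array length */

void fma_kernel(const float *restrict a, const float *restrict b,
                const float *restrict c, float *restrict d)
{
    for (int i = 0; i < N; i++)
        d[i] = a[i] * b[i] + c[i];   /* one multiply + one add = 2 flops per float */
}
```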
How fast is fast enough? The figure that really matters to floating-point performance is the number of vectors that can be processed in each cycle. So let's put the SKX memory bandwidth numbers into context by converting them to the vectors per cycle that can be moved into and out of RAM.
The unit of data transfer between memory and cache is the cache line. Conveniently, in SKX (as well as Cascade Lake CLX), the cache line size is the same as the vector width: 512 bits, or 64 bytes, which makes our calculation easy. On Stampede3, the peak memory bandwidth for DDR4 RAM is 128 GB/s/socket, or 256 GB/s/node (see footnote #1 for details). Therefore, the approximate number of vectors that can be moved to or from RAM in each cycle is:
\[ {256\:\text{GB}\over\text{s}}\:\times\:{\text{cache line}\over64\:\text{B}}\:\times\:{\text{s}\over2.1\:\text{Gcycle}}\:≈\:{1.9\:\text{vectors}\over\text{cycle}} \]
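One way to see how close real hardware comes to the quoted peak is a STREAM-triad-style measurement. The following C sketch estimates sustained bandwidth in GB/s; the array size, the OpenMP timing, and the simple traffic model (which ignores write-allocate traffic) are assumptions of the sketch, not a calibrated benchmark:

```c
/* Rough triad-style bandwidth probe.  Compile: gcc -O3 -fopenmp triad.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1 << 26)   /* 64 Mi floats per array: ~256 MB each, far larger than any cache */

int main(void)
{
    float *a = malloc(N * sizeof *a);
    float *b = malloc(N * sizeof *b);
    float *c = malloc(N * sizeof *c);

    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 0.0f; b[i] = 1.0f; c[i] = 2.0f; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0f * c[i];   /* triad: 2 loads + 1 store per element */
    double t1 = omp_get_wtime();

    /* 3 arrays x 4 bytes moved per element; printing a[0] keeps the loop live */
    double gbytes = 3.0 * N * sizeof(float) / 1e9;
    printf("triad bandwidth: %.1f GB/s (a[0] = %f)\n", gbytes / (t1 - t0), a[0]);

    free(a); free(b); free(c);
    return 0;
}
```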
Processing close to two vectors/cycle looks great, until we remember that there are 48 cores on the SKX nodes, each of which can do 2 vector multiplies and 2 vector adds in every cycle. To feed these vector units so that each is performing a computation such as d[i] = a[i]*b[i] + c[i] for vectors a, b, c, d, we would need to move 6 vector operands and 2 vector results between RAM and the cores on every cycle, for all 48 cores. Running at full capacity would therefore require a bandwidth of 8 × 48 = 384 vectors/cycle, compared to our available throughput of approximately 2 vectors/cycle.
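Spelling out where the 384 comes from: each core's two VPUs can retire two FMAs per cycle, and each FMA consumes three input vectors and produces one result:

\[ 48\:\text{cores} \times {2\:\text{FMAs}\over\text{cycle}\:\cdot\:\text{core}} \times {(3+1)\:\text{vectors}\over\text{FMA}} = {384\:\text{vectors}\over\text{cycle}} \]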
In practice, we don't need that much memory bandwidth. Some vectors of data will simply stay in registers while a computation is in progress. Most other vectors will be fetched from, or stored into, cache memory—the fast levels of memory that sit between the registers and RAM.
Thus cache is important too! Achieving the peak flop/s rate in SKX/CLX (as in any processor) depends critically on a code's ability to reuse vectors that are already close to the cores: in registers, in the L1d cache, or in the private L2 caches. Keeping the active data near the cores is what really makes the whole chip hum:
\[ \text{Local L1d cache:}\:\:\: {2\:\text{vector reads}\:+\:1\:\text{vector write}\over\text{cycle}\:\cdot\:\textbf{core}}\:=\:{3\:\text{vectors}\over\text{cycle}\:\cdot\:\textbf{core}} \]
\[ \text{Local L2 cache:}\:\:\: {1\:\text{vector read}\:+\:0.5\:\text{vector write}\over\text{cycle}\:\cdot\:\textbf{core}}\:=\:{1.5\:\text{vectors}\over\text{cycle}\:\cdot\:\textbf{core}} \]
The tradeoff is that these areas of memory are relatively small. The L1 data cache contains only 512 cache lines (or vectors) for a total of 32 KB per core for both SKX and CLX, yielding an aggregate of 1.5 MB for all 48 cores on a Stampede3 SKX node, or 1.8 MB on the 56 cores of a Frontera CLX node. The larger L2 cache engineered into the Xeon Scalable Processors (1 MB per core) results in 48 MB of L2 cache on each SKX node (24 MB per socket), and 56 MB on each CLX node (28 MB per socket). These numbers are also reproduced in our summary table on Xeon Scalable Processors at TACC.
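A classic way to exploit these small-but-fast caches is loop blocking (tiling). The C sketch below tiles a matrix multiplication so that three tiles together fit in a core's 1 MB L2 cache, letting each tile be fetched from RAM once and then reused many times. The tile size BS and matrix size N are illustrative assumptions, not tuned values:

```c
/* Cache-blocked matrix multiply.  Three BS x BS float tiles occupy
 * 3 * 256 * 256 * 4 B = 768 KB, which fits in the 1 MB per-core L2,
 * so each tile is loaded once and reused BS times before eviction.
 * Assumes C has been zero-initialized by the caller. */
#define N  2048
#define BS 256

void matmul_blocked(const float A[N][N], const float B[N][N], float C[N][N])
{
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                /* work on one BS x BS tile of each of A, B, and C */
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++)
                        for (int j = jj; j < jj + BS; j++)
                            C[i][j] += A[i][k] * B[k][j];  /* FMA-friendly inner loop */
}
```

The inner loop over j streams contiguous rows of B and C with A[i][k] held in a register, so it vectorizes naturally into the FMA form discussed above.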