Multi-Core Cache Sharing
Performance problems can arise in a multithreaded application due to contention, which occurs when different threads need access to the same (or merely nearby) locations in shared memory. When is this situation likely to arise? The answer depends somewhat on the processor's microarchitecture. The following discussion pertains to the Skylake processor model currently present on Stampede2, the Intel Xeon Platinum 8160; however, the same principles apply to any multi-core processor.
The memory subsystem of Intel Skylake (SKX for short) has three levels of cache. Each core has its own L1 and L2 caches, while the L3 cache, also called the Last Level Cache or LLC, is shared among cores. When a data item is fetched into L1 from main memory (or perhaps from LLC), it arrives in a 64-byte cache line. Obviously it is best for a core to be able to operate solely on cache lines that are found in the local L1 cache, or at least in the local L2. However, in a multithreaded application, it is possible for copies of the same cache line to be present in the L1 or L2 caches of multiple cores at the same time.
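As a point of reference, a 64-byte line holds eight 64-bit doubles, so a core that walks through an array with unit stride can miss at most once per eight elements. A minimal sketch in C (the array name and size are ours, chosen only for illustration):

```c
#include <stdio.h>

#define N 1024

int main(void) {
    static double a[N];  /* 8 KB: fits easily in a 32 KB L1d cache */

    /* Unit stride: each 64-byte cache line holds 8 doubles, so only
     * every 8th iteration can miss; the other 7 accesses hit the line
     * already fetched into L1. */
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}
```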
This leads to the important issue of cache coherence: what happens when one core modifies the value of an item in a shared cache line? The general answer for Skylake is that all copies of that particular cache line become invalid on the other cores that hold it. This is true even for cores on a different socket. So if that same cache line is needed later by another core, whatever may have been present in its local L1/L2 is no longer any good; the core might have to go all the way out to main memory to get a usable replacement. (Alternatively, it can "snoop" the caches of other cores to see whether a valid copy can be obtained from one of them.)
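As a minimal sketch of how such invalidations arise from ordinary code, consider many threads repeatedly updating a single shared variable. The OpenMP program below (the variable names are ours, and the loop count is arbitrary) keeps the line holding `counter` bouncing among the cores:

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    long counter = 0;  /* lives in one cache line, shared by all threads */

    #pragma omp parallel for
    for (long i = 0; i < 100000000L; i++) {
        /* Every atomic update writes the shared line, invalidating the
         * copies held in the other cores' L1/L2 caches; the line
         * ping-pongs from core to core instead of staying cache-resident. */
        #pragma omp atomic
        counter++;
    }

    printf("counter = %ld\n", counter);
    return 0;
}
```

Giving each thread its own private partial count (for example, via an OpenMP `reduction` clause) eliminates the write sharing, and the invalidation traffic disappears with it.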
As mentioned previously, these fetch operations become progressively more expensive the further one goes out in the memory hierarchy. A diagram in our earlier discussion of memory access times summarized the situation for SKX; the table below summarizes the same data in a different way.
Stampede2 "Skylake" Processors Intel Xeon Platinum 8160 2.1 GHz Cache sizes and metrics pertaining to 1 core |
---|
L1d size = 32 KB (4096 doubles) L2 size = 1 MB (32 x L1d size) L3 (shared) size = 33 MB |
Latency in FLOP units (where the peak rate is 16 FLOP/cycle for 64-bit doubles) |
L1 latency = 4 cycles = ~60 FLOP L2 latency = 14 cycles = ~220 FLOP (additional) L3 latency = 55 cycles = ~880 FLOP (additional) Memory latency = 190 cycles ~3000 FLOP (additional) |
Bandwidth-limited DAXPY performance with streaming accesses1 |
L1 DAXPY = ~31% of theoretical peak L2 DAXPY = ~22% of theoretical peak L3 DAXPY = ~11% of theoretical peak Local memory DAXPY = <4% of theoretical peak |
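For reference, DAXPY ("double-precision a times x plus y") is the simple streaming kernel behind the bandwidth rows above; a minimal C version (function and variable names are ours) looks like this:

```c
/* DAXPY: y = a*x + y. Each iteration performs 2 FLOPs (one multiply,
 * one add) but streams 24 bytes (two 8-byte reads and one 8-byte
 * write), so its speed is set by memory bandwidth, not arithmetic. */
void daxpy(long n, double a, const double *x, double *y) {
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```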
Thus every cache miss—including those that are due to a shared cache line being invalidated—represents a huge missed opportunity in terms of the floating-point operations (FLOPs) that could have been performed during the delay. This is particularly true if the only recourse is to fetch the cache line from main memory.
Incidentally, how is it that the cores are able to communicate with each other, so that they receive notification when a cache line is invalidated? The answer lies in the shared L3 cache or LLC. Each core holds a slice of the LLC, and these slices can communicate with each other and with main memory over the 2D mesh interconnect. The interconnect extends to the processor in the other socket via 3 Ultra Path Interconnect (UPI) links; together, these connections serve as the medium for carrying cache-related traffic between cores. Each core's L3 slice contains an inclusive directory that tracks all the cache lines stored in the cores' local caches. (The L3 cache itself is not inclusive; it may need to pull cache lines from L2 when requested.) If a cache line is not found in the shared L3 directory, then it is not in cache anywhere on the chip. On the other hand, if it is found in the L3 directory, then it is known which cores hold the cache line, and these cores become the targets for snooping.
Given the above picture of the Skylake processor, what can be done to reduce contention for shared cache lines in this multi-core system? One technique is to try to eliminate false sharing, which occurs when two or more threads require access to independent data items that happen to reside in the same cache line. Each load involving this contested cache line is likely to cause (at best) "snooping" on the caches of other cores to see whether a usable copy is available. But this scenario can be avoided. A concrete example may help to show how: imagine two integers held in the same static struct, each needed exclusively by one of two threads. If the two integers are separated by at least 64 bytes of unrelated data, they will necessarily land in different cache lines, and false sharing will be prevented; see the sketch below.
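A minimal sketch of this fix in C11, using explicit alignment rather than 64 bytes of filler to force the separation (the struct and field names are hypothetical):

```c
#include <stdalign.h>
#include <stdio.h>

/* With alignas(64), count0 and count1 each start on their own 64-byte
 * boundary, so they are guaranteed to occupy different cache lines.
 * A write by one thread therefore never invalidates the line that the
 * other thread is using. */
struct counters {
    alignas(64) int count0;   /* used exclusively by thread 0 */
    alignas(64) int count1;   /* used exclusively by thread 1 */
};

static struct counters c;

int main(void) {
    /* sizeof reports 128 bytes: two full cache lines instead of one. */
    printf("sizeof(struct counters) = %zu\n", sizeof(struct counters));
    c.count0 = 1;   /* touches only thread 0's line */
    c.count1 = 2;   /* touches only thread 1's line */
    return 0;
}
```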
A problem similar to false sharing is cache pollution. This occurs when data are loaded into cache on a core but are never used there, due to conditional expressions in the code or because the prefetcher requested data beyond what was actually required. Cache pollution fills up the cache unnecessarily and possibly increases contention. It is harder to control than false sharing; there is no easy way to disable hardware prefetching, for example.
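Data layout is one source of cache pollution that the programmer can control. In the hypothetical sketch below, a reduction that reads a single field of a large struct drags every other field through the cache as well; storing the hot field in its own array means every byte fetched is actually used (all names are ours):

```c
/* Array-of-structs: reading only .mass still loads each whole 64-byte
 * line, most of which is never used -- cache pollution. */
struct particle {
    double x, y, z;
    double vx, vy, vz;
    double mass, charge;   /* 8 doubles: the struct fills one cache line */
};

double total_mass_aos(const struct particle *p, long n) {
    double m = 0.0;
    for (long i = 0; i < n; i++)
        m += p[i].mass;    /* 8 useful bytes per 64-byte line fetched */
    return m;
}

/* Struct-of-arrays alternative: the mass values are contiguous, so
 * every byte brought into cache contributes to the result. */
double total_mass_soa(const double *mass, long n) {
    double m = 0.0;
    for (long i = 0; i < n; i++)
        m += mass[i];      /* all 64 bytes of each line are useful */
    return m;
}
```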
1. To reach peak rate, a DAXPY would need to read 2 vectors/cycle/VPU, or 4 vectors/cycle/core. Such a rate is impossible on Skylake; even theoretically, SKX can do no more than 2 vector loads/cycle/core from L1. One of us (SRL) ran tests of the actual, attainable L1 bandwidth on a Stampede2 SKX node using a stripped-down version of the STREAM Triad benchmark, whose kernel happens to be a DAXPY. To eliminate Turbo Boost effects, 48 copies of the test code were run simultaneously. Top bandwidth from L1 was found to be 105 GB/s, which is just 40% of the theoretical peak of 268 GB/s. Ratios presented in the table are somewhat lower because they are based on the measurements made by Hammond, Vaughan, and Hughes (2018), who found the top L1 bandwidth to be 84 GB/s, as noted previously.