Last level cache (LLC) refers to the highest-numbered cache accessed by the cores before fetching from memory. Usually, the LLC is shared by all the cores. In Skylake and Cascade Lake, the LLC is the shared L3 cache. At 1.375 MB per core, the total size of the L3 is substantial: 33 MB per socket for a 24-core SKX, and 38.5 MB per socket for a 28-core CLX.

As noted earlier, the re-architected Xeon SP cache hierarchy is intended to support a broad variety of data-intensive computations, while reducing L3 activity. How is this reduction achieved? In previous Xeon generations, the shared L3 cache was inclusive, meaning that it kept copies of all the cache lines that had been fetched into the L1 and L2 caches of all the cores. But in Skylake and Cascade Lake, the L3 is non-inclusive: it primarily holds cache lines that have been evicted from the cores' L2 caches.

Even so, the re-architected L3 plays a role in supporting efficient data movement to the registers. Since the L3 cache is non-inclusive, it largely contains data not present in any L2. This effectively expands the per-socket total cache capacity to the sum of L2 and L3: 57 MB for a 24-core SKX, and 66.5 MB for a 28-core CLX (more than double the L2 capacity alone).
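These capacity figures follow directly from the per-core sizes. A quick check of the arithmetic (assuming the 1 MB L2 per core that these Xeon SP parts provide, along with the 1.375 MB L3 slice per core cited above):

```python
# Per-core cache sizes (in MB) for Skylake (SKX) and Cascade Lake (CLX).
L2_PER_CORE = 1.0     # private L2, 1 MB per core
L3_PER_CORE = 1.375   # shared L3 slice per core

def total_cache(cores):
    """Return (total L3, combined L2 + L3) per socket, in MB."""
    l3 = L3_PER_CORE * cores
    return l3, L2_PER_CORE * cores + l3

print(total_cache(24))  # 24-core SKX -> (33.0, 57.0)
print(total_cache(28))  # 28-core CLX -> (38.5, 66.5)
```

With an inclusive L3, by contrast, the combined capacity would be just the L3 total, since every L2 line would be duplicated there.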

The tradeoff for this broadened cache capacity is a more complex search strategy for cached data. In previous Intel processors, cache lines that were not found in the inclusive L3 could not be present in any lower-numbered cache, either. Thus, an L3 miss would automatically trigger a fetch from memory, and no further search was needed. But in Skylake and Cascade Lake, a separate snoop filter must keep a catalog of all the cache lines that are present in the cores' L2 caches, as well as the L3, in order to avoid needless fetches from main memory.
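The role of the snoop filter can be illustrated with a deliberately simplified model. This is only a conceptual sketch, not the actual hardware protocol: it tracks which cores' L2 caches may hold a given line, so that a miss in the non-inclusive L3 need not imply a trip to main memory.

```python
# Conceptual sketch of a snoop filter: a directory mapping each cached
# line address to the set of cores whose L2 may hold it. (The real
# hardware structure and protocol are far more elaborate.)

class SnoopFilter:
    def __init__(self):
        self.owners = {}  # line address -> set of core ids holding it in L2

    def record_fill(self, line, core):
        """A core's L2 has fetched this line."""
        self.owners.setdefault(line, set()).add(core)

    def record_evict(self, line, core):
        """A core's L2 has evicted this line."""
        cores = self.owners.get(line, set())
        cores.discard(core)
        if not cores:
            self.owners.pop(line, None)

    def lookup(self, line):
        """Cores whose L2 may hold the line; empty set -> fetch from memory."""
        return self.owners.get(line, set())

sf = SnoopFilter()
sf.record_fill(0x1000, core=3)
print(sf.lookup(0x1000))  # {3}: the line can be snooped from core 3's L2
print(sf.lookup(0x2000))  # set(): not cached anywhere, must go to DRAM
```

An inclusive L3 gets this bookkeeping for free, since its own tags already cover every line cached below it; the non-inclusive design must maintain the directory separately.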

But Intel chips are not the only ones featuring shared LLCs. AMD processors have them as well. The difference in AMD's Zen microarchitectures (Milan, Rome, etc.) is that a processor is composed of multiple chiplets, and as a consequence, multiple shared L3's coexist within a single processor. Each L3 is shared by a subgroup of cores called a core complex or CCX; the CCX size was 4 in Rome, and it became 8 in Milan. Thus, a 64-core Rome processor has 16 CCX's of 4 cores each, while a newer 64-core Milan processor has 8 CCX's of 8 cores each.
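The chiplet arithmetic above is simple enough to state as a formula: the number of shared L3 caches in a Zen processor is the core count divided by the CCX size.

```python
# Chiplet topology arithmetic for AMD Zen, using the figures in the text:
# each CCX shares one L3, so a processor has (cores // ccx_size) L3 caches.
def num_l3_caches(cores, ccx_size):
    return cores // ccx_size

print(num_l3_caches(64, 4))  # 64-core Rome:  16 CCXs, hence 16 L3 caches
print(num_l3_caches(64, 8))  # 64-core Milan:  8 CCXs, hence  8 L3 caches
```

Doubling the CCX size in Milan thus halves the number of distinct L3 caches, giving each core a larger directly shared L3 neighborhood.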

However, AMD's federated approach to L3 is not as different from Intel's as it seems. Intel's L3 cache is not really a single entity; instead, each core "owns" a slice of the L3, and sharing is accomplished via inter-core communication. Furthermore, in both cases, the shared L3 must effectively extend to other processors in the same node.

Thus, in both architectures, any given core must be able to learn the status of data that might be found in other cores' L2 and L3 caches, not just in its own socket, but also across sockets. This introduces a type of nonuniformity in the speed of access to L3, which begins at the CCX level for AMD's Zen, and at the socket level for Intel's Xeon SP.

In any case, the last level cache or LLC (i.e., the last one consulted before fetching from RAM) is in general distributed within and among processors in a node, implying that some kind of scheme must be in place to check the availability of a cache line in one of the distributed locations.

©   Cornell University  |  Center for Advanced Computing  |  Copyright Statement  |  Inclusivity Statement