Over time, the Intel Xeon line, like other advanced processor families, has increased its core counts with each generation, allowing for greater on-chip parallelism. Within Intel's server-class line of Xeon Scalable Processors, the first-generation "Skylake" and second-generation "Cascade Lake" processors are broadly similar in architecture, with the latter providing slightly higher core counts, clock speeds, and memory speeds.

In this tutorial, we will focus primarily on the high-end Intel Xeon Platinum chips—in particular, the Intel Xeon Platinum 8160 Skylake processors that largely power the Stampede3 supercomputer at TACC, and the Intel Xeon Platinum 8280 Cascade Lake processors that power Frontera. Specifications for the 8160 and 8280 models installed at TACC are summarized on the next page. Note that Skylake is often abbreviated in TACC's documentation as "SKX", and Cascade Lake as "CLX". We will use those abbreviations at times throughout this tutorial.

Key features introduced into Xeon Scalable Processors

Compared to previous processors in the Intel Xeon line, the Scalable Processors introduced several key new features:

  • 512-bit vector registers and the associated AVX-512 instruction set
  • a re-architected L2 & L3 cache hierarchy
  • a mesh on-chip interconnect for improved performance and scalability

Expanding on each of these in turn:

The 512-bit vector registers and AVX-512 instruction set, which allow for greater SIMD parallelism, were also introduced into the Xeon Phi line, in particular into the KNL chips previously installed in Stampede2 (the predecessor to Stampede3) at TACC. See our companion material on vector registers and instruction sets for a more detailed discussion of AVX-512 and its relation to different generations of Intel hardware.
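
As a brief illustration of the kind of SIMD parallelism these registers enable, the C sketch below adds two arrays of double-precision values eight elements at a time using AVX-512 intrinsics. The function name and the assumption that the array length is a multiple of eight are purely illustrative; compiling for an AVX-512 target (e.g., -xCORE-AVX512 with the Intel compilers, or -mavx512f with GCC/Clang) is assumed.

    #include <immintrin.h>   /* AVX-512 intrinsics */
    #include <stddef.h>

    /* Sketch: c[i] = a[i] + b[i], processing eight doubles per iteration.
       Assumes n is a multiple of 8 to keep the example short. */
    void add_arrays_avx512(const double *a, const double *b, double *c, size_t n)
    {
        for (size_t i = 0; i < n; i += 8) {
            __m512d va = _mm512_loadu_pd(&a[i]);  /* load 8 doubles into a 512-bit register */
            __m512d vb = _mm512_loadu_pd(&b[i]);
            __m512d vc = _mm512_add_pd(va, vb);   /* one instruction adds all 8 lanes */
            _mm512_storeu_pd(&c[i], vc);          /* store 8 results */
        }
    }

In practice, compilers targeting SKX or CLX will often auto-vectorize simple loops like this one, so explicit intrinsics are needed mainly when finer control over the generated instructions is desired.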

The re-architected L2 & L3 cache hierarchy involves a larger L2 cache (1 MB per core) and a non-inclusive L3 cache (1.375 MB per core), such that the private L2 becomes the primary cache and the shared L3 is used as an overflow cache. This memory hierarchy is optimized to enable multithreaded workloads to operate on greater amounts of data per thread while reducing interconnect and L3 activity.[1]

As core counts increased, the Xeon Scalable Processor line introduced a new Mesh Interconnect Architecture to connect the cores within a processor efficiently, moving beyond the coupled-ring architecture that had supported previous Xeon processors.

Additional features introduced into Cascade Lake and successors

While the two generations are largely similar, a few newer features distinguish the second-generation CLX processors from their SKX predecessors[2]:

  • a new AVX-512 VNNI (Vector Neural Network Instructions) instruction set, intended to support computations with deep learning neural networks
  • the ability to interface with Intel Optane DC persistent memory, which is used to support large-memory nodes on Frontera at TACC
  • integrated hardware-based security mitigations against side-channel attacks

The new VNNI instructions are part of an ongoing trend toward supporting lower numerical precision in neural-network-based Deep Learning (DL) algorithms. It has been recognized for some time that neural network algorithms can operate effectively on lower-precision representations of data (e.g., 8-bit and 16-bit integers), in many cases without appreciable impact on prediction accuracy. With no obvious algorithmic performance tradeoff, the benefits of lower precision include reduced memory bandwidth bottlenecks and better use of cache, allowing for greater data throughput with less use of power.
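
To make the bandwidth argument concrete, here is a minimal sketch of symmetric int8 quantization, in which each 32-bit float is approximated by an 8-bit integer plus a shared scale factor; the function name and the choice of scale are illustrative only.

    #include <stdint.h>
    #include <stddef.h>
    #include <math.h>

    /* Sketch: map each float x to round(x / scale), clamped to [-127, 127].
       Storing int8 instead of float cuts memory traffic for this data by 4x. */
    void quantize_int8(const float *x, int8_t *q, size_t n, float scale)
    {
        for (size_t i = 0; i < n; i++) {
            float r = roundf(x[i] / scale);
            if (r >  127.0f) r =  127.0f;
            if (r < -127.0f) r = -127.0f;
            q[i] = (int8_t)r;
        }
    }

The approximate value is recovered as q[i] * scale; for many inference workloads, the small rounding error introduced this way has little effect on prediction accuracy.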

The AVX-512 instruction set introduced in the Xeon Scalable Processors includes 512-bit wide Fused Multiply-Add (FMA) instructions, which can combine lower-precision multiplies with higher-precision accumulates as needed by many Deep Learning algorithms. The AVX-512 VNNI instructions introduced in the Cascade Lake processors provide further support along these lines, introducing new fused instructions that streamline this multiplication and accumulation even further.
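
As a hedged sketch of what VNNI fuses, the fragment below uses the _mm512_dpbusd_epi32 intrinsic (available on CLX when compiling with AVX-512 VNNI support, e.g., -mavx512vnni with GCC/Clang): a single instruction multiplies 64 pairs of 8-bit values and accumulates the products into sixteen 32-bit partial sums. The function and variable names are illustrative, and the inputs are assumed to be already packed into 512-bit registers.

    #include <immintrin.h>   /* requires AVX-512 VNNI (Cascade Lake and later) */

    /* Sketch: the core step of an int8 dot product.  Each of the sixteen
       32-bit lanes of the result receives its accumulator value plus the
       sum of four products of corresponding unsigned 8-bit (a) and signed
       8-bit (w) sub-lanes -- multiply and accumulate fused into one instruction. */
    __m512i int8_dot_step(__m512i acc, __m512i a_u8, __m512i w_s8)
    {
        return _mm512_dpbusd_epi32(acc, a_u8, w_s8);
    }

On SKX, the same computation requires a sequence of three separate instructions (a byte multiply, a word multiply-add, and a 32-bit add), which is exactly what the fused VNNI instruction streamlines.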

These instructions are part of a larger suite of technologies, named Intel Deep Learning Boost, that is being rolled out to support these types of workloads. It should be noted, however, that these improvements to vector hardware and instructions benefit CPU/VPU-based deep learning, even though much of the excitement around and usage of DL algorithms focuses on GPUs, which can execute a large number of the relevant operations in parallel and are especially useful for training models from data.

The trend of expanding AVX-512 to handle new types of workloads continues in other ways as well. For example, in 2022, Stampede2 acquired 224 nodes featuring third-generation Intel Xeon Scalable Processors (code-named "Ice Lake", or ICX). These ICX processors support the "Sunny Cove Core New Instructions", additional 512-bit instructions related to hashing, encryption, and vector bit manipulation.[3] In 2024, when Stampede2 evolved into Stampede3, nodes based on fourth-generation "Sapphire Rapids" (SPR) processors were added, offering still more instructions for operating on 8-bit integer and 16-bit floating-point formats (INT8, FP16, BF16) that are useful in DL and artificial intelligence applications.[4]
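
One practical consequence of these generational differences is that code may need to check which AVX-512 extensions are present on the node where it runs. The sketch below uses the __builtin_cpu_supports feature test available in GCC and Clang; the exact feature strings accepted depend on the compiler version, so treat the names shown as illustrative.

    #include <stdio.h>

    /* Sketch: report a few AVX-512 features at run time (GCC/Clang built-in). */
    int main(void)
    {
        printf("AVX-512 Foundation: %s\n", __builtin_cpu_supports("avx512f")    ? "yes" : "no");
        printf("AVX-512 VNNI:       %s\n", __builtin_cpu_supports("avx512vnni") ? "yes" : "no");
        printf("AVX-512 BF16:       %s\n", __builtin_cpu_supports("avx512bf16") ? "yes" : "no");
        return 0;
    }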

[1] Kumar et al. 2017, The New Intel Xeon Scalable Processor (formerly Skylake-SP)
[2] Kumar et al. 2018, Future Intel Xeon Scalable Processor (Codename: Cascade Lake-SP)
[3] Patrick Kennedy 2020, Intel Ice Lake-SP Next-Gen Xeon Architecture at HC32
[4] Ian Cutress 2021, Intel Xeon Sapphire Rapids: How To Go Monolithic with Tiles
