Prior to 2022, Stampede2 offered approximately 18.2 petaflops (PF) of peak floating-point throughput. Of that, 12.8 PF came from the Xeon Phi (KNL) compute nodes, and the remaining 5.4 PF came from the Skylake Xeon (SKX) compute nodes. The KNL and SKX processors have similar architectures, though with quite different design tradeoffs, and they use similar (but not binary-compatible) 64-bit x86 instruction sets.

One key similarity between KNL and SKX is that both derive much of their floating-point performance from a large number of CPU cores, each with 2 vector units. Because their vector units function in largely similar ways, the concepts and techniques discussed in this Cornell Virtual Workshop topic apply to both processor types.
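As a rough illustration of why the vector units matter so much, the following C sketch estimates theoretical peak floating-point throughput from per-core vector resources. The function names are hypothetical. Only the count of 2 vector units per core comes from the text above; the vector width, element width, operations per lane, core count, and clock rate are inputs the caller must supply.

```c
/* Number of elements (lanes) that fit in one vector register. */
int simd_lanes(int vector_bits, int element_bits)
{
    return vector_bits / element_bits;
}

/* Peak floating-point operations one core can complete per clock cycle.
 * flops_per_lane is 2 if a fused multiply-add is counted as two operations. */
int flops_per_cycle(int vector_units, int lanes, int flops_per_lane)
{
    return vector_units * lanes * flops_per_lane;
}

/* Theoretical peak FLOP/s for a processor built from identical cores. */
double peak_flops(int cores, int flops_per_cycle_per_core, double clock_hz)
{
    return (double)cores * (double)flops_per_cycle_per_core * clock_hz;
}
```

Assuming 512-bit vectors of 64-bit doubles and fused multiply-add, `flops_per_cycle(2, simd_lanes(512, 64), 2)` evaluates to 32, whereas a single scalar fused multiply-add, `flops_per_cycle(1, 1, 2)`, evaluates to 2. Under those assumptions, code that never uses the vector units leaves most of a core's peak throughput idle.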

Figure: Vectorization and multithreading as contributors to performance on Xeon and Xeon Phi. Bar charts of relative performance show that both are more critical to Xeon Phi. (Image courtesy Intel)

The figure above illustrates the relative importance of each form of parallelism to the performance of the two processors. Vectorization makes use of the parallel SIMD hardware available within a core, while multithreading makes use of all the cores in the processor.
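The two forms of parallelism can be combined in a single loop. The following is a minimal sketch, assuming a compiler with OpenMP 4.0 or later support (for example, GCC with `-fopenmp`); the function name `daxpy_parallel` is hypothetical and chosen only for illustration.

```c
#include <stddef.h>

/* Compute y[i] = a * x[i] + y[i] for i in [0, n).
 * "parallel for" divides the iterations among threads (multithreading);
 * "simd" asks the compiler to vectorize each thread's share (vectorization).
 * If OpenMP is not enabled, the pragma is ignored and the loop runs serially. */
void daxpy_parallel(size_t n, double a,
                    const double *restrict x, double *restrict y)
{
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The `restrict` qualifiers promise the compiler that `x` and `y` do not overlap, which removes one common obstacle to vectorization. Whether the loop is actually vectorized still depends on the compiler and the flags used.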

 