Architecture Overview
Stampede2 is built out of a combination of Intel Knights Landing (KNL) processors and Intel Xeon Skylake processors. The KNL portion consists of 4200 KNL compute nodes, each with 68 cores per node, and 4 hardware threads per core.
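To put those figures in perspective, the totals for the KNL partition can be worked out with a few lines of shell arithmetic. This is only a back-of-the-envelope sketch: the three input numbers come from the paragraph above, while the `product` helper and the variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# Inputs taken from the description above.
knl_nodes=4200
cores_per_node=68
threads_per_core=4

# Hypothetical helper: multiply all of its arguments together.
product() {
  local result=1
  local factor
  for factor in "$@"; do
    result=$(( result * factor ))
  done
  echo "$result"
}

threads_per_node=$(product "$cores_per_node" "$threads_per_core")
total_cores=$(product "$knl_nodes" "$cores_per_node")
total_threads=$(product "$knl_nodes" "$cores_per_node" "$threads_per_core")

echo "Hardware threads per KNL node: $threads_per_node"
echo "KNL cores in the partition:    $total_cores"
echo "KNL hardware threads in total: $total_threads"
```

These are counts of hardware resources only; they say nothing about how many threads a given code should actually run.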
Key points:
- KNL processors have a complex internal structure: each processor consists of 36 tiles interconnected by a 2D mesh, and each tile contains 2 cores, 2 vector processing units (VPUs) per core, and 1 MB of L2 cache. In addition, 16 GB of MCDRAM is configured as a Level 3 (L3) cache.
- Optimizing performance on a cluster requires understanding how the different processor and memory components interact. As with any modern processor, particular attention must be paid to getting sufficient data from memory to processors to keep the processors busy.
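The tile description above also gives a rough sense of how much vector hardware each core has. The sketch below assumes that each VPU operates on 512-bit vectors and that a double-precision value occupies 64 bits; neither width is stated in the text, so treat both as assumptions. The variable names are illustrative only.

```shell
#!/usr/bin/env bash
vpu_width_bits=512   # assumption: width of one KNL vector register
double_bits=64       # assumption: size of a double-precision value
vpus_per_core=2      # from the tile description above
cores_per_node=68    # from the node description above

# Doubles that fit in one vector register.
doubles_per_vector=$(( vpu_width_bits / double_bits ))

# Double-precision lanes across both VPUs of a single core.
double_lanes_per_core=$(( doubles_per_vector * vpus_per_core ))

# Double-precision lanes across every core of one node.
double_lanes_per_node=$(( double_lanes_per_core * cores_per_node ))

echo "Doubles per vector:    $doubles_per_vector"
echo "Double lanes per core: $double_lanes_per_core"
echo "Double lanes per node: $double_lanes_per_node"
```

These numbers describe the width of the hardware, not achieved performance; real codes reach them only when loops vectorize well and memory can keep the vector units fed.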
For the KNL portion of Stampede2, the practices TACC recommends in TR-16-03 include the following:
- Target AVX-512: Compile your code using -xMIC-AVX512.
- Parallelize: Parallel codes may scale very well with increasing core count. Serial codes will likely perform at levels consistent with the KNL's reduced clock rate.
- Vectorize: Vectorized codes may achieve up to 16x double precision speedup.
- Use MCDRAM: Unless your code is latency bound, use MCDRAM. In cache mode (the default queue on Stampede2), this happens automatically, assuming your total memory usage is less than the available MCDRAM. In flat mode, launch your code with numactl --membind=1 <exe>.
- Manage thread count: Optimal performance typically occurs at 1-2 threads per core.
- Manage MPI task count: A minimum of 2 MPI tasks per KNL helps achieve OPA bandwidth saturation.
- Leave 2-4 cores free: This helps manage OPA traffic and improves scalability.
On Stampede2, the -xMIC-AVX512 flag is specific to the KNL nodes. For additional information about compiling for Skylake nodes on Stampede2, or compiling a multi-architecture "fat" binary, see the Stampede3 User Guide.
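As a rough illustration of how the thread-count, MPI-task, and free-core recommendations combine, the following shell sketch derives an OpenMP thread count per MPI task for one KNL node and prints (but does not run) the matching compile and flat-mode launch commands. The function name `threads_per_task`, the application name `my_app`, the use of `icc` as the compiler driver, and the `-qopenmp` flag are assumptions added for illustration; only -xMIC-AVX512 and numactl --membind=1 come from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of any TACC tooling.
# Usage: threads_per_task CORES_PER_NODE CORES_FREE TASKS_PER_NODE THREADS_PER_CORE
# Prints the number of threads each MPI task should run.
threads_per_task() {
  local cores=$1 free=$2 tasks=$3 tpc=$4

  # KNL offers 4 hardware threads per core, so reject anything larger.
  if (( tasks < 1 || free < 0 || free >= cores || tpc < 1 || tpc > 4 )); then
    echo "invalid layout" >&2
    return 1
  fi

  local usable=$(( cores - free ))
  # Integer division: any leftover cores simply stay idle.
  local cores_per_task=$(( usable / tasks ))
  if (( cores_per_task < 1 )); then
    echo "more tasks than usable cores" >&2
    return 1
  fi

  echo $(( cores_per_task * tpc ))
}

# Example layout: 68 cores, 4 left free, 2 MPI tasks, 2 threads per core.
omp_threads=$(threads_per_task 68 4 2 2)

# Dry run: print the commands rather than executing them.
echo "icc -xMIC-AVX512 -qopenmp -o my_app my_app.c"   # assumed compiler and OpenMP flag
echo "export OMP_NUM_THREADS=$omp_threads"
echo "numactl --membind=1 ./my_app"                   # flat mode only
```

The layout in the example is one point inside the recommended ranges, not a tuned setting; the best combination of tasks, threads, and free cores depends on the application and should be found by measurement.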