A baseline scaling study indicates how well the code performs when it is distributed across various numbers of MPI ranks, either on a single node (intra-node scaling) or across multiple nodes (inter-node scaling). Background information for this segment can be found in the Scalability and MPI topics of the Cornell Virtual Workshop.

Video Transcript
Key points:
  • The intra-node scaling study compares performance on a single KNL node (68 cores @ 1.4 GHz, with 2 x 512-bit vector units per core) versus that on a single Sandy Bridge Xeon node (16 cores @ 2.6 GHz, with 2 x 256-bit vector units per core).
  • As expected, scaling with an increasing number of MPI ranks plateaus once that number exceeds the number of available cores. For small numbers of ranks, the code runs faster on the Sandy Bridge Xeon at 2.6 GHz than on the KNL at 1.4 GHz, although the KNL's clock-speed deficit is partially offset by its wider vector units.
  • The substantially larger number of available cores on KNL allows for overall better performance with a larger number of MPI ranks.
  • Across multiple nodes, maximum performance is achieved with 128 total ranks, distributed either as 4 nodes x 32 ranks per node or 8 nodes x 16 ranks per node. Scaling efficiency degrades beyond 8 nodes, perhaps due to MPI communication overhead or to core/thread resource contention.
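The plateau and efficiency degradation described above are typically quantified with strong-scaling speedup, S(p) = T(base)/T(p), and parallel efficiency, E(p) = S(p) / (p/base). A minimal sketch of that calculation is below; the timing values are hypothetical placeholders, not measured data from this study.

```python
# Illustrative strong-scaling analysis for a fixed problem size.
# Speedup is measured relative to the smallest rank count in the data,
# and efficiency is speedup divided by the corresponding rank multiple.
# All timing values here are hypothetical, for demonstration only.

def scaling_table(timings):
    """timings: dict mapping MPI rank count -> wall-clock seconds.

    Returns a list of (ranks, time, speedup, efficiency) tuples,
    using the smallest rank count as the baseline.
    """
    base_ranks = min(timings)
    base_time = timings[base_ranks]
    rows = []
    for p in sorted(timings):
        speedup = base_time / timings[p]
        efficiency = speedup / (p / base_ranks)
        rows.append((p, timings[p], speedup, efficiency))
    return rows

if __name__ == "__main__":
    # Hypothetical wall-clock times at each total rank count.
    times = {16: 100.0, 32: 52.0, 64: 28.0, 128: 16.0, 256: 14.5}
    for p, t, s, e in scaling_table(times):
        print(f"{p:4d} ranks: {t:6.1f} s  speedup {s:5.2f}  efficiency {e:4.2f}")
```

With numbers like these, efficiency stays near 1 up to 128 ranks and then drops sharply, which is the same signature as the plateau seen in the study.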
©   Cornell University  |  Center for Advanced Computing  |  Copyright Statement  |  Inclusivity Statement