Depending on the problem size and requirements, Service Unit (SU) utilization and wall-clock run times can scale differently on different architectures.

Video Transcript
Key points:
  • After porting the application code to the KNL nodes on Stampede2 (which have 64 cores per node, as opposed to 16 cores per node on Stampede), the same benchmark runs on 1 KNL node on Stampede2 in only about 30% more wall-time than on 4 KNC nodes on Stampede. Because only one quarter as many nodes are charged, this reduces overall allocation use by approximately 3X (see the first sketch following this list). Saving on allocations while taking more wall-time is somewhat unexpected and counterintuitive, and it presents interesting trade-offs in how to manage jobs and allocations when migrating to a new system.
  • Different node configurations are optimal depending on whether one wants to minimize SU utilization or wall-time to completion. For this application and small problem sizes, performance enhancements from threading via OpenMP are minimal due to the overhead associated with initializing and finalizing the OpenMP environment in each iteration (illustrated in the second sketch following this list). For larger problem sizes, the penalty of this overhead is reduced, and some performance gains are realized.
  • In addition to the traditional profiling and scaling studies described in more detail below, further testing and performance evaluation are required to characterize these other aspects of resource allocation and optimization.
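
The roughly 3X figure in the first key point can be checked with simple arithmetic. The sketch below assumes that SUs are charged in proportion to nodes times wall-clock hours, with the same per-node charge rate on both systems; this is an illustrative assumption, not a statement of TACC's actual charging policy.

    /*
     * Back-of-the-envelope check of the ~3X allocation savings noted above.
     * Assumption (illustration only): SUs are charged in proportion to
     * nodes x wall-clock hours, with equal per-node charge rates on both
     * systems; actual charge rates are set by TACC allocation policy.
     */
    #include <stdio.h>

    int main(void)
    {
        double t_knc = 1.0;           /* hypothetical wall-time on Stampede, in hours */
        double t_knl = 1.3 * t_knc;   /* ~30% more wall-time on Stampede2             */

        double su_stampede  = 4.0 * t_knc;   /* 4 KNC nodes on Stampede  */
        double su_stampede2 = 1.0 * t_knl;   /* 1 KNL node on Stampede2  */

        /* Prints ~3.08: roughly a 3X reduction in allocation use, even
         * though the single-node KNL run takes longer in wall-clock time. */
        printf("SU ratio (Stampede / Stampede2): %.2f\n",
               su_stampede / su_stampede2);
        return 0;
    }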
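
The second key point refers to the fork/join cost paid whenever an OpenMP parallel region is created and destroyed inside every iteration of an outer loop. The minimal sketch below is not taken from the application itself; the array size, iteration count, and update kernel are made up for illustration. It contrasts that per-iteration pattern with hoisting the parallel region outside the outer loop, which pays the fork/join cost only once, so the overhead shrinks relative to the useful work as the problem size grows.

    /*
     * Illustrative sketch: per-iteration OpenMP fork/join overhead versus a
     * hoisted parallel region.  Compile with, e.g.:
     *     gcc -O2 -fopenmp omp_overhead.c -o omp_overhead
     */
    #include <stdio.h>
    #include <omp.h>

    #define NITER 10000   /* outer (time-step) iterations        */
    #define N     1000    /* work per iteration; "problem size"  */

    int main(void)
    {
        static double a[N], b[N];
        double t0, t1;

        /* Pattern A: a parallel region is created and destroyed inside every
         * outer iteration, so the fork/join cost is paid NITER times.  For
         * small N, this overhead can dominate the useful work. */
        t0 = omp_get_wtime();
        for (int iter = 0; iter < NITER; iter++) {
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                a[i] += 0.5 * b[i];
        }
        t1 = omp_get_wtime();
        printf("parallel region inside loop:  %.3f s\n", t1 - t0);

        /* Pattern B: the parallel region is hoisted outside the outer loop,
         * so the fork/join cost is paid once; the omp for worksharing
         * construct (with its implicit barrier) keeps threads in step. */
        t0 = omp_get_wtime();
        #pragma omp parallel
        {
            for (int iter = 0; iter < NITER; iter++) {
                #pragma omp for
                for (int i = 0; i < N; i++)
                    b[i] += 0.5 * a[i];
            }
        }
        t1 = omp_get_wtime();
        printf("parallel region outside loop: %.3f s\n", t1 - t0);

        return 0;
    }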
 