Measuring Efficiency
In parallel computing, efficiency is the proportion of the simultaneously available resources that a computation actually uses. Proper HPC code, by definition, aims for the highest possible efficiency. For computationally intensive code, we usually focus on whether every processor is always performing useful work for the algorithm. The nodes in Stampede2 have either 80 cores (ICX nodes), 68 cores (KNL nodes), or 48 cores (SKX nodes). An application running on an ICX node with a single thread of execution (a serial program) will be using only 1/80 of the computing power of the node, or even less if its operations aren't vectorized! An application that requires lots of memory will likewise be unable to attain the full memory bandwidth of the node with a single thread. A serial program running alone seriously underutilizes a cluster node and will waste your allocation.
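As a quick illustration, a program can ask its threading runtime how much of the node it is occupying. The minimal sketch below assumes OpenMP (the file name and compile line are only examples); a run with OMP_NUM_THREADS=1 on an ICX node would report 1 of 80 processors.

```c
/* report_usage.c - minimal sketch: how many processors does this node
 * offer, and how many threads is this program actually running?
 * Compile with: gcc -fopenmp report_usage.c */
#include <stdio.h>
#include <omp.h>

int main(void) {
    int ncores = omp_get_num_procs();      /* logical processors on the node */
    int nthreads;

    #pragma omp parallel
    {
        #pragma omp single
        nthreads = omp_get_num_threads();  /* threads actually in use */
    }

    printf("Using %d of %d available processors (%.1f%% of the node)\n",
           nthreads, ncores, 100.0 * nthreads / ncores);
    return 0;
}
```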
Processor utilization can be measured even more stringently by counting how many floating-point operations are performed while completing a calculation. The ratio of a code's achieved rate of FLoating-point OPerations per Second (FLOPS) to the peak possible rate is a common way to report the overall efficiency of parallel code. The peak possible performance is calculated by assuming that every processor core performs the maximum possible number of floating-point operations during each clock cycle.
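As a worked example, the peak rate is the product of the core count, the clock rate, and the maximum floating-point operations each core can retire per cycle. Assuming, for illustration, a KNL node with 68 cores, a 1.4 GHz clock, and 32 double-precision operations per core per cycle (two 512-bit vector units, each issuing fused multiply-adds):

$$ \text{Peak} \;=\; 68 \times 1.4\,\text{GHz} \times 32\,\frac{\text{FLOPs}}{\text{cycle}} \;\approx\; 3.0\ \text{TFLOPS} $$

A code that sustained 300 GFLOPS on such a node would therefore be reported as running at roughly 10% of peak.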
A common way to assess the efficiency of a parallel program is through its speedup. Speedup is defined as the ratio of the wallclock time for a serial program to the wallclock time for the parallel program that accomplishes the same work.
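In symbols, if $T_{\text{serial}}$ and $T_{\text{parallel}}$ denote these two wallclock times, the speedup $S$ is:

$$ S \;=\; \frac{T_{\text{serial}}}{T_{\text{parallel}}} $$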
When programs are limited by the computing speed of the processors, the speedup cannot be greater than the number of parallel resources on which the program is running. Therefore, for a task-based parallel program running on $N$ cores, a useful definition of efficiency is:

$$ E \;=\; \frac{S}{N} \;=\; \frac{T_{\text{serial}}}{N \times T_{\text{parallel}}} $$
A worthy goal for optimizing parallel programs is to bring their speedup as close as possible to the number of cores. Programs that are limited by some other factor, e.g., I/O, may fall far short of this goal.
When optimizing a program, a programmer may use other criteria to gauge more precisely the effectiveness of particular optimization strategies. Still, we define speedup as we do because the ultimate goal of parallel processing is to get your results back as soon as possible, and the wallclock ratio is the direct measure of that.
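Because speedup is just a ratio of wallclock times, it is straightforward to measure directly. The sketch below uses a hypothetical floating-point kernel (the function and problem size are placeholders, with OpenMP assumed for both threading and timing); it times the same work serially and in parallel, then reports speedup and efficiency as defined above.

```c
/* speedup.c - minimal sketch: measure speedup and efficiency for a
 * compute-bound loop. Compile with: gcc -O2 -fopenmp speedup.c -lm */
#include <stdio.h>
#include <math.h>
#include <omp.h>

#define N 50000000L

/* Placeholder for a floating-point-heavy kernel (serial version). */
double work(long n) {
    double sum = 0.0;
    for (long i = 1; i <= n; i++)
        sum += sin((double)i) * cos((double)i);
    return sum;
}

/* Same kernel, parallelized across threads with a reduction. */
double work_parallel(long n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i <= n; i++)
        sum += sin((double)i) * cos((double)i);
    return sum;
}

int main(void) {
    double t0 = omp_get_wtime();
    double s1 = work(N);                       /* serial wallclock time */
    double t_serial = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    double s2 = work_parallel(N);              /* parallel wallclock time */
    double t_parallel = omp_get_wtime() - t0;

    int nthreads = omp_get_max_threads();
    double speedup = t_serial / t_parallel;

    printf("serial %.3fs  parallel %.3fs  (results %.6g vs %.6g)\n",
           t_serial, t_parallel, s1, s2);
    printf("speedup %.2f on %d threads -> efficiency %.1f%%\n",
           speedup, nthreads, 100.0 * speedup / nthreads);
    return 0;
}
```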