In parallel computing, efficiency is the proportion of the simultaneously available resources that a computation actually uses. Well-written HPC code aims for the highest possible efficiency. For computationally intensive code, the usual question is whether every processor is always performing useful work for the algorithm. The nodes in Stampede2 have either 80 cores (ICX nodes), 68 cores (KNL nodes), or 48 cores (SKX nodes). An application running on an ICX node with a single thread of execution (a serial program) uses only 1/80 of the computing power of the node, or even less if its operations aren't vectorized! An application that requires lots of memory will likewise be unable to attain the full memory bandwidth of the node with a single thread. A serial program running alone seriously underutilizes a cluster node and wastes your allocation.

Processor utilization can be measured even more stringently by asking how many floating-point operations per second are performed while completing a calculation. The ratio of the achieved rate of FLoating-point OPerations per Second (FLOPS) to the peak possible performance is a common way to report the overall efficiency of parallel code. The peak possible performance is calculated under the assumption that every processor core performs the maximum possible number of floating-point operations during every clock cycle.
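In other words, the theoretical peak is simply the product of the core count, the clock rate, and the number of floating-point operations each core can complete per cycle:

\[\text{peak FLOPS} = (\text{number of cores}) \times (\text{clock rate in cycles per second}) \times (\text{FLOPs per cycle per core})\]

For example, under the assumption (illustrative figures only) that a 48-core node runs at 2.1 GHz and that each core can complete 32 double-precision floating-point operations per cycle using vector fused multiply-add instructions, the peak would be roughly 48 × 2.1×10⁹ × 32 ≈ 3.2×10¹² FLOPS, or about 3.2 TFLOPS. Achieved performance on real applications is typically a small fraction of this figure.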

A common way to assess the efficiency of a parallel program is through its speedup. Speedup is defined as the ratio of the wallclock time for a serial program to the wallclock time for the parallel program that accomplishes the same work.

\[\text{speedup} = {\text{wallclock time for serial} \over \text{wallclock time for parallel}}\]
Definition of speedup.
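For example, with assumed timings of 1200 seconds of wallclock time for a serial run and 25 seconds for a parallel run of the same problem:

\[\text{speedup} = {1200\ \text{s} \over 25\ \text{s}} = 48\]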

When programs are limited by the computing speed of the processors, the speedup cannot be greater than the number of parallel resources on which the program is running. Therefore, for a task-based parallel program, a useful definition of efficiency is:

\[\text{parallel efficiency} = {\text{speedup} \over \text{number of cores}}\]
Definition of efficiency for a task-based parallel program.
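As an illustrative sketch (not taken from any CVW exercise; the loop body, problem size, and timing approach are assumptions made only to keep the example self-contained), the following C/OpenMP program times the same reduction with one thread and with all available threads, then reports the speedup and parallel efficiency as defined above:

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Run the same reduction with a given number of threads. */
static double sum_of_squares(const double *a, long n, int nthreads)
{
    double sum = 0.0;
    #pragma omp parallel for num_threads(nthreads) reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += a[i] * a[i];
    return sum;
}

int main(void)
{
    const long n = 50000000;                 /* problem size (arbitrary) */
    double *a = malloc(n * sizeof *a);
    if (!a) return 1;
    for (long i = 0; i < n; i++)
        a[i] = 1.0;

    int ncores = omp_get_max_threads();      /* threads available to this run */

    double t0 = omp_get_wtime();
    double s1 = sum_of_squares(a, n, 1);          /* serial baseline */
    double t_serial = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    double s2 = sum_of_squares(a, n, ncores);     /* same work, all threads */
    double t_parallel = omp_get_wtime() - t0;

    double speedup    = t_serial / t_parallel;
    double efficiency = speedup / ncores;

    printf("serial: %.3f s   parallel: %.3f s on %d threads\n",
           t_serial, t_parallel, ncores);
    printf("speedup: %.2f   parallel efficiency: %.2f\n", speedup, efficiency);
    /* Print the sums so the compiler cannot optimize the loops away. */
    printf("(checks: %.1f %.1f)\n", s1, s2);

    free(a);
    return 0;
}
```

With GCC, this could be compiled with something like gcc -O2 -fopenmp speedup.c -o speedup. The measured speedup will vary with the node, compiler, and problem size; a memory-bandwidth-bound loop like this one will typically fall short of perfect efficiency even when every core is busy.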

A worthy goal for optimizing parallel programs is to bring their speedup as close as possible to the number of cores. Programs that are limited by some other factor, e.g., I/O, may fall far short of this goal.

A stylized plot of speedup versus number of nodes, showing the ideal linear trend, the theoretical upper limit imposed by the serial portions of the program, and a realistic curve in which coordination overhead eventually outweighs the gains from adding nodes.
Maintaining a linear speedup as the number of nodes increases is not realistic. According to Amdahl's Law (explained on the next page), any serial portions of the program impose a theoretical upper limit on the speedup as the number of nodes increases. In a realistic case, coordination overhead may eventually surpass any gains from adding additional nodes. Efficiency and scalability are covered in more detail in the CVW Scalability topic.

When optimizing a program, a programmer may use other criteria to gauge more precisely the effectiveness of particular optimization strategies. Still, we define speedup as we do because the ultimate goal of parallel processing is to get your results back as soon as possible, and the wallclock ratio is a direct measure of that.

 