Another common definition of scalability assumes that the problem size can grow as you increase the number of PEs. This is called weak scaling. It applies if you brought your problem to Frontera not just to finish the same calculation faster, but to do larger calculations. In this model, the problem size increases in proportion to N, the number of PEs. If you had to do the entire problem on a single PE, it would take

\[total~time = serial~time + N*(parallel~time~for~a~single~chunk)\]

Computing with N PEs, it becomes

\[total~time = serial~time + (parallel~time~for~a~single~chunk)\]

Assuming the serial time doesn't change much as the problem size grows, processing with more and more PEs will always take about the same amount of time, because the number of fixed-size chunks increases in tandem with N. (We recover Amdahl's Law from the previous page if the chunk size instead shrinks like p/N, where p is a constant.)
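
To make the comparison concrete, here is a minimal Python sketch of the two formulas above (the timing numbers are invented for illustration): the single-PE time grows linearly with N, while the time on N PEs stays flat as long as the serial time is fixed.

```python
# Toy model of weak scaling: the problem grows with N, so each PE always
# processes one fixed-size chunk. All timing numbers are made up.
serial_time = 10.0    # seconds of inherently serial work
chunk_time = 100.0    # seconds to process one chunk on a single PE

for n_pes in (1, 4, 16, 64, 256):
    time_on_one_pe = serial_time + n_pes * chunk_time  # whole problem on one PE
    time_on_n_pes = serial_time + chunk_time           # one chunk per PE
    ratio = time_on_one_pe / time_on_n_pes             # "scaled speedup"
    print(f"N={n_pes:4d}: one PE {time_on_one_pe:8.1f} s, "
          f"N PEs {time_on_n_pes:6.1f} s, ratio {ratio:6.2f}")
```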

Weak scaling is associated with Gustafson's Law, which defines the speedup W as the ratio of the two times above. It says that by doing a greater amount of work on N processors, you are in fact getting your overall work done faster, a benefit known as "scaled speedup":

\[W = N - a*(N-1)\]

In this law it is implicit that a, the serial fraction of the per-process workload, does not change much as N (and the problem size) grows. In other words, the chunk of parallel work on each PE stays roughly the same, so a does not approach 1 (or does so only very slowly).
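
As a quick illustration of how the formula behaves, the short sketch below tabulates W for a hypothetical serial fraction a = 0.05; the scaled speedup stays close to 95% of ideal no matter how large N becomes.

```python
# Gustafson's Law: scaled speedup W = N - a*(N - 1), where a is the serial
# fraction of the per-process workload (assumed constant as N grows).
a = 0.05  # hypothetical serial fraction

for n in (1, 10, 100, 1000, 10000):
    w = n - a * (n - 1)
    print(f"N={n:6d}: W = {w:9.2f}  ({w / n:.0%} of ideal speedup)")
```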

Trivially parallel work

A special case of weak scaling is called trivially parallelizable or embarrassingly parallel. Some algorithms naturally decompose into many independent parts; often these parts are so completely independent that, once they begin, they need not communicate until the end of the computation, if at all. If your code processes many disjoint pieces of data, or searches a broad parameter space, it may fall into this category. It doesn't take a lot of work to make code like this run well on thousands of nodes. But many common schemes for coordinating sub-tasks from a manager task run into problems when there are many thousands of worker tasks, so look for bottlenecks in how the manager handles its workers.
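
As a minimal sketch of the embarrassingly parallel pattern (a hypothetical task, with Python's standard multiprocessing pool standing in for whatever batch scheduler or MPI launcher you would actually use at scale), each worker handles its own piece of data and nothing is communicated until the results are gathered at the end:

```python
from multiprocessing import Pool

def process_piece(piece):
    """Hypothetical independent task: no communication with other workers."""
    return sum(x * x for x in piece)

if __name__ == "__main__":
    # Disjoint pieces of data; in a real run these might be separate input
    # files or points in a parameter sweep.
    pieces = [range(i * 1000, (i + 1) * 1000) for i in range(8)]

    # The pool plays the role of the manager task. With many thousands of
    # workers, the way it hands out tasks and collects results is exactly
    # where a bottleneck can appear.
    with Pool(processes=4) as pool:
        results = pool.map(process_piece, pieces)

    print("combined result:", sum(results))
```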

There are yet other ways to define scalability. For instance, users who want to optimize the total amount of work a machine completes for their group might opt for an isoefficiency measure, requiring that each core in a parallel job complete (say) at least 66% of the work per unit time that it would complete in a serial job.
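
For example, a hedged sketch of checking such a threshold from measured wall-clock times (the numbers below are invented) might compute the parallel efficiency, i.e. the serial time divided by N times the parallel time, and compare it against the 66% target:

```python
# Parallel efficiency from measured wall-clock times (invented numbers).
def efficiency(serial_time, parallel_time, n_pes):
    return serial_time / (n_pes * parallel_time)

measurements = [  # (N, wall-clock time with N PEs), hypothetical data
    (1, 1000.0),
    (16, 70.0),
    (64, 20.0),
    (256, 6.0),
]

serial_time = measurements[0][1]
for n, t in measurements:
    e = efficiency(serial_time, t, n)
    verdict = "meets" if e >= 0.66 else "falls below"
    print(f"N={n:4d}: efficiency = {e:.2f} ({verdict} the 66% target)")
```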

Capability vs. capacity

However you define scalability, the ultimate goal is to ensure that your application makes good use of a large-scale resource like Frontera. You'd like to show that Frontera enables you to do more. At the very least, you want to be able to convince the reviewers that your allocation request is worthy of an award! So what kind of scalability are the reviewers looking for, and how can you demonstrate that your application possesses it?

Jobs that run on any world-class HPC resource can be divided into two rough categories, capability and capacity runs. Capability runs use most of the resources of the machine in a single job. On Frontera, this means using thousands of processor cores in order to solve the largest problems possible. When many smaller jobs use the machine simultaneously, these are called capacity runs. Even though an individual job may take longer to complete its work, running such jobs on fewer nodes is generally more efficient. The overall throughput of the resource increases for capacity runs. But the reviewers are looking for capability runs, because those are the ones that can be done nowhere else.

Interestingly, codes with weak scaling are typically the ones that can do the capability runs, rather than the ones with strong scaling. The reason is that strong scaling usually holds only over some finite range of N and often breaks down when N becomes large. This means that weak scaling is in some sense more desirable. However, the extreme case of a trivially parallelizable code may not be the best candidate for running on Frontera, because if you think about it, you are really filling up the machine with a bunch of capacity runs instead of one big capability run.
