Petascale with MPI?
There are petascale applications running today on Frontera and elsewhere, some of them on hundreds of thousands of processor cores. They succeed mainly by using algorithms that favor local interactions among neighboring tasks; by achieving excellent load balance to reduce synchronization cost; by minimizing synchronization in general; and by avoiding the excess latency associated with global communication.
Nearest neighbor vs. all-to-all
As we have seen, it is often wise to decompose an algorithm so that tasks need to communicate only with their nearest neighbors (in a "process topology"). Then, as the number of tasks N increases, the amount of communication done by each task either grows slowly or stays unchanged. Specifically, for a constant total workload (strong scaling), the communication per unit of work increases only gradually, because each task's shrinking subdomain has a rising surface-to-volume ratio; if instead the workload is allowed to grow with the processor count (weak scaling), the communication per unit of work remains constant.
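Here is a minimal sketch of what such a nearest-neighbor exchange might look like in MPI, using a Cartesian process topology. The buffer size NX and the single-row exchange along one dimension are assumptions made purely for illustration; a real code would exchange halos in every direction.

```c
/* Sketch of a nearest-neighbor ("halo") exchange on a 2D process grid.
 * NX and the single-row exchange are assumptions made for illustration. */
#include <mpi.h>

#define NX 1024   /* local grid points along one edge of each task's subdomain */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Let MPI factor the tasks into a 2D grid and build a Cartesian communicator */
    int dims[2] = {0, 0}, periods[2] = {0, 0};
    MPI_Dims_create(nprocs, 2, dims);
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    /* Find the neighboring ranks one step away along dimension 0 */
    int src, dst;
    MPI_Cart_shift(cart, 0, 1, &src, &dst);

    /* Exchange one boundary row with the neighbors; the message size depends
       only on the local subdomain, not on the total number of tasks */
    double send_row[NX] = {0.0}, recv_row[NX];
    MPI_Sendrecv(send_row, NX, MPI_DOUBLE, dst, 0,
                 recv_row, NX, MPI_DOUBLE, src, 0,
                 cart, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```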
The worst alternative is for all tasks to communicate with all other tasks, which is nominally an N-squared operation. In that case, the fraction of time spent on useful computation drops dramatically as the number of tasks increases. An illustration of this is molecular dynamics with long-range forces: if every particle exerts a force on every other particle, then (in a naive implementation) every task must communicate with every other task. Clearly you'd like your algorithm to avoid this.
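For contrast, a naive long-range force calculation might gather every particle's position onto every task at each step, for example with MPI_Allgather. The function and buffer names below are hypothetical; the point is that each task's communication volume grows with the total particle count, no matter how many tasks you add.

```c
/* Naive long-range force exchange: every task gathers every particle's
 * position at every step. The names local_xyz, nlocal, and all_xyz are
 * hypothetical; each task sends and receives an amount of data
 * proportional to the *total* particle count. */
#include <mpi.h>

void exchange_positions(const double *local_xyz, int nlocal,
                        double *all_xyz, MPI_Comm comm)
{
    /* Each task contributes 3*nlocal coordinates; all_xyz must be large
       enough to hold 3*nlocal doubles from every task in the communicator */
    MPI_Allgather(local_xyz, 3 * nlocal, MPI_DOUBLE,
                  all_xyz,   3 * nlocal, MPI_DOUBLE, comm);
}
```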
Load balance vs. synchronization cost
Other kinds of global communication may be less costly than all-to-all, but they still hurt performance. One example that occurs in many MPI codes is a simple barrier, or synchronization among tasks. On the surface, MPI_Barrier falls into the same class as MPI_Alltoall or MPI_Allgather: all are MPI collective operations involving global communication. But an MPI_Alltoall with large messages can take O(N) time, whereas an MPI_Barrier typically completes in O(log N) time. So how bad is the barrier?
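One way to get a feel for the difference on a given system is to time the two collectives directly. The sketch below is a rough micro-benchmark rather than a rigorous measurement; the message size and repetition count are arbitrary choices.

```c
/* Rough timing sketch comparing a barrier to an all-to-all.
 * Message size and repetition count are arbitrary choices. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int count = 1024;  /* doubles sent to each task */
    double *sendbuf = malloc(count * nprocs * sizeof(double));
    double *recvbuf = malloc(count * nprocs * sizeof(double));
    for (int i = 0; i < count * nprocs; i++) sendbuf[i] = (double)rank;

    /* Time the barrier: cost typically grows like log(N) */
    double t0 = MPI_Wtime();
    for (int i = 0; i < 100; i++) MPI_Barrier(MPI_COMM_WORLD);
    double t_barrier = (MPI_Wtime() - t0) / 100;

    /* Time the all-to-all: per-task data volume grows like N */
    t0 = MPI_Wtime();
    for (int i = 0; i < 100; i++)
        MPI_Alltoall(sendbuf, count, MPI_DOUBLE,
                     recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD);
    double t_alltoall = (MPI_Wtime() - t0) / 100;

    if (rank == 0)
        printf("avg barrier %.3g s, avg alltoall %.3g s on %d tasks\n",
               t_barrier, t_alltoall, nprocs);

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}
```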
Many applications coordinate their tasks by relying on barriers and other types of global communication at regular intervals, which makes them highly synchronous. These lockstep applications proceed only as quickly as the slowest task. Keeping any one task from being slower than the others is called load balancing. While it is evidently essential to scalability, load balancing becomes increasingly difficult to achieve as N becomes large.
A symptom of a load balance problem is that as N grows, an ever-larger fraction of the run time is consumed by your MPI collective calls, due to synchronization delays. You can therefore judge the scalability of your application by examining the percentage of time spent waiting for global communications. If it becomes clear that detailed load balancing is impossible to achieve, you need to find ways to make the application less synchronous.
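Short of running a full profiler, a simple way to probe this is to time the compute phase and an explicit barrier placed just before the collective; the barrier wait then approximates the synchronization delay caused by imbalance. The sketch below assumes a placeholder compute_step() standing in for the application's real work.

```c
/* Sketch of a simple load-imbalance probe around a collective. */
#include <mpi.h>
#include <stdio.h>

/* Placeholder for the application's per-step work (assumed for the sketch) */
static void compute_step(void) { }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t, local_sum = 0.0;

    /* Time the local work */
    t = MPI_Wtime();
    compute_step();
    double t_compute = MPI_Wtime() - t;

    /* Time the wait at an explicit barrier placed just before the collective;
       this approximates the synchronization delay caused by imbalance */
    t = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);
    double t_wait = MPI_Wtime() - t;

    /* The collective itself */
    t = MPI_Wtime();
    MPI_Allreduce(MPI_IN_PLACE, &local_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
    double t_coll = MPI_Wtime() - t;

    /* Report the worst case over all ranks */
    double max_wait, max_comp, max_coll;
    MPI_Reduce(&t_wait,    &max_wait, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t_compute, &max_comp, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t_coll,    &max_coll, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("compute %.3g s, barrier wait %.3g s, collective %.3g s (max over ranks)\n",
               max_comp, max_wait, max_coll);

    MPI_Finalize();
    return 0;
}
```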
Subtle latency costs for large N
Even well-balanced algorithms can suffer millisecond delays from context switches, interruptions from hardware, or other operating system events. In this context, these events are called noise, and their importance can be difficult to quantify (P. Beckman et al., 2008).
| Source | Magnitude | Example |
| --- | --- | --- |
| Cache miss | 100 ns | accessing next row of a C array |
| TLB miss | 100 ns | accessing a rarely used variable |
| HW interrupt | 1 µs | network packet arrives |
| Page table entry miss | 1 µs | accessing newly allocated memory |
| Timer update | 1 µs | process scheduler runs |
| Page fault | 10 µs | modifying a variable after fork() |
| Swap in | 10 ms | accessing load-on-demand data |
| Pre-emption | 10 ms | another process runs |
A delay of 0.2 ms doesn't seem bad if it happens only once per minute. But given the large numbers of tasks possible on Frontera, statistics dictate that such a delay will be hitting at least one of the tasks far more often than that. How does this affect your application? It increases the latency of the MPI collective calls, because the other tasks must wait ever more frequently for a random slowpoke to catch up.
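As a rough back-of-the-envelope estimate (the numbers are assumed purely for illustration): if each task independently suffers a 0.2 ms delay once per minute, a job with 100,000 tasks experiences about 100,000 / 60 ≈ 1,700 such delays per second somewhere in the system. A collective that runs even ten times per second will therefore almost always catch at least one delayed task, and the whole job inherits that task's 0.2 ms.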
More catastrophic noise can occur in a system as well, though it is an aberration. Full preemption of the main task, or delays due to I/O, can cost anywhere from thousands of cycles to whole seconds. Such events show up most clearly in traces of MPI calls, and you will most likely pay these (rarer) costs at the synchronization points associated with global communication.
Even for MPI calls that aren't global in nature, large numbers of tasks can slow MPI down, because each connection consumes memory and other resources on the network card. As the number of tasks increases from 32 to thousands, the latency of sending MPI messages can increase by a factor of eight (Amith R. Mamidala et al., 2007).
Most MPI codes will find it difficult or impossible to avoid these kinds of extra costs at large N. This is one motivation for going hybrid: mixing MPI-parallel coding with multithreading techniques to achieve improved scalability.
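A minimal hybrid sketch might look like the following: one MPI task per node (or per socket) with OpenMP threads inside it, so that only the MPI tasks hold network connections while the threads share memory. The loop body is a placeholder; a real code would bind threads to cores and overlap far more work with communication.

```c
/* Minimal hybrid MPI+OpenMP sketch. Only the main thread of each task
 * makes MPI calls (MPI_THREAD_FUNNELED), so the number of network
 * endpoints grows with nodes rather than cores. The work loop is a
 * placeholder. Compile with an MPI wrapper plus -fopenmp or equivalent. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Request FUNNELED support: only the main thread calls MPI */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (provided < MPI_THREAD_FUNNELED && rank == 0)
        printf("warning: requested MPI_THREAD_FUNNELED, got level %d\n", provided);

    /* Threads share the task's memory; no MPI traffic among them */
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0 / (1.0 + i);

    /* One MPI call per task, regardless of the thread count */
    double global_sum;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f (threads per task: %d)\n",
               global_sum, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```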