Synchronization Overhead
Any time a task spends waiting for another task is considered synchronization overhead. For instance, tasks may synchronize at an explicit barrier where they all finish the computations for a simulation timestep before updating shared data and computing the next timestep. In this case, the slowest task determines the speed of the whole calculation. Synchronization can be more subtle, such as when one task must wait for another to update a global array variable or when a task waits for another to finish writing a file to disk.
A waiting task executes wait or test functions, blocking the process assigned to the task from performing additional work. This waiting time is called synchronization overhead. For distributed memory programs, MPI profiling tools can help identify where synchronization problems occur. More than a few microseconds spent in MPI_Barrier(), or in other wait or test functions, is significant synchronization overhead. Additional synchronization time may be hidden in send/receive functions or collective communication. The time these latter calls should take is roughly the amount of data transferred divided by the bandwidth of the network; anything in excess of that is either system overhead or synchronization overhead.
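As a rough illustration, the sketch below times an MPI_Barrier() with MPI_Wtime() so each rank can see how long it waits for the slowest task. The do_timestep() routine is an artificial, deliberately imbalanced stand-in for real per-rank work, not anything from a particular code; with it, ranks that finish sooner will report correspondingly longer barrier wait times.

```c
/* Minimal sketch: estimate synchronization overhead by timing a barrier.
 * do_timestep() is a hypothetical placeholder for real, unevenly balanced work. */
#include <mpi.h>
#include <stdio.h>

/* Artificial, uneven work so the ranks reach the barrier at different times. */
static void do_timestep(int rank)
{
    double x = 0.0;
    for (long i = 0; i < 1000000L * (rank + 1); i++)
        x += (double)i * 1e-9;
    if (x < 0.0) printf("%f\n", x);   /* keep the loop from being optimized away */
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    do_timestep(rank);                      /* per-rank computation for this step */

    double t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);            /* every rank waits for the slowest */
    double wait = MPI_Wtime() - t0;

    printf("rank %d waited %.6f s at the barrier\n", rank, wait);

    MPI_Finalize();
    return 0;
}
```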
Minimizing synchronization overhead is an important part of making a program efficient. Because synchronization overhead tends to grow rapidly as the number of tasks in a parallel job increases, it is often the dominant factor in obtaining good scaling behavior for your parallel program. Networks for large parallel computers are designed to minimize the time required to synchronize processes, but careful software design remains crucial.
Avoiding synchronization overhead starts with good program design. As explained in Data Dependency, dependent calculations should be grouped into the same thread of execution if possible. If data need to be communicated between tasks, it is best to compute the data that will be sent as early as possible and to initiate a non-blocking send for those data before doing other, less critical calculations. That way, the dependent task waits the least possible time for the data it needs before proceeding. Similarly, a task that needs data from another task should post a non-blocking receive as early as possible, so the data can be transferred in the background as soon as it is ready; that task should then go ahead with other calculations that do not depend on the requested data. Organizing your computation so that calculations with data dependencies are separated from those without, posting the sends and receives early, and doing the waits late allows the tasks to be temporarily out of sync without either of them having to wait, thus avoiding synchronization overhead.
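The sketch below is one minimal way to express this pattern with two MPI ranks: the data the other rank depends on is computed first and sent with a non-blocking MPI_Isend(), the matching MPI_Irecv() is posted early, and both ranks do unrelated work before calling MPI_Wait() as late as possible. The compute_boundary() and independent_work() routines are hypothetical placeholders for the real calculations.

```c
/* Sketch of "send early, wait late", assuming exactly two ranks.
 * Build with mpicc and run with mpirun -np 2. */
#include <mpi.h>
#include <stdio.h>

#define N 1000

static void compute_boundary(double *buf)      /* data the other rank will need */
{
    for (int i = 0; i < N; i++) buf[i] = (double)i;
}

static double independent_work(void)           /* work with no dependency on buf */
{
    double s = 0.0;
    for (long i = 0; i < 5000000L; i++) s += 1.0 / (double)(i + 1);
    return s;
}

int main(int argc, char **argv)
{
    int rank;
    double buf[N];
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        compute_boundary(buf);                         /* critical data first */
        MPI_Isend(buf, N, MPI_DOUBLE, 1, 0,
                  MPI_COMM_WORLD, &req);               /* send as early as possible */
        double s = independent_work();                 /* overlap with the transfer */
        MPI_Wait(&req, MPI_STATUS_IGNORE);             /* wait as late as possible */
        printf("rank 0 done, s = %f\n", s);
    } else if (rank == 1) {
        MPI_Irecv(buf, N, MPI_DOUBLE, 0, 0,
                  MPI_COMM_WORLD, &req);               /* post the receive early */
        double s = independent_work();                 /* overlap with the transfer */
        MPI_Wait(&req, MPI_STATUS_IGNORE);             /* now buf is safe to use */
        printf("rank 1 done, s = %f, buf[0] = %f\n", s, buf[0]);
    }

    MPI_Finalize();
    return 0;
}
```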
Another way to reduce synchronization overhead is to avoid using MPI_Barrier(). This particular function is very rarely needed, except for applications with graphical user interfaces. Beginning parallel programmers like to put in MPI_Barrier() calls so they can assure themselves that they know where every task is before proceeding, but before the program goes into production, all of those calls should be removed. Normally, data dependencies between tasks can be handled by the semantics of the individual message-passing calls themselves.
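For instance, in the hypothetical exchange sketched below, the blocking MPI_Recv() on rank 1 cannot complete until rank 0's data has actually arrived, so no MPI_Barrier() is needed to line the two ranks up first; the receive itself enforces the dependency.

```c
/* Sketch, assuming rank 0 produces a value that rank 1 needs next:
 * the blocking receive already guarantees the data is present. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double value = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 3.14;                                        /* updated result */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* No MPI_Barrier: the receive completes only after the data arrives. */
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}
```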