The messages described in the previous section are called point-to-point messages because data are sent from a single sending task to a single receiving task. To communicate data from one task to all of the others using point-to-point messages, you might write a loop containing send calls that transmit the same buffer to each of the other tasks in turn. However, this is cumbersome. The alternative is to use MPI_Bcast(), one of the collective communication functions defined by the MPI interface. Each task involved in the broadcast must make the same function call with compatible arguments (for example, all tasks must agree on which task holds the data to be broadcast). A short sketch contrasting the two approaches follows.
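As a concrete sketch (in C, with an arbitrary count of 100 doubles chosen purely for illustration), the commented-out loop shows the point-to-point approach, while the single MPI_Bcast() call distributes the same buffer from rank 0 to every task:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        double data[100];                 /* 100 doubles is an arbitrary, illustrative size */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)
            for (int i = 0; i < 100; i++) data[i] = (double)i;   /* root fills the buffer */

        /* Cumbersome point-to-point version: the root sends to each task in turn
        if (rank == 0) {
            for (int dest = 1; dest < size; dest++)
                MPI_Send(data, 100, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
        } else {
            MPI_Recv(data, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        */

        /* Collective version: every task makes the same call, naming rank 0 as the root */
        MPI_Bcast(data, 100, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == size - 1)
            printf("last rank received data[99] = %f\n", data[99]);

        MPI_Finalize();
        return 0;
    }

Note that MPI_Bcast() takes no tag argument and is not paired with a matching receive; every task in the communicator simply makes the same call.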

Though implementation details vary, broadcasting commonly relies on a tree algorithm. Broadcasting to a large number of tasks using a tree algorithm is far more efficient than sending messages to each task in a single, large for loop. If the communication fabric of the computer has nonuniform behavior, the MPI implementation may adjust its collective communication algorithms to deliver the data as efficiently as possible.

On the left, node zero broadcasts to 14 other nodes, one at a time. On the right, broadcasting is organized as a tree, so each node that receives the message sends it to two other nodes until all nodes have received it.
Broadcasting using a tree structure is more efficient than a one-to-all structure.

MPI_Bcast() is just one of many data distribution functions in the MPI interface. For example, MPI_Scatter() distributes different blocks of data from one task to a set of other tasks. MPI_Gather() provides the inverse functionality by reassembling a distributed array in a single task. Other collective communication routines allow for simultaneous exchanges of data between all possible pairs of tasks (MPI_Alltoall()).
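To illustrate, here is a minimal sketch, assuming an illustrative block length of 4 doubles per task, in which rank 0 scatters equal blocks of an array to all tasks, each task works on its block, and MPI_Gather() reassembles the results on rank 0:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        const int block = 4;              /* illustrative block length per task */
        double *full = NULL, part[4];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            full = malloc(size * block * sizeof(double));   /* root holds the full array */
            for (int i = 0; i < size * block; i++) full[i] = (double)i;
        }

        /* Distribute one block of the root's array to each task (including the root) */
        MPI_Scatter(full, block, MPI_DOUBLE, part, block, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        for (int i = 0; i < block; i++) part[i] *= 2.0;     /* local work on the block */

        /* Inverse operation: reassemble the blocks into the root's array */
        MPI_Gather(part, block, MPI_DOUBLE, full, block, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0) free(full);
        MPI_Finalize();
        return 0;
    }

When the blocks are not all the same length, the variable-count versions MPI_Scatterv() and MPI_Gatherv() serve the same purpose.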

Collective communication is not limited to data distribution. Some functions, like MPI_Barrier(), are used to synchronize different processes. Other functions perform computation. For example, MPI_Reduce() performs arithmetic on corresponding elements of arrays held by different tasks. The table below lists the main synchronization, data movement, and computation functions defined by MPI, and a short reduction sketch follows it.

Major categories of collective communication calls defined by MPI. The CVW MPI Collective Communications topic covers collective communication among MPI tasks in more detail. *Non-blocking collective communication was introduced in MPI-3.
Type                  Blocking Routines             Nonblocking Routines*
Synchronization       MPI_Barrier                   MPI_Ibarrier
Data Movement         MPI_Bcast                     MPI_Ibcast
                      MPI_Gather                    MPI_Igather
                      MPI_Scatter                   MPI_Iscatter
                      MPI_Gatherv                   MPI_Igatherv
                      MPI_Scatterv                  MPI_Iscatterv
                      MPI_Allgather                 MPI_Iallgather
                      MPI_Allgatherv                MPI_Iallgatherv
                      MPI_Alltoall                  MPI_Ialltoall
                      MPI_Alltoallv                 MPI_Ialltoallv
Global Computation    MPI_Reduce                    MPI_Ireduce
                      MPI_Allreduce                 MPI_Iallreduce
                      MPI_Reduce_scatter_block      MPI_Ireduce_scatter_block
                      MPI_Reduce_scatter            MPI_Ireduce_scatter
                      MPI_Scan                      MPI_Iscan
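As a brief illustration of the global computation category, the following sketch (with an illustrative array length of 8) uses MPI_Reduce() to sum corresponding array elements across all tasks into a result array on rank 0:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank;
        double local[8], total[8];        /* 8 elements is an illustrative length */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < 8; i++) local[i] = (double)(rank + i);   /* per-task values */

        /* Element-wise sum across all tasks; the result appears only on rank 0 */
        MPI_Reduce(local, total, 8, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) printf("total[0] = %f\n", total[0]);

        MPI_Finalize();
        return 0;
    }

MPI_Allreduce() works the same way except that every task receives the result.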

MPI collective communication functions can be blocking or non-blocking. A blocking call that receives data returns only after its buffer has received the data, and a blocking call that sends data waits until the data have been copied out of its buffer. Because of this implicit synchronization, a programmer should minimize the number of blocking collective communication calls. For example, if a program must broadcast several chunks of noncontiguous data, it is likely to be advantageous to pack them into a single buffer, broadcast that buffer once, and unpack the contents on the receiving tasks.

Copying data into a secondary buffer allows the programmer to use non-blocking collective communication calls; subsequent program logic can change data in the original locations without concern about whether the collective communication is complete.
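The following sketch illustrates that pattern, with hypothetical arrays a and b standing in for the noncontiguous chunks and a packed scratch buffer carrying the broadcast; MPI_Wait() completes the non-blocking MPI_Ibcast() before the scratch buffer is reused:

    #include <mpi.h>
    #include <string.h>

    int main(int argc, char *argv[])
    {
        int rank;
        double a[50], b[30];              /* hypothetical noncontiguous chunks */
        double packed[80];                /* secondary buffer used for the broadcast */
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < 50; i++) a[i] = 1.0 * i;   /* rank 0 fills the chunks */
            for (int i = 0; i < 30; i++) b[i] = 2.0 * i;
            memcpy(packed,      a, 50 * sizeof(double));   /* pack both into one buffer */
            memcpy(packed + 50, b, 30 * sizeof(double));
        }

        /* Start the broadcast of the packed buffer without blocking */
        MPI_Ibcast(packed, 80, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);

        /* Rank 0 may now modify a and b; only the packed buffer must stay untouched
           until the request completes */

        MPI_Wait(&req, MPI_STATUS_IGNORE);                 /* complete the collective */

        if (rank != 0) {
            memcpy(a, packed,      50 * sizeof(double));   /* unpack at the recipients */
            memcpy(b, packed + 50, 30 * sizeof(double));
        }

        MPI_Finalize();
        return 0;
    }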

The main advantages of collective communication functions are (1) ease of use and (2) efficiency. Few programmers would go to the trouble of writing their own tree-style broadcast communication algorithm when efficient, well-tested alternatives exist.

 