The messages described in the previous section are called point-to-point messages because data are sent from a single sending task to a single receiving task. To communicate data from one task to all of the others using point-to-point messages, you might write a loop containing send calls that transmit the same buffer to each of the other tasks in turn. However, this is cumbersome. The alternative is to use MPI_Bcast(), one of the collective communication functions defined by the MPI interface. Each task involved in the broadcast must make the same function call with compatible arguments (for example, all tasks must agree on which task holds the data to be broadcast). A short sketch contrasting the two approaches follows.
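As a concrete sketch (in C, with an arbitrary count of 100 doubles chosen purely for illustration), the commented-out loop shows the point-to-point approach, while the single MPI_Bcast() call distributes the same buffer from rank 0 to every task:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        double data[100];                 /* 100 doubles is an arbitrary, illustrative size */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)
            for (int i = 0; i < 100; i++) data[i] = (double)i;   /* root fills the buffer */

        /* Cumbersome point-to-point version: the root sends to each task in turn
        if (rank == 0) {
            for (int dest = 1; dest < size; dest++)
                MPI_Send(data, 100, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
        } else {
            MPI_Recv(data, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        */

        /* Collective version: every task makes the same call, naming rank 0 as the root */
        MPI_Bcast(data, 100, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == size - 1)
            printf("last rank received data[99] = %f\n", data[99]);

        MPI_Finalize();
        return 0;
    }

Note that MPI_Bcast() takes no tag argument and is not paired with a matching receive; every task in the communicator simply makes the same call.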

Though implementation details vary, broadcasting commonly relies on a tree algorithm. Broadcasting to a large number of tasks using a tree algorithm is far more efficient than sending messages to each task in a single, large for loop. If the communication fabric of the computer has nonuniform behavior, the MPI implementation may adjust its collective communication algorithms to deliver the data as efficiently as possible.

On the left, node zero broadcasts to 14 other nodes, one at a time. On the right, broadcasting is organized as a tree, so each node that receives the message sends it to two other nodes until all nodes have received it.
Broadcasting using a tree structure is more efficient than a one-to-all structure.

MPI_Bcast() is just one of many data distribution functions in the MPI interface. For example, MPI_Scatter() distributes different blocks of data from one task to a set of other tasks. MPI_Gather() provides the inverse functionality by reassembling a distributed array in a single task. Other collective communication routines allow for simultaneous exchanges of data between all possible pairs of tasks (MPI_Alltoall()).
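To illustrate, here is a minimal sketch, assuming an illustrative block length of 4 doubles per task, in which rank 0 scatters equal blocks of an array to all tasks, each task works on its block, and MPI_Gather() reassembles the results on rank 0:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        const int block = 4;              /* illustrative block length per task */
        double *full = NULL, part[4];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            full = malloc(size * block * sizeof(double));   /* root holds the full array */
            for (int i = 0; i < size * block; i++) full[i] = (double)i;
        }

        /* Distribute one block of the root's array to each task (including the root) */
        MPI_Scatter(full, block, MPI_DOUBLE, part, block, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        for (int i = 0; i < block; i++) part[i] *= 2.0;     /* local work on the block */

        /* Inverse operation: reassemble the blocks into the root's array */
        MPI_Gather(part, block, MPI_DOUBLE, full, block, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0) free(full);
        MPI_Finalize();
        return 0;
    }

When the blocks are not all the same length, the variable-count versions MPI_Scatterv() and MPI_Gatherv() serve the same purpose.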

Collective communication is not limited to data distribution. Some functions, like MPI_Barrier(), are used to synchronize different processes. Other functions perform computation. For example, MPI_Reduce() performs arithmetic on corresponding elements of arrays held by different tasks. The table below lists the main synchronization, data movement, and computation functions defined by MPI, and a short reduction sketch follows it.

Major categories of collective communication calls defined by MPI. The CVW MPI Collective Communications topic covers collective communication among MPI tasks in more detail. *Non-blocking collective communication was introduced in MPI-3.
Type                  Blocking Routines             Nonblocking Routines*
Synchronization       MPI_Barrier                   MPI_Ibarrier
Data Movement         MPI_Bcast                     MPI_Ibcast
                      MPI_Gather                    MPI_Igather
                      MPI_Scatter                   MPI_Iscatter
                      MPI_Gatherv                   MPI_Igatherv
                      MPI_Scatterv                  MPI_Iscatterv
                      MPI_Allgather                 MPI_Iallgather
                      MPI_Allgatherv                MPI_Iallgatherv
                      MPI_Alltoall                  MPI_Ialltoall
                      MPI_Alltoallv                 MPI_Ialltoallv
Global Computation    MPI_Reduce                    MPI_Ireduce
                      MPI_Allreduce                 MPI_Iallreduce
                      MPI_Reduce_scatter_block      MPI_Ireduce_scatter_block
                      MPI_Reduce_scatter            MPI_Ireduce_scatter
                      MPI_Scan                      MPI_Iscan
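As a brief illustration of the global computation category, the following sketch (with an illustrative array length of 8) uses MPI_Reduce() to sum corresponding array elements across all tasks into a result array on rank 0:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank;
        double local[8], total[8];        /* 8 elements is an illustrative length */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < 8; i++) local[i] = (double)(rank + i);   /* per-task values */

        /* Element-wise sum across all tasks; the result appears only on rank 0 */
        MPI_Reduce(local, total, 8, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) printf("total[0] = %f\n", total[0]);

        MPI_Finalize();
        return 0;
    }

MPI_Allreduce() works the same way except that every task receives the result.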

MPI collective communication functions can be blocking or non-blocking. A blocking call that receives data returns only after its buffer has received the data, and a blocking call that sends data waits until the data have been copied out of its buffer. Because of this implicit synchronization, a programmer should minimize the number of blocking collective communication calls. For example, if a program must broadcast several chunks of noncontiguous data, it is likely to be advantageous to pack them into a single buffer, broadcast that buffer once, and unpack the contents on the receiving tasks.

Copying data into a secondary buffer allows the programmer to use non-blocking collective communication calls; subsequent program logic can change data in the original locations without concern about whether the collective communication is complete.
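The following sketch illustrates that pattern, with hypothetical arrays a and b standing in for the noncontiguous chunks and a packed scratch buffer carrying the broadcast; MPI_Wait() completes the non-blocking MPI_Ibcast() before the scratch buffer is reused:

    #include <mpi.h>
    #include <string.h>

    int main(int argc, char *argv[])
    {
        int rank;
        double a[50], b[30];              /* hypothetical noncontiguous chunks */
        double packed[80];                /* secondary buffer used for the broadcast */
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < 50; i++) a[i] = 1.0 * i;   /* rank 0 fills the chunks */
            for (int i = 0; i < 30; i++) b[i] = 2.0 * i;
            memcpy(packed,      a, 50 * sizeof(double));   /* pack both into one buffer */
            memcpy(packed + 50, b, 30 * sizeof(double));
        }

        /* Start the broadcast of the packed buffer without blocking */
        MPI_Ibcast(packed, 80, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);

        /* Rank 0 may now modify a and b; only the packed buffer must stay untouched
           until the request completes */

        MPI_Wait(&req, MPI_STATUS_IGNORE);                 /* complete the collective */

        if (rank != 0) {
            memcpy(a, packed,      50 * sizeof(double));   /* unpack at the recipients */
            memcpy(b, packed + 50, 30 * sizeof(double));
        }

        MPI_Finalize();
        return 0;
    }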

The main advantages of collective communication functions are (1) ease of use and (2) efficiency. Few programmers would go to the trouble of writing their own tree-style broadcast communication algorithm when efficient, well-tested alternatives exist.

 