Post-Start-Complete-Wait
One of the biggest problems with MPI_Win_fence is that it synchronizes the entire communicator. This is fine for truly collective phases, but in many cases it is advantageous to specify exactly which processes actually need to communicate. For this purpose, MPI provides the MPI_Win_start and MPI_Win_complete functions.
int MPI_Win_start(MPI_Group group, int assert, MPI_Win win)
/* access epoch: communication between members of group */
int MPI_Win_complete(MPI_Win win)
- group - the MPI_Group of processes that will communicate during this epoch
- assert - assertions used for optimization (0 for none)
- win - the window the epoch will act on
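As a minimal sketch of the origin side, the fragment below wraps one access epoch in a helper function. The function name `put_to_rank0` and the choice of rank 0 as the target are our assumptions for illustration; the sketch assumes a window `win` was already created over MPI_COMM_WORLD with at least one int of exposed memory on rank 0.

```c
#include <mpi.h>

/* Origin side: an access epoch targeting rank 0 only.
 * Assumes `win` was created earlier with MPI_Win_create and that
 * rank 0 exposes a buffer of at least one int at displacement 0.
 * (Helper name and target rank are illustrative, not part of MPI.) */
void put_to_rank0(MPI_Win win, int value)
{
    MPI_Group world_group, target_group;
    int target_rank = 0;

    /* Build the group containing only the target process. */
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_incl(world_group, 1, &target_rank, &target_group);

    /* Begin the access epoch: only members of target_group may be
     * targets of RMA calls until MPI_Win_complete returns. */
    MPI_Win_start(target_group, 0, win);

    /* Put one int at displacement 0 in rank 0's window. */
    MPI_Put(&value, 1, MPI_INT, target_rank, 0, 1, MPI_INT, win);

    /* End the access epoch; the transfer is complete at the origin
     * once this returns. */
    MPI_Win_complete(win);

    MPI_Group_free(&target_group);
    MPI_Group_free(&world_group);
}
```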
These two function calls look much like calls to MPI_Win_fence, but they add the ability to specify an arbitrary group of processes involved in the communication. Also, these calls appear only on the origin (calling) process; the target process defines its side of the RMA exchange separately, by calling the corresponding post and wait functions:
int MPI_Win_post(MPI_Group group, int assert, MPI_Win win)
/* exposure epoch: target window exposed */
int MPI_Win_wait(MPI_Win win)
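The matching target side can be sketched the same way. Here we assume (purely for illustration) that ranks 1 and 2 are the origins, and that `win` was created over the buffer this process wants to expose; the helper name `expose_to_origins` is ours.

```c
#include <mpi.h>

/* Target side: an exposure epoch matching the origins' start/complete.
 * Assumes ranks 1 and 2 are the origin processes (an assumption for
 * illustration) and `win` covers the buffer being exposed. */
void expose_to_origins(MPI_Win win)
{
    MPI_Group world_group, origin_group;
    int origin_ranks[2] = {1, 2};

    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_incl(world_group, 2, origin_ranks, &origin_group);

    /* Expose the local window to RMA calls from origin_group. */
    MPI_Win_post(origin_group, 0, win);

    /* The target may do unrelated local work here, but must not
     * read or write the exposed buffer during the epoch. */

    /* Block until every origin in the group has called
     * MPI_Win_complete (think MPI_Waitall). */
    MPI_Win_wait(win);

    MPI_Group_free(&origin_group);
    MPI_Group_free(&world_group);
}
```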
Now that we have introduced these functions, it makes sense to distinguish between an access epoch and an exposure epoch. Fence obscures this difference for simplicity of use. Here, however, we clearly see that post and wait define an exposure epoch, the time in which a target window is exposed for RMA calls. On the other side, start and complete define an access epoch, where RMA communication calls may be executed.
This paradigm should look and feel a lot like non-blocking point-to-point communication. The call to complete will not return until the communication calls in that access epoch have completed (think MPI_Waitall). RMA transfers will not begin until post has been called (think MPI_Irecv). The underlying implementations, together with specific assertions, allow some flexibility in behavior here. For example, an implementation may have MPI_Win_start block until the corresponding MPI_Win_post has been called, rather than return immediately.
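As a concrete illustration of the assert parameter, MPI_MODE_NOCHECK tells the implementation that the program guarantees, by some other means (say, an earlier barrier or message), that the post precedes the matching start, so the handshake can be skipped. This fragment assumes the groups were built as in the earlier discussion; note the two calls run on different processes:

```c
/* On the target: assert that no origin will call MPI_Win_start
 * before this post has executed. */
MPI_Win_post(origin_group, MPI_MODE_NOCHECK, win);

/* On each origin: the same assertion must be made here, too --
 * MPI_MODE_NOCHECK is only valid when both sides pass it. */
MPI_Win_start(target_group, MPI_MODE_NOCHECK, win);
```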
While the fundamental paradigm is that of non-blocking point-to-point communication, there are differences. Instead of receiving specific messages, the target process simply exposes a chunk of memory that may be altered several times by multiple processes. Another important distinction is that many RMA calls can be ganged together inside a single access epoch, which helps increase efficiency (by minimizing synchronization time).
The simple example below demonstrates how to use these functions so that both processes 1 and 2 put data to process 0.