Considerations
Now that we have introduced and described how to use RMA calls, it's worth saying a few things about when to use them. Clearly, they're a bit more complicated than the standard point-to-point operations, so there must be some reason for their existence. Really, there are two main reasons why RMA is a useful tool to have in your MPI programming toolbox.
- Speed - You probably saw this one coming. One-sided communication methods can be significantly faster than point-to-point calls, though there are, of course, a number of caveats to this statement, which are described below.
- Ease of Programming - While MPI_Send may initially seem easier than Start/Complete/Wait/..., that isn't necessarily true. If your application truly has interspersed communication and computation sections, then fence and start/complete are very easy to use, and you may find that they free you from worrying about message tags and matching up sends and receives. And certainly, those with programming experience on shared-memory systems will find one-sided communication a more natural way to program.
Any claim of improved performance from one-sided communication, however, must come with several qualifications.
- The efficiency of RMA is highly implementation-dependent. It would be quite possible to write a rather poor implementation of RMA built on nothing more than Barrier and Isend/Irecv. Intel MPI on Stampede2 provides efficient implementations of RMA calls (and improvements continue), so you can actually see significant gains in performance.
- Speed is also hardware-dependent. Low-latency fabrics like Omni-Path natively support RMA operations, and a suitable MPI implementation can take advantage of these operations to boost performance.
- To realize better performance, RMA has to be used properly. The three synchronization methods that were demonstrated differ in how selectively they synchronize. Using fence on thousands of processes so that 10 of them can communicate will be highly inefficient. Conversely, if all 1000 processes need to communicate, using lock will be difficult to program correctly and probably inefficient.
When you do write code that exploits one-sided communication, you should keep a couple of things in mind. First, check the assertion arguments that the synchronization calls accept; these hints from the programmer can let you take advantage of optimizations that the implementer may have provided. Second, much of the speed improvement from RMA comes from the ability to make many communication calls within just a single pair of synchronization calls. Therefore, be sure to group your calls accordingly.