Overlapping Work and MPI
Often, just a few processes or threads are capable of saturating the network connection to a node. This would be true of an Omni-Path network link on Stampede3, for example (according to Intel Omni-Path documentation, Sec. 4.9.4). In such cases:
- It may not be optimal to use all the cores to communicate
- Instead, dedicate just one core, or a few cores, to communication
- The remaining cores can do computational work while communication is in progress
- Must have support for MPI_THREAD_FUNNELED or higher to do this
- Can be difficult to manage and load-balance!
The code example above assumes that thread == 0 is sufficient to identify the main thread. However, as we noted in the discussion of funneled MPI, it may be safer in some situations to test the value of the flag returned by a call to MPI_Is_thread_main().
If the MPI communication is expected to be quick compared to the work that the threads are doing, then it might be desirable to have the main thread assist with the work. To enable that, you would move the work out of the else block and (optionally) replace the if block with an omp master construct, since the latter has no implied barrier on entry or exit.
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)