Going Hybrid
The question now becomes: how does an application make the best use of the bandwidth available within a node, and especially within each socket on a node? For MPI-only applications having regular communication topologies, there is an easy strategy that can be fairly effective: make sure neighbors are grouped together on the same processor, because neighbors tend to communicate more with one another. By default on Frontera, consecutive MPI ranks (neighbors, often) are assigned to fill one processor on the first node, then the other processor on that node, then the first processor on the next node, and so on. With this pattern as the default, intra-node and even intra-socket message passing accounts for a far larger share of the traffic than unbiased statistics would predict.
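To check how this plays out in an actual run, MPI itself can report which ranks ended up together on a node. The following is a minimal sketch, not specific to Frontera: MPI_Comm_split_type with MPI_COMM_TYPE_SHARED groups the ranks that can share memory, i.e., those on the same node, so under the default block assignment, consecutive global ranks should appear as consecutive local ranks on the same host.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Group the ranks that can share memory, i.e., those on the same node */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node_comm);

    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    char name[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(name, &len);

    printf("global rank %3d of %3d is local rank %2d of %2d on %s\n",
           rank, size, node_rank, node_size, name);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}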
For many applications, though, a different approach to the communication among tasks on the same node must be considered, namely, shared-memory parallelism through OpenMP or some other form of multithreading. Adding a shared-memory capability can benefit an application's overall scalability. Among other advantages, it keeps MPI from having to launch and manage an excessive number of processes.
Processes and their threads
It is evident that on an architecture like Frontera's, an application benefits from a parallel strategy that distinguishes between tasks on the same node vs. tasks on different nodes, because the latter involves slower communication links. Furthermore, tasks on the same node can share their memory directly. Hybrid programming refers to coding techniques that make such a distinction, typically by using a combination of MPI and OpenMP. Each MPI process becomes the parent of multiple OpenMP threads, which share the overall process memory within a node.
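In code, the hybrid pattern is simply MPI with OpenMP parallel regions inside each process. Here is a minimal sketch of the idea; it assumes compilation with an MPI wrapper plus the compiler's OpenMP flag (for example, mpicc -fopenmp with GCC-based wrappers), with the team size set through OMP_NUM_THREADS as usual.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    /* Request FUNNELED support: only the master thread will make MPI calls */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each MPI process forks a team of OpenMP threads that share its memory */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("MPI rank %d, OpenMP thread %d of %d\n", rank, tid, nthreads);
    }

    MPI_Finalize();
    return 0;
}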
The topic of hybrid programming with MPI and OpenMP is covered in depth in its own roadmap. But one question that often comes up is: how many MPI processes and OpenMP threads ought to be launched per node? Should a single MPI process simply fork 56 OpenMP threads to cover the entire node? Or should there instead be two MPI processes per node, each one forking 28 threads? When an application effectively runs 56 independent copies of the same thread, the choice may make little difference. Suppose, however, that the application naturally allows groups of 28 threads to share memory. How likely is it that an MPI process and all 28 of its threads will get assigned to the same socket (i.e., processor), so that the 28 threads share memory on just that socket?
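One way to answer that question empirically is to have every thread report the core it is currently running on. The sketch below relies on the Linux-specific sched_getcpu() call from glibc; how core numbers map to sockets varies by system (it can be checked with a tool such as lscpu), so the output is raw data to be interpreted for your own node.

#define _GNU_SOURCE   /* for sched_getcpu() on Linux/glibc */
#include <sched.h>
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each thread reports the hardware core it is running on at this moment.
       Comparing the core ids against the node's socket numbering shows
       whether a process's threads all landed on the same socket. */
    #pragma omp parallel
    {
        printf("rank %d, thread %d -> core %d\n",
               rank, omp_get_thread_num(), sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}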
Let's start by looking at how a process and its threads are allocated memory on a machine that has more than one socket, each with its own local memory. Allocating resources is the job of the machine's operating system. Linux commits physical memory only when it is first written (the "first touch" policy), and by default it places that memory on the socket local to the thread doing the writing. So the easy answer is to ensure that the two processes start on different sockets; their threads will then naturally be scheduled on the same socket as the parent, and the memory they touch will be allocated there as well.
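This first-touch behavior suggests a common idiom, sketched below with an arbitrary array size: initialize a shared array using the same parallel loop structure that will later compute on it, so that each thread commits its own chunk of pages in the memory attached to its socket.

#include <stdlib.h>

int main(void)
{
    const long N = 50 * 1000 * 1000;

    /* malloc reserves virtual address space; no physical pages are committed yet */
    double *a = malloc(N * sizeof(double));
    if (!a) return 1;

    /* First touch: each thread writes its own chunk, so that chunk's pages
       are committed in the memory local to the thread's socket. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* Later loops with the same static schedule reuse mostly local memory */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 2.5 * i;

    free(a);
    return 0;
}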
This method would work well, except that the operating system reschedules processes and threads to different processors depending on workload. It is quite possible for a thread to move to a different socket, allocate more memory there, and be scheduled back to its original location, while still owning a chunk of memory on the remote socket. This is not necessarily a huge problem. Let's say the data are read-only: data will travel quickly through the L3, L2, and L1 caches of all the cores involved and stay in the caches as long as there's room.
If there isn't room in the caches, then a simple strategy would be to give each thread its own private copy of the data, since the operating system is less apt to move a thread after it has made a local memory allocation. OpenMP makes it easy to specify which variables are thread-private, and it might be worth the cost in memory and time that it takes to make extra private copies. But then we are no longer taking advantage of shared memory.
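As a rough sketch of that trade-off (the table size and access pattern here are purely illustrative), the firstprivate clause hands each thread its own initialized copy of an array, spending memory and copy time in exchange for accesses that are guaranteed to be local:

#include <stdio.h>
#include <omp.h>

#define TABLE_SIZE 4096

int main(void)
{
    /* Shared, read-only lookup table built once before the parallel region */
    double table[TABLE_SIZE];
    for (int i = 0; i < TABLE_SIZE; i++)
        table[i] = 1.0 / (i + 1);

    double total = 0.0;

    /* firstprivate copies the whole array into each thread's private storage,
       so every lookup below is satisfied from thread-local memory. */
    #pragma omp parallel firstprivate(table) reduction(+:total)
    {
        int tid = omp_get_thread_num();
        double sum = 0.0;
        for (int i = 0; i < TABLE_SIZE; i++)
            sum += table[(i * 7 + tid) % TABLE_SIZE];
        total += sum;
    }

    printf("total = %f\n", total);
    return 0;
}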
Jumpers on different sockets, sharers on same socket
The bottom line is that the 56 Cascade Lake cores on each Frontera node compete with each other for resources, especially memory. This means the best placement of threads on cores often depends on patterns of memory access. Threads with widely spaced or random access patterns into memory tend to benefit from being spread across both processors, allowing them to access all 4 memory controllers and reducing the chances of stalls. But threads that share memory more closely can benefit more from being on the same processor, because they can share its L3 cache.
To make that happen, we might want to exert direct control over how processes and threads get mapped onto the hardware. In effect, we'd like to take over the normal role of Linux in making such decisions. Fortunately there are several methods for doing this, as we will see on the next page.