Hybrid Strategies
How can we ensure that each MPI task gets uniquely assigned to a specific socket, so that its 28 threads (say) share memory on just that socket? Or, how do we spread memory allocations across both sockets, so we can exercise all the memory controllers at once? Here we discuss some of the techniques that go into arranging a hybrid MPI/OpenMP application.
Making local memory stay local: tying processes to sockets
Typically, one takes advantage of shell commands, library calls, and environment variables to pin threads and memory to particular cores and sockets. As described in the hybrid programming roadmap, the shell command numactl launches a process with chosen affinity and memory-allocation policies. The appropriate options for pinning a task to one socket are --cpunodebind (-N) and --localalloc (-l). On some systems, the taskset command offers similar CPU-binding capabilities, though without numactl's control over memory placement.
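For example, on a two-socket node, a command along the following lines (where ./myapp stands in for the real executable) confines a task to socket 0 and forces its allocations to come from that socket's memory:

    numactl --cpunodebind=0 --localalloc ./myapp

Under an MPI launcher, the launcher's own binding options, or a small wrapper script that picks a different node number for each rank, typically plays this role instead.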
Within an application, shared libraries provide APIs for finer-grained control over individual threads. Linux pthreads (the NPTL implementation in glibc) offers pthread_attr_setaffinity_np(), which stores the desired CPU set in an attribute object that can then be passed to pthread_create(), so a thread is pinned from the moment it starts. Finally, the Intel OpenMP runtime's environment variable KMP_AFFINITY sets the overall affinity pattern for a team of OpenMP threads; the "compact" type may be appropriate for teams of threads that share memory closely.
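As a concrete illustration, the following sketch (assuming Linux with glibc; core number 3 and the worker function are arbitrary placeholders) records the affinity in a pthread attribute object so the new thread is already pinned when pthread_create() starts it:

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE                  /* exposes the nonportable *_np affinity calls */
    #endif
    #include <pthread.h>
    #include <sched.h>

    void *worker(void *) {
        /* ... work that should stay on the chosen core/socket ... */
        return nullptr;
    }

    int main() {
        cpu_set_t cpus;
        CPU_ZERO(&cpus);
        CPU_SET(3, &cpus);               /* placeholder: restrict the thread to core 3 */

        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &cpus);

        pthread_t t;
        pthread_create(&t, &attr, worker, nullptr);   /* thread starts already pinned */
        pthread_join(t, nullptr);
        pthread_attr_destroy(&attr);
        return 0;
    }

With the Intel OpenMP runtime, setting KMP_AFFINITY=compact accomplishes something similar for a whole team of threads at once, packing them onto neighboring cores.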
Interleaved memory allocation
On the other hand, accessing memory off-socket can be preferable when multiple threads run through lots of memory, or when they access it in a fairly random manner ("jumpers"). In such cases, it can be best to run the process with interleaved memory. This policy can again be set with numactl, via --interleave (-i). Each allocation is then spread page by page across the NUMA nodes, so the load is balanced over the memory controllers and the average memory access time is reduced. Interleaving can also be helpful when the locations of the threads aren't easily controlled.
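Again treating ./myapp as a placeholder for the real executable, the interleave policy is requested like this:

    numactl --interleave=all ./myapp

With this setting, the process's pages are distributed round-robin across all the NUMA nodes rather than concentrated on one of them.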
It might be a good idea to experiment with KMP_AFFINITY here, as well. The "scatter" or "balanced" patterns may help to spread out the memory accesses.
Programming tricks in place of policy
Accessing memory in a smarter way isn't always a matter of finding the right wrapper or API to enforce a certain policy. Normally, C++ code using the Standard Template Library (STL) allocates memory in large chunks without regard to the NUMA layout. One technique for combining C++ STL code with OpenMP is to use segmented iterators: nested levels of iterators that step first over memory segments (for instance, one per memory bank or NUMA node), then over the array elements stored within each segment.
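The full segmented-iterator machinery is beyond the scope of a short example, but the simplified sketch below conveys the two-level idea, combined here with first-touch placement: the outer container holds one segment per memory domain, each segment is first touched (and therefore, with pinned threads, physically allocated) by the thread that will use it, and inner loops then run over the local segment only. The two-domain layout and the sizes are assumptions.

    #include <omp.h>
    #include <vector>
    #include <cstddef>

    int main() {
        const int num_domains = 2;                 /* e.g., one segment per socket */
        const std::size_t n_per_domain = 1 << 20;

        /* Outer level: one inner vector ("segment") per memory domain. */
        std::vector<std::vector<double>> segments(num_domains);

        #pragma omp parallel num_threads(num_domains)
        {
            int d = omp_get_thread_num();
            /* First touch: the thread that will use a segment also allocates
               and fills it, so its pages land in that thread's local memory. */
            segments[d].assign(n_per_domain, 1.0);
        }

        double sum = 0.0;
        #pragma omp parallel num_threads(num_domains) reduction(+:sum)
        {
            int d = omp_get_thread_num();
            /* Inner level: ordinary STL iteration within the local segment. */
            for (double x : segments[d])
                sum += x;
        }
        return sum > 0.0 ? 0 : 1;                  /* keep the result live */
    }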
But programming tricks may prove inadequate due to low-level code movement by the compiler, or because of unexpected side effects on cache reuse. Here is a somewhat complex example: in multi-threaded applications, the atomic operations behind locks generate cache-coherence ("snooping") traffic on the memory bus, and that traffic becomes considerably more expensive when it must cross sockets in a multi-processor system. To avoid it, one can use a class of algorithms called non-blocking, lock-free, or lockless synchronization. Algorithms of this type may be used to access shared linked lists or queues, or to provide safe memory reclamation in such data structures. However, they can introduce subtle slowdowns due to unintended reordering of operations by the compiler (or even the processor), or through interactions with cache behavior. And if programmed incorrectly, they may turn out to be no faster than the locks they replace.
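As a small, self-contained illustration of the non-blocking style, the sketch below pushes nodes onto a shared linked list with an atomic compare-and-swap instead of a mutex (C++11 atomics; the Node layout is arbitrary). Removal and safe memory reclamation are omitted on purpose, since those are exactly where the subtle problems mentioned above tend to appear.

    #include <atomic>

    struct Node {
        int   value;
        Node *next;
    };

    std::atomic<Node*> head{nullptr};

    void push(int value) {
        Node *n = new Node{value, head.load(std::memory_order_relaxed)};
        /* Retry until no other thread has changed head between our read and
           our swap; on failure, compare_exchange_weak refreshes n->next. */
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
            /* loop and try again */
        }
    }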
Summary
Memory access patterns are a key consideration in creating a hybrid application. Tasks that need to share memory will tend to go faster if they can be run as threads on the same node. They may go even faster if they can be broken into smaller tasks that run as threads on the same socket. However, if the tasks tend to jump around through lots of memory, it may help to interleave the memory allocations across all the sockets' memory on the node.
The motivation for creating a hybrid program is to expose and express the multiple levels of parallelism that may be essential to producing an efficient, scalable application. The resulting performance can make the added complexity worthwhile.