Moving data between and within nodes is the most time-intensive operation in HPC. Consequently, the architecture connecting processors and memory is an essential consideration in program design. To achieve optimal performance on an HPC system, a programmer should understand how the particular hardware organizes communication between the processor cores and the memory.

A typical node will have one or more processor chips, and each chip will contain multiple cores. Each core in a multicore processor has coherent access to the same memory as the other cores on the same node. In contrast, nodes in a compute cluster generally do not share a global memory, so the cores on one node cannot directly access memory on other nodes. Special programming strategies are needed to utilize memory on more than one node.

Figure: Two nodes with four cores each, connected by a communications network. Cores on the same node can share memory; cores on different nodes have access only to the memory on their respective nodes.

When a program runs on multiple cores with access to the same physical memory, it can use a shared memory design. In shared memory programming, the computation occurs within a single multithreaded process, and each thread within the process has access to memory in the same virtual address space. Communication among threads is efficient because any changes to shared memory are immediately visible to all other threads in that process.
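As a minimal sketch of this model, the following C fragment uses OpenMP (one common threading API, chosen here for illustration; the array name and size are arbitrary) to have several threads update different elements of a single array in the same address space:

#include <stdio.h>
#include <omp.h>

#define N 8

int main(void)
{
    double v[N];   /* one array, visible to every thread in the process */

    /* Each thread writes its share of the shared array directly;
       no explicit communication between threads is needed. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        v[i] = 2.0 * i;
    }

    /* Back in the serial region, the updates made by all threads
       are visible in the same memory. */
    for (int i = 0; i < N; i++) {
        printf("v[%d] = %g\n", i, v[i]);
    }
    return 0;
}

Compiled with an OpenMP-aware compiler (e.g., gcc -fopenmp), this program runs as a single process whose threads all read and write the same physical memory.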

In contrast, when a program uses multiple processes, such that computational tasks do not share memory, its processes cannot communicate by writing updated values directly to memory. This situation requires a distributed memory design. In distributed memory programming, each process owns part of the data, and other processes must send a message to the owner in order to update that part of the data. The communication could be between processes running on the same node, or between processes running on different nodes in a cluster, but the underlying communication model is the same. Standards such as the Message-Passing Interface (MPI) facilitate distributed memory programming on cluster systems.

Figure: The array v is divided into two halves, and each half is given to a separate process. The processes store their own part of the data in their own memory allocations. When a process (process 1) needs to update the value of an element belonging to a different process (process 0), it must send a message to the process that owns the data.
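The exchange in the figure might look like the following MPI sketch in C (the array size, tag convention, and updated value are illustrative assumptions, and the program assumes exactly two processes):

#include <stdio.h>
#include <mpi.h>

#define N 8   /* total length of the conceptual array v */

int main(int argc, char *argv[])
{
    int rank;
    double v_local[N/2] = {0.0};   /* each process owns half of v */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        /* Process 1 wants an element owned by process 0 to change,
           so it sends the new value to the owner; in this sketch the
           message tag carries the target index. */
        double new_value = 3.14;
        MPI_Send(&new_value, 1, MPI_DOUBLE, 0, /*tag=*/2, MPI_COMM_WORLD);
    } else if (rank == 0) {
        /* Process 0, the owner, receives the value and updates
           its own portion of the array. */
        double incoming;
        MPI_Status status;
        MPI_Recv(&incoming, 1, MPI_DOUBLE, 1, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        v_local[status.MPI_TAG] = incoming;
        printf("process 0: v[%d] is now %g\n", status.MPI_TAG, incoming);
    }

    MPI_Finalize();
    return 0;
}

Built with an MPI compiler wrapper such as mpicc and launched with two processes (e.g., mpiexec -n 2 ./a.out), the two processes may run on the same node or on different nodes; the message-passing code is the same either way.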

In systems like Stampede2, there are many nodes, each with its own cores and memory. The memory associated with one node is not directly accessible from another, so programs that run on multiple nodes must use distributed memory programming. Since each node has its own instance of the operating system, a distributed memory parallel program has at least one separate process on each node.

While it is possible to run multiple processes on the same node, each process still has its own virtual address space, so distributed memory programming is needed to coordinate processes even on a single node. Depending on the implementation of MPI, message transport between processes on the same node may actually travel via shared memory, but programmers should use distributed memory strategies to coordinate processes, whether there is one process per node or many.

It is also possible to use a hybrid strategy, where multithreaded processes designed with shared memory programming take advantage of the multiple cores on a single node, while simultaneously using distributed memory strategies to coordinate with processes on other nodes. This is sometimes referred to as hybrid programming. It allows the programmer to map the parallelism that exists in the program onto the characteristics of the system.
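A minimal sketch of this hybrid approach, assuming MPI combined with OpenMP threads (one common pairing, not the only option), looks like the following:

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int provided, rank;

    /* Request a threading level that lets OpenMP threads coexist with
       MPI; MPI_THREAD_FUNNELED means only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Shared memory parallelism within a node: OpenMP threads. */
    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    /* Distributed memory parallelism across nodes: MPI calls made by
       the main thread only, consistent with MPI_THREAD_FUNNELED. */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

Typically one MPI process is launched per node (or per processor chip), and each process spawns enough threads to occupy the cores available to it.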

 