Multi-Level Scaling
Steve Lantz, Andrew Dolgert (original)
Cornell Center for Advanced Computing
Revisions: 10/2023, 10/2014, 3/2012 (original)
In the prior topics of this roadmap, we examined the overarching design principles for scalability on large clusters, as well as common obstacles to achieving it. At the opposite end of the performance spectrum, one can delve into optimization guides that focus on chip-level tweaks. In reality, as the code sketch following this list illustrates, the whole range of scales matters when considering how to optimize performance on an HPC system like Frontera:
- Many thousands of interconnected nodes
- Multiple processors on a node
- Dozens of cores on a single processor
- Multiple execution units within a processor core
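To make the hierarchy concrete, the minimal sketch below shows how these levels map onto familiar programming constructs: MPI ranks spread across nodes and sockets, OpenMP threads spread across cores, and (not shown) vectorized loops filling the execution units within each core. It is a generic hybrid "hello world," not Frontera-specific code; it assumes a Linux system (for sched_getcpu) and an MPI library with thread support, and would typically be compiled with something like mpicc -fopenmp.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>   /* sched_getcpu() - Linux-specific */
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int provided, rank, nranks, namelen;
    char host[MPI_MAX_PROCESSOR_NAME];

    /* Ask for FUNNELED support: only the main thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    MPI_Get_processor_name(host, &namelen);

    /* Each OpenMP thread reports which core it is currently running on */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("Node %s: MPI rank %d of %d, thread %d of %d, on core %d\n",
               host, rank, nranks, tid, nthreads, sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}
```

Launched with a few ranks per node, the output makes the placement of every rank and thread visible at a glance, which is a useful starting point for the pinning techniques covered later in this topic.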
At every level in this hierarchy, parallelism is available and can be exploited. Yet parallelism is not the whole story of performance; data locality also matters. How close are the data in memory to the cores that will need them for the computation? A code may also stress various communication and I/O subsystems: MPI transfers across the network, database accesses, or calls to Web services could be important, as could the storage of data and logs.
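To illustrate the locality question, the sketch below relies on the "first touch" policy used by Linux and most other operating systems: a physical memory page is placed on the NUMA node of the thread that first writes to it. Initializing the data with the same thread decomposition as the later compute loop therefore keeps each thread's data close to the core that uses it. This is a generic OpenMP example, not Frontera-specific code.

```c
#include <stdlib.h>
#include <omp.h>

#define N 10000000   /* 10 million doubles, 80 MB per array */

int main(void)
{
    /* malloc reserves virtual address space; physical pages are not
       placed until they are first written ("first touch") */
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));

    /* Initialize in parallel with a static schedule so each page is
       first touched by the thread that will later operate on it */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) {
        a[i] = 0.0;
        b[i] = (double)i;
    }

    /* The compute loop uses the same decomposition, so each thread
       works mostly on memory local to its own socket */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 2.5 * b[i];

    free(a);
    free(b);
    return 0;
}
```

If the initialization loop were left serial, every page would be touched first by the main thread and would land on a single socket, and threads running on the other socket would pay a penalty for each remote memory access.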
On a system such as Frontera, overall performance optimization goes well beyond figuring out how to utilize 100% of the cycles of a given processor. Any one of the above factors could turn out to be the "long pole in the tent" that limits performance. In this topic, we look at some of the possibilities.
Objectives
After you complete this topic, you should be able to:
- Describe Frontera node architecture
- Explain how node architecture and network characteristics affect the bandwidth and latency between parallel tasks
- Distinguish between inter-node and intra-node communication
- Define the term hybrid programming
- Give reasons for assigning MPI tasks to specific sockets
- Describe how the operating system allocates memory to threads
- List memory-access considerations when setting up multiple threads to run on the same node
- Identify techniques for pinning threads and memory to particular cores and sockets
- Define the term interleaved memory
Prerequisites
There are no prerequisites for this topic. However, before you attempt to optimize your code for any particular HPC system, you should become familiar with the specific architecture and hardware characteristics of that system. You should also be aware of profiling and debugging tools that are available for it.
For Frontera at TACC, good references are the Virtual Workshop roadmaps on Parallel Programming Concepts and Vectorization, but it is not necessary to complete those roadmaps before diving into this one.