In the course of this roadmap, we have explored some features of advanced clusters such as Stampede3 and Frontera at TACC, as well as the Intel Xeon Scalable Processors that power those systems. In addition, the 3D scaling exercise revealed some further implications for performance on these machines. To summarize, when developing or porting your application for the SKX processors on Stampede3 or the CLX processors on Frontera, here are the key aspects to consider:

  • Multi-core processors: 48 cores per SKX node on Stampede3, and 56 cores per CLX node on Frontera
  • Wider vector units: two 512-bit vector units within each core, supporting the AVX-512 instruction set
  • Re-architected caches: a larger private L2 cache on each core and a shared, non-inclusive L3, plus a Mesh Interconnect within the processor to support efficient data movement (a quick way to confirm the core count and vector support on a node is sketched just after this list)
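
To see some of these features for yourself, a small test program can help. The following is just a quick sketch, not part of the roadmap itself; it assumes GCC or a compatible compiler with OpenMP support. It reports the logical processor count seen by the OpenMP runtime (which may exceed the physical core count if hyperthreading is enabled on the node) and whether the AVX-512 foundation instructions are available.

    /* check_node.c: minimal node check -- compile with, e.g., gcc -fopenmp check_node.c */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        /* Logical processors visible to the OpenMP runtime: roughly the 48 cores
           of a Stampede3 SKX node or the 56 cores of a Frontera CLX node, or
           more if hyperthreading is exposed. */
        printf("Logical processors: %d\n", omp_get_num_procs());

    #if defined(__GNUC__)
        /* AVX-512 Foundation support indicates the 512-bit vector units. */
        printf("AVX-512F supported: %s\n",
               __builtin_cpu_supports("avx512f") ? "yes" : "no");
    #endif
        return 0;
    }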

The same hardware trends that led to these kinds of features are certain to continue. In future generations of Intel processors, we can expect core counts to keep growing rather than clock frequencies, and we can expect SIMD parallelism to receive even greater emphasis, because of the ongoing need to improve the key metric of flop/s/watt. Application programmers will therefore want to structure (or restructure) their computations accordingly. The general idea is to maximize parallel computation while minimizing data movement. Programmers will want to create code with the following characteristics:

  1. Scalable: code performance increases in proportion to the number of processes and threads
  2. Multithreaded: shared-memory parallelism is expressed explicitly, e.g., with OpenMP
  3. Vectorizable: any repetitive arithmetic is performed on data that are contiguous in memory
  4. Compute-intensive: data that are loaded into cache will be re-used multiple times

To achieve the above goals, it may well be necessary to transform data structures and loops in your code in order to expose the right kinds of parallelism to the compiler. The tools described in the preceding pages can be helpful for accomplishing that goal; other topics in the Cornell Virtual Workshop are available to help you as well.
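
As a concrete illustration of the first three characteristics, here is a brief sketch of the kind of loop and data-structure organization that exposes parallelism to the compiler. It is not taken from the roadmap, and the names and computation are hypothetical; the point is the structure. A struct-of-arrays layout keeps each field contiguous in memory, so the inner loop has unit stride and vectorizes cleanly, while the OpenMP directive expresses the shared-memory and SIMD parallelism explicitly.

    /* axpy_soa.c: hypothetical struct-of-arrays example -- compile with, e.g.,
       icx -O3 -qopenmp -xCORE-AVX512 axpy_soa.c          (Intel)
       gcc -O3 -fopenmp -march=skylake-avx512 axpy_soa.c  (GNU)   */
    #include <stddef.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Struct of arrays: each field is a contiguous array, giving the compiler
       unit-stride accesses that map naturally onto 512-bit vector loads. */
    typedef struct {
        double *x, *y, *z;
        size_t  n;
    } Particles;

    /* z[i] = a*x[i] + y[i]: threads split the iteration space (scalable,
       multithreaded), and each thread's chunk is processed with SIMD
       instructions (vectorizable). */
    void axpy_soa(Particles *p, double a)
    {
        #pragma omp parallel for simd
        for (size_t i = 0; i < p->n; i++)
            p->z[i] = a * p->x[i] + p->y[i];
    }

    int main(void)
    {
        size_t n = 1 << 20;                        /* hypothetical problem size */
        Particles p = { malloc(n * sizeof(double)),
                        malloc(n * sizeof(double)),
                        malloc(n * sizeof(double)), n };
        for (size_t i = 0; i < n; i++) { p.x[i] = 1.0; p.y[i] = 2.0; }

        axpy_soa(&p, 3.0);
        printf("z[0] = %g\n", p.z[0]);             /* expect 5.0 */

        free(p.x); free(p.y); free(p.z);
        return 0;
    }

Achieving the fourth characteristic, compute intensity, generally means going a step further and blocking (tiling) loops so that data brought into cache are reused several times before being evicted.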

This overall approach to programming for advanced cluster architectures is perhaps best captured by the following quote:

Worry about scaling; worry about vectorization; worry about data locality. [But] nothing's more important than this catch phrase I use: 'think parallel.' [T]here's no substitute as a programmer for really understanding where the parallelism is... Maybe [you] should think of the problem a little differently, structure the algorithm differently—that's your most powerful tool.

James Reinders, HPCwire