Vectorization & Parallelization

Chris Myers, Steve Lantz
Cornell Center for Advanced Computing

Revisions: 12/2018, 1/2018 (original)

Achieving optimal application performance on advanced clusters requires exploiting multiple levels of parallelism: the vector units attached to individual CPUs and the collective operation of multiple CPUs. Understanding vector unit architecture, and how to feed data into those units efficiently, is an important determinant of performance. Vector analysis tools, vectorization compiler options, and code directives that align data to vector registers all play a part in this process, as does analysis of the parallel scaling achieved once single-node performance has been addressed. Even after the investigators' work to characterize and optimize application performance, the results of these analyses show that there is still room for improvement.
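
To make the idea of aligning data to vector registers concrete, here is a minimal sketch in C. It is not taken from the case study code: the names alloc_aligned_doubles and scaled_add are hypothetical, and the 64-byte alignment is an assumption chosen to match a 512-bit vector register. The sketch allocates arrays on that boundary and marks a unit-stride loop as a candidate for vectorization.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumption: 64-byte alignment, matching a 512-bit vector register. */
#define VEC_ALIGN 64

/* Hypothetical helper: allocate n doubles on a VEC_ALIGN-byte boundary
   using C11 aligned_alloc. The byte count is rounded up to a multiple
   of the alignment. Release the memory with free(). */
double *alloc_aligned_doubles(size_t n)
{
    size_t bytes = n * sizeof(double);
    size_t padded = (bytes + VEC_ALIGN - 1) / VEC_ALIGN * VEC_ALIGN;
    return aligned_alloc(VEC_ALIGN, padded);
}

/* Hypothetical kernel: y[i] += a * x[i].
   Unit stride and no loop-carried dependence make this loop a good
   candidate for vectorization. The restrict qualifiers promise that
   x and y do not overlap. */
void scaled_add(size_t n, double a,
                const double *restrict x, double *restrict y)
{
    /* The aligned clause asserts that x and y start on VEC_ALIGN-byte
       boundaries; the caller must guarantee this. The directive takes
       effect only when OpenMP SIMD support is enabled at compile time
       and is otherwise ignored. */
    #pragma omp simd aligned(x, y : VEC_ALIGN)
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

With the Intel classic compilers, options such as -qopenmp-simd (to honor the directive) and -qopt-report=5 -qopt-report-phase=vec (to request a vectorization report) are one plausible way to build and inspect a loop like this; GCC offers -fopenmp-simd and -fopt-info-vec for similar purposes. Exact option names depend on the compiler version, so check the documentation for the toolchain you are using.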

Objectives

After you complete this segment, you should be able to:

Discuss vector efficiency and key vectorization metrics
Explain how using array-level syntax can help compilers generate efficient code for vector units
Describe how to configure a compiler to generate a report that summarizes vectorization performance
Discuss how to use compiler options and code directives to allocate variables so that they are aligned with vector unit boundaries

Prerequisites

There are no specific prerequisites for this topic. To apply these analyses to your own application code, however, you will need access to Intel compilers and the Intel performance analysis tools (VTune, Advisor, Parallel Studio).