Vectorization & Parallelization
Chris Myers, Steve Lantz
Cornell Center for Advanced Computing
Revisions: 12/2018, 1/2018 (original)
Achieving optimal application performance on advanced clusters involves leveraging multiple levels of parallelism, including the parallelism provided by the vector units attached to individual CPUs and by the collective operation of multiple CPUs. Understanding the architecture of vector units, and how to feed data into them efficiently, is an important determinant of performance. Vector analysis tools, vectorization compiler options, and code directives to align data to vector registers all play a part in this process, as does analysis of the parallel scaling achieved once single-node performance has been addressed. Even after the investigators have done all this work to characterize and optimize application performance, the results of these analyses can still reveal room for improvement.
Objectives
After you complete this segment, you should be able to:
- Discuss vector efficiency and key vectorization metrics
- Understand how the use of array-level syntax can help compilers generate efficient code for vector units
- Describe how to configure a compiler to generate a vectorization report that summarizes vectorization performance
- Discuss how to use compiler options and code directives to allocate variables so that they are efficiently aligned with vector unit boundaries
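As a rough illustration of the alignment objective above, the following C sketch allocates arrays on a vector-width boundary and runs a simple loop over them that a compiler may be able to vectorize. The names `VEC_ALIGN`, `is_aligned`, `alloc_aligned_doubles`, and `scaled_add` are hypothetical and do not come from this topic, and the 64-byte boundary is an assumption that matches a 512-bit vector register; other hardware may call for a different value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed boundary: 64 bytes is the width of one 512-bit vector register. */
#define VEC_ALIGN 64

/* Returns 1 if p starts on an align-byte boundary, 0 otherwise. */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p % align) == 0;
}

/* Allocates room for n doubles on a VEC_ALIGN boundary; release with free(). */
static double *alloc_aligned_doubles(size_t n)
{
    size_t bytes = n * sizeof(double);
    /* C11 aligned_alloc expects the size to be a multiple of the alignment,
       so round the request up to the next multiple of VEC_ALIGN. */
    size_t padded = (bytes + VEC_ALIGN - 1) / VEC_ALIGN * VEC_ALIGN;
    if (padded == 0) {
        padded = VEC_ALIGN;
    }
    return aligned_alloc(VEC_ALIGN, padded);
}

/* Computes y[i] = a * x[i] + y[i] for i in [0, n).
   restrict tells the compiler that x and y do not overlap, which removes
   one common obstacle to vectorizing the loop. The assert only documents
   and checks the alignment precondition; it does not by itself inform the
   compiler, which would need a compiler-specific directive or option. */
static void scaled_add(size_t n, double a,
                       const double *restrict x, double *restrict y)
{
    assert(is_aligned(x, VEC_ALIGN) && is_aligned(y, VEC_ALIGN));
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Whether the loop is actually vectorized depends on the compiler, its options, and the target hardware, so the compiler's vectorization report is the place to confirm it. Such a report can typically be requested with an option like `-qopt-report` for the Intel compilers or `-fopt-info-vec` for GCC; check the documentation for your compiler version for the exact spelling.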
Prerequisites
There are no specific prerequisites for learning the information contained in this topic. To apply these analyses to your own application code, however, you will need access to the Intel compilers and the Intel performance analysis tools (VTune, Advisor, Parallel Studio).