Vector Performance
Steve Lantz with contributions from Aaron Birkland and the Texas Advanced Computing Center
Cornell Center for Advanced Computing and Texas Advanced Computing Center
Revisions: 9/2022, 5/2021, 1/2021, 5/2018, 6/2017, 10/2013 (original)
This section describes several factors that influence the real-world performance of vectorized applications. It also offers tips and techniques that can help you get the best performance from a vectorizable application.
Two basic questions that must be answered when considering vector performance are:
- How much of my application must be vectorized before I will see a significant benefit?
- In the vectorized portions, what hinders the instructions from executing as rapidly as possible?
Objectives
After you complete this topic, you should be able to:
- Explain how Amdahl's Law applies to vectorization
- Apply Amdahl's Law to predict a code's parallel efficiency e from its vectorizable fraction P, or vice versa
- Describe how changes in clock frequency can limit parallel speedup for codes that are both multithreaded and vectorized
- Explain the role of profiling and give examples of the types of profiling that can be done with Intel's Advisor and VTune tools
- Explain how prefetching helps avoid memory latency
- Describe the memory access patterns that lead to improved performance for vectorized code
Prerequisites
- Knowledge of C and/or Fortran, plus a basic understanding of what assembly language is
- Familiarity with batch job submission on large compute clusters