Speedup
Because SIMD parallelism with vector length N performs N computations at once, the ideal speedup of a vector operation is equal to N. Vector operations with a vector length of four can potentially gain a 4x speedup over four serial computations. Likewise, doubling the vector length (e.g., from four to eight) doubles the number of computations that can occur in the same period of time, ideally halving the time needed to complete a given number of computations.
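This idealized relationship can be sketched with a simple counting model. The function below (a hypothetical helper, not part of any real SIMD API) estimates how many vector instructions are needed to cover a set of independent operations:

```python
import math

def ideal_vector_time(n_ops: int, vector_length: int) -> int:
    """Idealized cost model: each vector instruction completes
    vector_length results, so the time to cover n_ops operations
    shrinks in proportion to the vector length."""
    return math.ceil(n_ops / vector_length)

n_ops = 1024
serial = ideal_vector_time(n_ops, 1)   # 1024 steps, one result per step
vec4   = ideal_vector_time(n_ops, 4)   # 256 steps: the ideal 4x speedup
vec8   = ideal_vector_time(n_ops, 8)   # 128 steps: doubling the vector
                                       # length halves the time again
```

Of course, this model captures only the best case; real hardware adds pipeline, memory, and instruction-mix effects, as discussed next.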
As with any parallel computing construct, however, the effect of vector length on application performance as a whole is not so straightforward. As we will see in a later topic, any real-world application will contain a mixture of vector and non-vector instructions. The proportion of vector instructions present in an application will greatly affect overall application speedup, as will cache efficiency, memory bandwidth and latency, etc.
A trend in modern CPUs is for the vector length to increase with newer architectural revisions. For example, at TACC, the CPUs on the retired Ranger supercomputer had vector processing units that could operate on 2 double- or 4 single-precision values. CPUs on the original Stampede doubled this vector length to 4 double- or 8 single-precision values. Xeon Phi coprocessors on Stampede, along with the Knights Landing (KNL) processors on Stampede2, provided vector units in each core that were twice that again: 8 double- or 16 single-precision values. Subsequently, the Skylake (SKX) and Ice Lake (ICX) processors that were added to Stampede2, then incorporated into Stampede3, came with a pair of these jumbo-sized vector units per core. The availability of wider (and duplicated) vector units results in potentially much greater performance when code is vectorized, and conversely, much more to lose when code that could be vectorized is not.
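The vector lengths above follow directly from the width of the SIMD registers divided by the size of each element. Assuming the usual register widths for each instruction-set generation (128-bit for SSE-era units like Ranger's, 256-bit for AVX on the original Stampede, 512-bit for AVX-512 on KNL, SKX, and ICX), the lane counts can be recovered with simple arithmetic:

```python
def simd_lanes(register_bits: int, element_bits: int) -> int:
    """Number of values of a given size that fit in one SIMD register.
    Doubles are 64 bits; single-precision floats are 32 bits."""
    return register_bits // element_bits

# Register widths assumed for each generation discussed above:
for name, bits in [("128-bit", 128), ("256-bit", 256), ("512-bit", 512)]:
    print(f"{name}: {simd_lanes(bits, 64)} doubles or "
          f"{simd_lanes(bits, 32)} floats")
```

Each doubling of the register width doubles the lane count, matching the 2/4/8-double progression described in the text.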
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)