Because SIMD parallelism with vector length N performs N computations in parallel, the ideal speedup of a vector operation is a factor of N. A vector operation with a vector length of four can potentially gain a 4x speedup over the equivalent four serial computations. Likewise, doubling the vector length (e.g., from four to eight) doubles the number of computations that can occur in the same period of time, ideally halving the time it takes to do a given amount of work.

As with any parallel computing construct, however, the effect of vector length on application performance as a whole is not so straightforward. As we will see in a later topic, any real-world application will contain a mixture of vector and non-vector instructions. The proportion of vector instructions present in an application will greatly affect overall application speedup, as will cache efficiency, memory bandwidth, and memory latency.

A trend in modern CPUs is for the vector length to increase with newer architectural revisions. For example, the CPUs on the retired Ranger supercomputer had vector processing units that could operate on 2 double- or 4 single-precision values. The CPUs on the original Stampede doubled this vector length to 4 double- or 8 single-precision values. The Xeon Phi coprocessors on the original Stampede contained vector units in each core that were twice that again: 8 double- or 16 single-precision values. And now, the Knights Landing (KNL), Skylake (SKX), and Ice Lake (ICX) processors on Stampede2 come with a pair of these vector units per core. The wider (and duplicated) vector units result in potentially much greater performance when code is vectorized, and conversely, much more to lose when code that could be vectorized is not.

©   Cornell University  |  Center for Advanced Computing  |  Copyright Statement  |  Inclusivity Statement