Vector processing units (VPUs) perform the actual work of computing a SIMD vector operation in parallel. Many of today's server-class Intel processors, including those on Stampede2, are equipped with two VPUs per core.

[Figure: Partial block diagram of an SKX core showing the 2 VPUs in its execution engine. Source: wikichip.org]

In the past, one of these VPUs would be specialized to do multiplications, while the other would do additions and subtractions. (Other types of vector operations could also be done by the two units.) In more recent generations of Intel processors, both VPUs are capable of executing a fused multiply-add operation. This gives the core greater speed and flexibility in handling vector operations.
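For instance, a fused multiply-add computes a*b + c in a single instruction. The short sketch below is our own illustration (the function name and compile flags are assumptions, not part of this page); it uses the AVX-512 intrinsic _mm512_fmadd_pd, which either VPU can execute:

    #include <immintrin.h>

    /* Fused multiply-add on 8 doubles at once: d[i] = a[i]*b[i] + c[i].
       On recent Intel cores, either of the two VPUs can execute this
       single instruction. Illustrative sketch; compile with AVX-512
       enabled, e.g., gcc -mavx512f. */
    void fma8(const double *a, const double *b, const double *c, double *d)
    {
        __m512d va = _mm512_loadu_pd(a);          /* load 8 doubles */
        __m512d vb = _mm512_loadu_pd(b);
        __m512d vc = _mm512_loadu_pd(c);
        __m512d vd = _mm512_fmadd_pd(va, vb, vc); /* vd = va*vb + vc */
        _mm512_storeu_pd(d, vd);                  /* store 8 results */
    }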

An individual VPU can be conceptualized as N separate floating point units, each of which reads from and writes to a distinct index within the vector registers involved in the computation. Suppose we have an AVX-512 CPU with a vector length of 8 doubles. Its vector processing unit is analogous to eight separate floating point units, FP0 through FP7. Now suppose a vector instruction is issued that adds the vector in register zmm0 to the vector in register zmm1 and stores the result in zmm2. Using the index [i] to denote the i-th vector element, this action is equivalent to

FP0 performs zmm0[0] + zmm1[0] → zmm2[0],
FP1 performs zmm0[1] + zmm1[1] → zmm2[1],
...
FP7 performs zmm0[7] + zmm1[7] → zmm2[7].

The beauty of vector processing is that all of this happens in parallel.
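
The same operation can be expressed in source code with AVX-512 intrinsics. The sketch below is our own example (the function name and the use of unaligned loads are assumptions); _mm512_add_pd becomes the single vector-add instruction described above, and the compiler chooses which zmm registers to use:

    #include <immintrin.h>

    /* Elementwise sum of 8 doubles, exactly the FP0..FP7 picture above:
       out[i] = x[i] + y[i] for i = 0..7, all in parallel.
       Illustrative sketch; requires AVX-512, e.g., gcc -mavx512f. */
    void vadd8(const double *x, const double *y, double *out)
    {
        __m512d v0 = _mm512_loadu_pd(x);     /* 8 doubles into a zmm register */
        __m512d v1 = _mm512_loadu_pd(y);
        __m512d v2 = _mm512_add_pd(v0, v1);  /* one instruction, 8 additions */
        _mm512_storeu_pd(out, v2);
    }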

In ICX, SKX, and KNL processors, every core is capable of issuing enough instructions that both of the core's VPUs can be fully utilized by a single thread. Therefore, the SIMD parallelism offered by the core's vector processing units remains distinct from any parallelism that may be achieved by using multiple threads per node.

In fact, SIMD and multithreading should be viewed as complementary parallel capabilities that generally work well together. Vectorization parallelism works within a core, at the instruction level, while multithreading parallelism works across all the cores, at the process (node) level. (Note that multithreading may help vector performance indirectly: if a process launches more threads than the number of available physical cores, it may reduce the chances that any one VPU is left stalled on a single thread's memory operations.)
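
As a concrete, hypothetical sketch of how the two levels combine, the loop below assumes OpenMP 4.0 or later: the parallel for distributes iterations across cores, while simd asks the compiler to vectorize each thread's chunk for the VPUs.

    /* Threads split the iterations across cores (multithreading), and each
       thread's chunk is vectorized so the core's VPUs handle 8 doubles per
       instruction (SIMD). Illustrative sketch assuming OpenMP 4.0+;
       compile with, e.g., gcc -fopenmp -mavx512f. */
    void daxpy(int n, double a, const double *x, double *y)
    {
        #pragma omp parallel for simd
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* a natural candidate for fused multiply-add */
    }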

 