Vector Processing Units
Vector processing units (VPUs) perform the actual work of computing a SIMD vector operation in parallel. Practically all of today's server-class Intel processors are equipped with two VPUs per core.
 
In earlier processor generations, one of these VPUs was specialized for multiplications, while the other handled additions and subtractions. (Other types of vector operations could also be done by the two units.) In more recent generations of Intel processors, both VPUs are capable of executing a fused multiply-add (FMA) operation, which computes a*b + c in a single instruction. This gives the core greater speed and flexibility in handling vector operations.
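To put a number on that speedup, the peak arithmetic rate of one core can be estimated with a short calculation. This is only a sketch under stated assumptions: each VPU issues one vector instruction per cycle, a VPU holds 8 doubles (as with AVX-512), and an FMA is counted as two floating point operations. The function name `peak_flops_per_cycle` is hypothetical, introduced here for illustration.

```c
/* Rough peak floating point operations per cycle for one core.
 * Assumes every VPU issues one vector instruction each cycle,
 * which is an idealization, not a measured figure. */
int peak_flops_per_cycle(int vpus_per_core, int lanes_per_vpu,
                         int flops_per_lane)
{
    return vpus_per_core * lanes_per_vpu * flops_per_lane;
}
```

Under these assumptions, `peak_flops_per_cycle(2, 8, 2)` gives 32 double precision operations per cycle when both VPUs execute FMAs, compared with `peak_flops_per_cycle(2, 8, 1)`, or 16, when one unit multiplies and the other adds.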
An individual VPU can be conceptualized as N separate floating point units, where each one of the N is wired to read from or write to a distinct index within the vector registers that are involved in the computation. Suppose we have an AVX-512 CPU with a vector length of 8 doubles (512 bits divided by 64 bits per double). Its vector processing unit is analogous to eight separate floating point units, FP0 through FP7. Let's say a vector instruction is issued which says to add the vector in register zmm0 to the vector in register zmm1, and store the result in zmm2. Using the index [i] to denote the i-th vector element, this action is equivalent to:
FP0 performs zmm0[0] + zmm1[0] → zmm2[0],
FP1 performs zmm0[1] + zmm1[1] → zmm2[1],
...
FP7 performs zmm0[7] + zmm1[7] → zmm2[7].
The beauty of vector processing is that all of this happens in parallel.
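The lane-by-lane addition above can be modeled in plain C. This is a conceptual sketch, not actual vector code: the `vec8d` type and the `vec8d_add` function are hypothetical stand-ins for a 512-bit register and a vector add instruction, and the loop spells out sequentially what the hardware does for all eight lanes at once.

```c
#include <stddef.h>

#define LANES 8  /* 512-bit register / 64-bit double = 8 elements */

/* Hypothetical stand-in for a 512-bit vector register of 8 doubles. */
typedef struct {
    double lane[LANES];
} vec8d;

/* Model of the vector add: iteration i plays the role of unit FPi.
 * In hardware the eight additions happen together; this loop only
 * lists them one after another. */
vec8d vec8d_add(vec8d a, vec8d b)
{
    vec8d result;
    for (size_t i = 0; i < LANES; ++i) {
        result.lane[i] = a.lane[i] + b.lane[i];
    }
    return result;
}
```

Calling `vec8d_add(zmm0, zmm1)` and storing the result in a variable named `zmm2` mirrors the instruction described in the text. With optimization enabled and an AVX-512 target, a compiler may well turn a loop like this into a single vector add, though that depends on the compiler and its flags.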
In Intel Xeon processors, every core is capable of issuing enough instructions so that both of the core's VPUs can be fully utilized by a single thread. Therefore, the SIMD parallelism offered by the core's vector processing units remains distinct from any parallelism that may be achieved by using multiple threads per node.
In fact, SIMD and multithreading should be viewed as complementary parallel capabilities that generally work well together. Vectorization parallelism works within a core, at the instruction level, while multithreading parallelism works across all the cores, at the process (node) level. (Note that multithreading may indirectly help vector performance: if a process launches more threads than the number of physical cores that are available, it may reduce the chances that any one VPU becomes stalled on a single thread's memory operations.)
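One common way to request both kinds of parallelism on the same loop is the OpenMP combined construct `parallel for simd`, sketched below. This is an illustration under assumptions rather than a recommended implementation: the function name `scaled_add` is hypothetical, and the pragma takes effect only when the code is compiled with OpenMP enabled (for example, `-fopenmp` with GCC or Clang). Without that flag, the pragma is ignored and the loop simply runs serially.

```c
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i] for every i in [0, n).
 * With OpenMP enabled, iterations are divided among threads
 * (parallelism across cores), and each thread's share of the loop
 * is marked for vectorization (parallelism within a core). */
void scaled_add(size_t n, double a, const double *x, double *y)
{
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Each iteration is independent of the others, which is what makes the loop safe to split across threads and to vectorize. The multiply followed by an add is also the pattern that a compiler can map onto fused multiply-add instructions, when the target supports them.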
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)