Registers
Vector registers are the storage areas in a CPU core that hold the operands for vector computations, as well as the results. The width of the vector registers determines which SIMD instruction sets a given processor's CPUs can support. For example, in order to support the AVX instruction set, the vector registers in an Intel processor must be at least 256 bits wide; such registers can hold four 64-bit doubles or eight 32-bit floats. The image below depicts the various SIMD instruction sets that Intel has supported in its processors over time, and the vector register sizes associated with each.
Intel's "code names" for its Xeon-branded AVX-512 processors are Skylake Xeon (SKX), Ice Lake Xeon (ICX), etc. Formerly, Intel produced a distinct line of Xeon Phi processors with the code name Knights Landing (KNL), but these have been discontinued. As we will see, the various Xeon (and Xeon Phi) instruction sets are not identical, but the vector registers in all of these recent processors are 512 bits wide, allowing them to support AVX-512 instructions. This means 8 double-precision values or 16 single-precision values may be held in a single vector register, where they can take part in 512-bit vector operations.
In all AVX-512 processors, every CPU effectively has 32 of these 512-bit vector registers, named zmm0-zmm31.1 For the sake of backward compatibility, AVX-512 CPUs also support the older 128-bit and 256-bit vector widths used by SSE and AVX. (Note that when code containing SSE or AVX vector instructions executes on an AVX-512 processor, only a quarter or half of the full width of the 512-bit registers is used.)
You should be aware that the parallel SIMD hardware in a CPU (its vector unit) can operate only on data that have been "packed" and loaded into the special vector registers. This implies that SIMD instruction sets must contain more than just arithmetic instructions: they must also include instructions that collect data from system memory and pack the data into vector registers in the proper packed format, as well as instructions that unpack vectors from the registers and store them back to system memory.
However, if the requested data are already laid out in consecutive memory locations, then they do not need to undergo any special packing or unpacking. Such data can be loaded directly into, or stored from, vector registers. For this reason, a stride-1 (consecutive-access) memory pattern is preferred for vector processing, as it will give the best performance.
1. One reason for having so many architectural registers is that it allows the next set of vector operands to be loaded while the current vector result is being computed and the prior result is being stored. This is a form of pipelining. (To facilitate it, and to reduce loop overhead, the compiler will often unroll a vectorized loop by several vector increments.)
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)