Registers
Vector registers are the storage areas in a CPU core that hold the operands for vector computations, as well as the results. The width of the vector registers determines which SIMD instruction sets a given processor's CPUs can support. For example, in order to support the AVX instruction set, the vector registers in an Intel processor must be at least 256 bits wide; such registers can hold four 64-bit doubles or eight 32-bit floats. The image below depicts the various SIMD instruction sets that Intel has supported in its processors over time, and the vector register sizes associated with each.
Intel's "code names" for its Xeon-branded AVX-512 processors are Skylake Xeon (SKX), Ice Lake Xeon (ICX), etc. Formerly, Intel produced a distinct line of Xeon Phi processors with the code name Knights Landing (KNL), but these have been discontinued. As we will see, the various Xeon (and Xeon Phi) instruction sets are not identical, but the vector registers in all of these recent processors are 512 bits wide, allowing them to support AVX-512 instructions. This means 8 double-precision values or 16 single-precision values may be held in a single vector register, where they can take part in 512-bit vector operations.
In all AVX-512 processors, every CPU effectively has 32 of these 512-bit vector registers, named zmm0-zmm31.1 For the sake of backward compatibility, AVX-512 CPUs also support the older 128-bit and 256-bit vector widths used by SSE and AVX. (Note that when code containing SSE or AVX vector instructions executes on an AVX-512 processor, only a quarter or half of the full width of the 512-bit registers is used.)
You should be aware that the parallel SIMD hardware in a CPU (its vector unit) can operate only on data that have been "packed" and loaded into the special vector registers. This implies that SIMD instruction sets must contain more than just arithmetic instructions: they must also include instructions that collect data from system memory and pack the data into vector registers in the proper packed format, as well as instructions that unpack vectors from the registers and store them back to system memory.
However, if the requested data are already laid out in consecutive memory locations, then they do not need to undergo any special packing or unpacking. Such data can be loaded directly into, or stored from, vector registers. For this reason, a stride-1 (consecutive-access) memory pattern is preferred for vector processing, as it will give the best performance.
1. One reason for having so many architectural registers is that it allows the next set of vector operands to be loaded while the current vector result is being computed and the prior result is being stored. This is a form of pipelining. (To facilitate it, and to reduce loop overhead, the compiler will often unroll a vectorized loop by several vector increments.)
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)