Exposing Multiple Layers of Parallelism
As with any processor that offers a large number of computational cores, it is more important than ever to create and expose multiple layers of parallelism in an application, right down to the innermost loops. Full utilization of the VPUs of the SKX and CLX processors is impossible unless the entire code is built around large, simple loops operating on vectors of data.
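The kind of loop meant here can be sketched as follows: a single flat pass over contiguous arrays, with no branches or cross-iteration dependences, so the compiler can map it directly onto SIMD lanes. The function name and arrays are illustrative, not taken from any particular code.

```c
#include <stddef.h>

/* A minimal sketch of a vectorization-friendly inner loop:
 * y[i] += a * x[i] is one fused multiply-add per element, and every
 * iteration is independent, so the compiler can pack 16 single-precision
 * elements into each 512-bit SIMD operation. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Loops like this, repeated over many cores via MPI or OpenMP, are the building blocks that keep the VPUs busy.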
A corollary is that due attention must be paid to how data are stored, and how they move between memory and the cores, so that the VPUs, for all their processing power, never run short of vector operands to work on.
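One common illustration of why data layout matters is the contrast between array-of-structures and structure-of-arrays storage. The sketch below uses a hypothetical particle type; the point is that the structure-of-arrays form keeps each operand stream contiguous, so SIMD loads fill all lanes from one cache-friendly stream instead of striding through interleaved fields.

```c
#include <stddef.h>

/* Array-of-structures: x, y, z for one particle sit next to each other,
 * so a loop over all x values strides through memory, wasting cache and
 * vector-load bandwidth. */
struct particle_aos { float x, y, z; };

/* Structure-of-arrays: each coordinate is its own contiguous array,
 * which is the layout the VPUs want to consume. */
struct particles_soa {
    float *x;
    float *y;
    float *z;
};

/* Contiguous, unit-stride access over p->x: vectorizes cleanly. */
void scale_x_soa(size_t n, float s, struct particles_soa *p)
{
    for (size_t i = 0; i < n; i++)
        p->x[i] *= s;
}
```

The same scaling loop written against `struct particle_aos` would touch only every third float in memory, which is exactly the kind of access pattern that leaves the VPUs waiting on operands.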
As we will see, the ideal case would be to have each core execute 4 SIMD operations on roughly 100 floats every cycle (two 512-bit fused multiply-adds per cycle, each reading three 16-float vectors, or about 96 operands in all). And since SKX and CLX have many cores, there must be many such loops and many such SIMD vectors contained in the program. None of this should come as any surprise, given the architectural trends of mainstream CPUs and GPUs over the years.
Programs for SKX and CLX processors can be written in standard HPC languages such as C/C++ and Fortran, as well as higher-level languages such as Python that utilize C/C++/Fortran for numerically efficient, compiled extension modules that allow for hybrid programs mixing interpreted and compiled code. Thus, common parallelization mechanisms such as MPI and OpenMP are available. Fortunately, it is the compiler's job to take care of vectorization (perhaps guided by a few hints).
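One widely supported form such a hint can take is the OpenMP `simd` directive, sketched below on a hypothetical dot-product loop. The pragma asserts to the compiler that iterations may be combined into SIMD operations (here, with a reduction over `sum`), sparing it from having to prove that from the code alone; with a non-OpenMP compiler the pragma is simply ignored and the loop still runs correctly.

```c
#include <stddef.h>

/* An OpenMP vectorization hint: `#pragma omp simd` tells the compiler
 * this loop is safe to vectorize, and `reduction(+:sum)` tells it how
 * to combine the partial sums held in separate SIMD lanes.
 * Build with an OpenMP-aware compiler (e.g. -fopenmp or -qopenmp)
 * for the pragma to take effect. */
float dot(size_t n, const float *a, const float *b)
{
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Compiler vectorization reports (e.g. `-qopt-report` with Intel compilers or `-fopt-info-vec` with GCC) are the usual way to confirm that such hints had the intended effect.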