The Intel compiler does not leave you guessing about what it did to optimize your code. If you want to see what got optimized, and how, and why, you can tell it to produce optimization reports during compilation. On Frontera, you do it this way (on Stampede3, use icx in place of icc):

$ icc -O3 -xCORE-AVX512 -qopt-zmm-usage=high -qopt-report -qopenmp vector_ma.c -o vector_ma
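The flags -xCORE-AVX512 and -qopt-zmm-usage=high target AVX-512 and encourage the compiler to use the 512-bit zmm registers. As a rough guide to the vector lengths a report may mention for such a build, the small C sketch below computes how many array elements fit in one zmm register. The function name zmm_lanes is our own, chosen only for illustration.

```c
#include <limits.h>   /* CHAR_BIT: bits per byte */
#include <stddef.h>   /* size_t */

/* Width in bits of one AVX-512 zmm register. */
#define ZMM_BITS 512

/* Illustrative helper (not a compiler API): how many elements of
   elem_bytes bytes fit in a single zmm register. For a loop that is
   fully vectorized with zmm registers, this is the vector length one
   would typically expect the report to show. */
size_t zmm_lanes(size_t elem_bytes)
{
    return ZMM_BITS / (CHAR_BIT * elem_bytes);
}
```

For double-precision arrays this gives 8 lanes and for single precision 16, though the compiler is free to choose a different vector length for any given loop.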

The resulting optimization report will feature two main sections for each code unit:

  1. Report from: Interprocedural optimizations [ipo] (Inlined functions show up here)
  2. Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]
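To see how the two sections divide the work, you could compile a small file like the hypothetical sketch below with the same flags. The helper square() is a natural inlining candidate, so any decision to inline it would be recorded in the [ipo] section, while the loop in sum_of_squares() would be reported in the [loop, vec, par] section. Both function names are illustrative and are not part of vector_ma.c.

```c
#include <stddef.h>   /* size_t */

/* Tiny helper: a likely candidate for inlining. Inlining decisions
   for calls to it belong to the [ipo] part of the report. */
static inline double square(double x)
{
    return x * x;
}

/* Once square() is inlined, this loop contains no calls, and the
   [loop, vec, par] part of the report describes what happened to it. */
double sum_of_squares(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += square(x[i]);
    return s;
}
```

Compiling with -c (for example, icc -O3 -xCORE-AVX512 -qopt-report -c sketch.c, where sketch.c is a placeholder file name) is enough to get a report; no main program is required.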

Both the Intel and the GCC compilers have flags that let you specify which parts of the optimization report you want to see and how much detail to include. The most useful portion is often the one pertaining to vectorization:

$ icc -qopt-report=5 -qopt-report-phase=vec [...]
$ gcc -O2 -ftree-vectorize -fopt-info-vec-all [...]

The report for the vectorization phase tells you whether each loop was vectorized, along with details about how the attempt succeeded or why it failed. If you give -qopt-report a number, it sets the level of detail in the Intel compiler's report. Values range from 0 to 5; the default is 2 (equivalent to -fopt-info-vec -fopt-info-vec-missed in GCC). At level 4 and above, the Intel compiler estimates the speedup for each loop that it vectorizes. For the meanings of the levels, see icc -qopt-report-help.
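A quick way to see both outcomes in a vectorization report is to compile a file containing one loop that should vectorize and one that should not. The minimal sketch below is our own illustration, not taken from the exercise: add_arrays() has independent iterations, while running_sum() has a loop-carried dependence.

```c
#include <stddef.h>   /* size_t */

/* Independent iterations: a[i] depends only on b[i] and c[i].
   restrict promises the arrays do not overlap, so the compiler
   does not need to guard against aliasing. */
void add_arrays(double *restrict a, const double *restrict b,
                const double *restrict c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}

/* Loop-carried dependence: a[i] needs the a[i-1] produced by the
   previous iteration, so the report should list this loop as not
   vectorized. */
void running_sum(double *a, const double *b, size_t n)
{
    for (size_t i = 1; i < n; ++i)
        a[i] = a[i - 1] + b[i];
}
```

Compiling this file with -c, using either the icc or the gcc command above, should let you compare how each compiler words its success and failure messages for the same two loops.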

The best way to get an idea of what you might learn from an optimization report is to look at one. For this, it is convenient to use the source file from our earlier exercise, vector_ma.c. If you did not do so previously, download or copy-paste that code, transfer it to Frontera or Stampede3, and compile it with the command shown at the top of the page (substituting icx for icc on Stampede3). There is no need to run the code, since we only want to examine the report. After the compiler finishes, examine the report in the shell:

$ less vector_ma.optrpt

Here are some questions to think about while you inspect the file (together with the source code):

  1. What reasons did the compiler give for not vectorizing the initialization loop?
  2. Vectorization cannot occur if the result of a loop's current iteration depends on the result of a previous iteration. Which loop failed to vectorize for this reason?
  3. Did the main computational loop vectorize as expected?
  4. The default reporting level is 2. Try again with -qopt-report=4. What can you find out about the vector length and estimated speedup for the main loop?
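If you would like a smaller file to experiment with alongside vector_ma.c, the hypothetical sketch below contains an initialization loop and a multiply-add loop of the general kind the questions refer to. It is our own illustration, not the exercise source, and the function names are made up; comparing its report with the one for vector_ma.c may help you interpret both.

```c
#include <stddef.h>   /* size_t */
#include <stdlib.h>   /* rand, RAND_MAX */

/* Initialization loop: rand() keeps hidden state between calls and
   has no vector version, so the compiler generally cannot vectorize
   a loop that calls it. */
void init_arrays(double *a, double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        a[i] = (double)rand() / RAND_MAX;
        b[i] = (double)rand() / RAND_MAX;
        c[i] = 0.0;
    }
}

/* Main computational loop: a multiply-add with independent
   iterations, the kind of loop the report should show as vectorized. */
void vector_madd(double *restrict c, const double *restrict a,
                 const double *restrict b, double s, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] += s * a[i] * b[i];
}
```

Rerunning the compile with -qopt-report=4 on this file should show whether the compiler attaches a vector length and speedup estimate to vector_madd() but not to init_arrays().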
 
©   Cornell University  |  Center for Advanced Computing  |  Copyright Statement  |  Inclusivity Statement