Why not try compiling a simple code for yourself to see how it runs on either the SKX nodes of Stampede3 or the CLX nodes of Frontera? The following instructions are tailored for Frontera, but they will work on any machine with an Intel Xeon Scalable Processor.

First, choose a source code to download: either omp_hello.c or omp_hello.f90. Since you'll want to run this code on a machine with Intel Xeon Scalable Processors, such as Stampede3 or Frontera, you can either download the source file to your local machine and then transfer it to the remote system using a tool such as scp or sftp (see the example below), or you can navigate to the code and copy-paste it into a file directly on your target machine (assuming you are logged in), e.g.,

$ cat > omp_hello.c
$ ### OR ###
$ cat > omp_hello.f90

Once you have pasted the text, press Ctrl-D (at the start of a new line) to signal end-of-file and exit from cat.
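
Alternatively, if you chose the scp route, a single command run from your local machine will copy the downloaded file to your home directory on Frontera; here, myusername is a placeholder for your own username (substitute the Stampede3 login address if that is your target):

$ scp omp_hello.c myusername@frontera.tacc.utexas.edu:
$ ### OR ###
$ scp omp_hello.f90 myusername@frontera.tacc.utexas.edu: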

You're now ready to compile the code by invoking one of the commands from the previous page. Since we're focusing on the Intel Xeon Scalable Processors with their 512-bit vector registers, we'll compile for the CORE-AVX512 architecture. Assuming you have a Frontera account, you can use an Intel compiler to do this (use icx for C/C++ on Stampede3):

$ icc -xCORE-AVX512 -O3 -qopenmp omp_hello.c -o omp_hello
$ ### OR ###
$ ifort -xCORE-AVX512 -O3 -qopenmp omp_hello.f90 -o omp_hello

The next goal is to run the code. You can do this by using Slurm's sbatch command to submit the following script to the "development" queue on Frontera for fast turnaround (or to skx-dev on Stampede3). Alternatively, you can enter the very same commands interactively on your target machine, skipping the Slurm options. For example, you may choose to open an interactive session on a Frontera compute node using idev -p development so you can enter the commands directly into a shell there.
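
The exact script supplied with the exercise may differ in its details, but a minimal sketch along these lines should work; the job name and time limit are illustrative, and the account name given to -A is a placeholder:

#!/bin/bash
#SBATCH -J omp_hello        # job name
#SBATCH -p development      # Frontera queue for fast turnaround (skx-dev on Stampede3)
#SBATCH -N 1                # one node
#SBATCH -n 1                # one task, i.e., a single OpenMP process
#SBATCH -t 00:05:00         # a few minutes is plenty
#SBATCH -A myproject        # placeholder: an allocation you are permitted to charge

export OMP_NUM_THREADS=56   # one thread per physical core on a Frontera CLX node
./omp_hello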

If you do submit the above as a batch script (you can name it run_omp_hello.sh, perhaps), remember to change the account name to one that you are allowed to use. Output from your job will be saved to a file named slurm-######.out.
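
Once the script is saved and edited, submission is a single command:

$ sbatch run_omp_hello.sh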

Note that on the compute nodes, TACC sets the default value of OMP_NUM_THREADS to 1. Otherwise, an OpenMP code would fork 56 threads by default, one for each hardware thread, i.e., one per core (the count returned by the call to omp_get_num_procs()). This default would create a particular hazard for MPI codes that are also multithreaded with OpenMP. Why? A hybrid code of this type might have multiple MPI processes running on a single CLX node; if each process were to follow the OpenMP default of forking 56 threads, the node would quickly become oversubscribed and bogged down.

In the exercise, we reset OMP_NUM_THREADS to 56 to match the number of physical cores on a Frontera CLX node. Matching the number of threads to the number of physical cores is often the best choice when running a single OpenMP process, even if hyperthreading is enabled on the node. So we should get 56 hellos, or (in principle) one hello from every core. (On Stampede3, in the skx-dev queue, the appropriate value for OMP_NUM_THREADS would be 48.)
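
If you take the interactive route via idev instead, the equivalent steps can be typed directly at the compute node's shell prompt (substitute 48 on a Stampede3 SKX node):

$ export OMP_NUM_THREADS=56
$ ./omp_hello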

 