Fortran DO CONCURRENT
Fortran programmers, too, can avail themselves of language-standard features for accelerating their computations on GPUs. Recent compilers—including the nvfortran compiler in the NVIDIA HPC SDK—have the ability to offload DO CONCURRENT constructs automatically so they run on GPU hardware. Since DO CONCURRENT is part of the ISO Fortran 2008 standard, codes that make use of it do not depend on extensions, directives, or third-party libraries beyond the Fortran language itself.
The DO CONCURRENT construct tells the compiler that all the iterations of the DO loop are independent and can be executed concurrently. It therefore lets the programmer express loop-level parallelism, one of the main forms of parallelism in any language. Because many common array computations are implemented as loops, the same loop-level parallelism can be applied to Fortran array intrinsics and array syntax as well, provided the compiler is accompanied by the appropriate libraries. In fact, nvfortran automatically accelerates standard Fortran array intrinsics and array syntax on NVIDIA tensor cores, on GPUs that are equipped with them.
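For a simple illustration of the loop form, an element-wise update whose iterations are independent can be written as follows (the names x, y, a, and n are placeholders):

    ! Each iteration touches a different element of y, so the
    ! iterations are independent and may execute concurrently.
    do concurrent (i = 1:n)
       y(i) = a * x(i) + y(i)
    end do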
GPU acceleration of DO CONCURRENT loops is enabled with the -stdpar command-line option to nvfortran. This option gives the compiler permission to parallelize the DO CONCURRENT loops and offload them to the GPU. Exactly as we saw with nvc++, any necessary data movement between host and device memory is managed automatically through CUDA Unified Memory.
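For example, a source file saxpy.f90 (the file name is just a placeholder) could be compiled for the GPU like this; -Minfo asks the compiler to report which loops it parallelized:

    nvfortran -stdpar -Minfo -o saxpy saxpy.f90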
If you change a DO loop to DO CONCURRENT, the compiler will parallelize the loop as though no data dependencies exist between loop iterations—even if such dependencies really do exist. In that case, the compiled code is likely to produce incorrect results. It is your responsibility, as the programmer, to ensure that a given loop is truly safe to parallelize.
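For instance, a recurrence like the following hypothetical loop reads a value written by the previous iteration, so it must not be rewritten as DO CONCURRENT:

    ! Each iteration depends on the result of the previous one;
    ! converting this loop to DO CONCURRENT would give wrong answers.
    do i = 2, n
       x(i) = x(i) + x(i-1)
    end do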
Note that the auto-parallelizing capabilities of the NVIDIA compilers are not limited to GPUs. nvfortran can just as easily target a multi-core CPU with the following command:
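For example (the file names are placeholders):

    nvfortran -stdpar=multicore -o myprog myprog.f90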
Moreover, it is possible to compile a program to run on either a CPU or a GPU with this command:
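One plausible invocation is shown below; the exact spelling of the option may vary with the HPC SDK version, so consult the nvfortran documentation for your release:

    nvfortran -stdpar=gpu,multicore -o myprog myprog.f90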
If your system has a GPU, the program runs on the GPU; if it doesn't have one, the program runs on the CPU.
Examples of accelerating Fortran with DO CONCURRENT
In this section, we make some recommendations for getting optimal results when expressing parallelism in standard Fortran using DO CONCURRENT.
Nested loops
Applying DO CONCURRENT to a simple DO loop is generally straightforward. However, nested loops are also a common code pattern encountered in HPC applications. An example might look like the following:
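(In this sketch the arrays a, b, and c and the bounds n and m are placeholders; only the outer loop is expressed as DO CONCURRENT.)

    do concurrent (j = 1:m)
       do i = 1, n
          c(i,j) = a(i,j) + b(i,j)
       end do
    end do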
Is there some way to express more parallelism here? It turns out to be straightforward to write such patterns with a single DO CONCURRENT statement, as in the following example. Not only is the code easier to read, but the compiler has more information available to it for optimization:
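(This is the same placeholder computation, now with both indices in a single DO CONCURRENT header so that all n*m iterations are exposed as parallel work.)

    do concurrent (j = 1:m, i = 1:n)
       c(i,j) = a(i,j) + b(i,j)
    end do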
If you compile this code with -stdpar -Minfo, you will see how the compiler performs the parallelization: it collapses the two loops contained in the DO CONCURRENT and parallelizes them together across the thread blocks and threads of the GPU kernel.
Array assignments
The compiler can automatically detect and parallelize any Fortran array assignments that are present in the DO CONCURRENT loop, even ones that are based on array slices. Here is an example:
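(A schematic example; the array names and the slice pattern are placeholders.)

    ! The assignment below operates on whole 2-D slices of 3-D arrays.
    do concurrent (k = 1:nz)
       a(:, :, k) = b(:, :, k) + c(:, :, k)
    end do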
If we again look at the parallelization message, we see that the compiler first parallelizes the DO CONCURRENT loop across the thread blocks, then parallelizes the array assignment across the threads, using a multi-dimensional kernel.
Data localities with DO CONCURRENT
Another requirement in HPC programs is data privatization. This can be expressed with the standard Fortran DO CONCURRENT statement:
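(Schematically, with placeholder loop bounds:)

    do concurrent (i = 1:n) locality-spec
       ! loop body
    end do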
In that statement, the locality-spec value is one of the following: local(var-list), local_init(var-list), shared(var-list), or default(none). These locality specifiers were added in the Fortran 2018 standard.
Each thread of the DO CONCURRENT loop can have its own copy of an array through the local specifier. If you also want that copy to start out with the variable's value on entry to the loop, you can use local_init instead. In the following example, the compiler creates a private copy of vals for each processing unit executing the outer loop and can place vals in the faster CUDA shared memory.
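(A sketch of this pattern; the arrays a, b, and c and the bounds n and m are placeholders, and vals is assumed to be an ordinary array declared in the enclosing scope.)

    do concurrent (j = 1:m) local(vals)
       ! Each iteration of the outer loop gets its own private copy of vals.
       do i = 1, n
          vals(i) = a(i,j) * b(i,j)
       end do
       c(j) = sum(vals(1:n))
    end do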
However, you should be careful when privatizing variables. If the compiler cannot place them in shared memory, they will instead be created in the GPU's global memory, which can hurt performance. For this reason, it is important to pay attention to the -Minfo output when privatizing variables.
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)