Parallelization Directives
In the previous pages, we introduced the target directive, which offloads a computation to the device but, used on its own, runs it on a single thread. That wastes a GPU, which is designed for massive parallelism across tens of thousands of threads. In the following pages, we will cover the OpenMP directives that achieve parallelism on the device: teams, distribute, parallel, for/do, and loop.
Before getting into the specifics of the directives, it is worth reviewing how an NVIDIA GPU is organized. The NVIDIA B200 GPUs on Horizon use the Blackwell architecture, which contains 160 Streaming Multiprocessors (SMs). Similarly, the NVIDIA H200 GPUs on Vista use the Hopper architecture, which contains 144 SMs. Inside each SM is a set of CUDA cores capable of running multiple threads simultaneously. In both Blackwell and Hopper, for example, each SM has 128 CUDA cores for FP32 calculations; at the SM counts above, that comes to 160 × 128 = 20,480 FP32 cores for Blackwell and 144 × 128 = 18,432 for Hopper. Using the GPU well therefore requires parallelism at two levels: across SMs, and across the CUDA cores within each SM.
In NVIDIA CUDA terminology, work is organized into blocks of threads. Blocks are assigned to SMs; the threads making up a given block execute on the CUDA cores of their assigned SM. OpenMP uses an analogous model, but with slightly different names: a CUDA block corresponds to an OpenMP team, and CUDA threads correspond to OpenMP threads within a team.
| GPU Hardware | CUDA | OpenMP |
|---|---|---|
| SMs | blocks | teams |
| cores | threads | threads |
In the next pages, we will go into more detail on the teams, distribute, parallel, and for/do directives that parallelize work on the device.
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)