
C++ stdpar Execution

Standard C++ is often a preferred language for implementing high-performance scientific applications. Its appeal has been enhanced by recent revisions of the ISO C++ standard, which introduced parallel versions of the standard algorithms that a compiler can map onto accelerators. Although this approach may not yield top performance, it can help strike a balance between programmer productivity and computational efficiency. Several production-quality C++ compilers are available, such as clang++ and its derivatives from various vendors, and the nvc++ compiler in the NVIDIA HPC SDK.

With the C++17 standard, the suite of algorithms in the C++ Standard Template Library (STL) underwent a significant extension. Among the additions was a set of execution policies intended to apply to various computing architectures, including multi-core x86 systems and GPUs. As a result, most of the existing STL algorithms gained an overload that takes an execution policy as its first argument. Specifying a policy lets programmers state how much parallelism an algorithm is permitted to use, which can result in performance improvements for the computational task. The standard execution policies are (std::execution::unseq was added in C++20; the other three date from C++17):


std::execution::seq
std::execution::unseq
std::execution::par
std::execution::par_unseq

The first policy requires the algorithm to run sequentially, while the remaining three permit some form of parallel execution: unseq allows SIMD-style vectorization within a single thread, par allows the work to be divided among parallel threads, and par_unseq allows both. A policy is a permission rather than a guarantee, so an implementation may still run the algorithm sequentially. Currently, only the nvc++ compiler supports offloading the C++ parallel algorithms to NVIDIA GPUs. Parallel execution on GPUs is enabled with the -stdpar option.
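As an illustration of the extra policy argument, the following minimal sketch applies std::execution::par_unseq to std::transform. The function name saxpy and its signature are hypothetical choices made for this example; only the policy and the standard algorithm come from the C++ standard library.

```cpp
#include <algorithm>
#include <cassert>
#include <execution>
#include <vector>

// Hypothetical helper (not part of any library): computes y[i] = a * x[i] + y[i].
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());  // both ranges must have the same length

    // The execution policy is the first argument; the rest of the call is
    // the same as for the sequential std::transform overload.
    // The scalar a is captured by value so the lambda carries its own copy.
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(),   // first input range
                   y.begin(),            // second input range
                   y.begin(),            // output range (in place)
                   [a](float xi, float yi) { return a * xi + yi; });
}
```

A program that calls this function would be compiled with nvc++ -stdpar to request GPU offloading. Without that option, or with another conforming compiler, the same source should still be valid standard C++ and run on the CPU.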

Note that nvc++ relies on CUDA Unified Memory to move data automatically between the CPU and the GPU. Consequently, any pointer or object that is dereferenced inside a parallel algorithm invocation must refer to data in the CUDA-managed CPU heap, rather than the CPU stack or GPU memory; otherwise, the program may fail at run time. Even with this limitation, developing code for this type of offloading is largely similar to standard C++ programming.
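The heap requirement can be sketched as follows. The function scaled_index_sum is a hypothetical example, and the comments describe the behavior one would expect under nvc++ -stdpar given the restriction stated above; they are assumptions for illustration rather than a statement of how every compiler version behaves.

```cpp
#include <algorithm>
#include <cstddef>
#include <execution>
#include <numeric>
#include <vector>

// Hypothetical helper: builds the values scale * i for i in [0, n)
// and returns their sum.
double scaled_index_sum(std::size_t n, double scale) {
    // std::vector stores its elements on the heap, so this data is
    // eligible for automatic migration to the GPU. A local C array or
    // std::array would live on the CPU stack and should be avoided here.
    std::vector<double> data(n);
    std::iota(data.begin(), data.end(), 0.0);  // sequential fill: 0, 1, 2, ...

    // scale is a stack variable, so it is captured by value; capturing it
    // by reference would make the lambda dereference a stack address.
    std::for_each(std::execution::par_unseq, data.begin(), data.end(),
                  [scale](double& v) { v *= scale; });

    // Parallel reduction over the same heap-allocated data.
    return std::reduce(std::execution::par_unseq,
                       data.begin(), data.end(), 0.0);
}
```

Capturing by value and keeping the data in a standard container are ordinary C++ habits, which is why code written this way tends to need few changes for offloading.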


© | Cornell University | Center for Advanced Computing | Copyright Statement | Access Statement
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)