Alpaka

Alpaka is an open-source, header-only C++ parallelization library. Its API is similar to CUDA's, and it operates at a comparable level, functioning as an abstraction layer between the application and the vendor-specific platform. Unlike CUDA, though, Alpaka supports a broad range of underlying programming models from different providers. Its portability derives from its use of modern C++ template metaprogramming: kernel functions are templated on an accelerator type, which is resolved at compile time to produce a version of the kernel for each execution backend. Since Alpaka is a header-only library, it relies on the compiler to "do the right thing" with the source code in the library, as well as in the application.

One difference between Alpaka and CUDA is that Alpaka adds an abstraction level called "elements" at the bottom of its parallel hierarchy, below threads, which allows each thread to process multiple data elements. Including this extra level enables compilers to take advantage of SIMD vector registers when compiling for CPU backends. Thus, a thread block with N threads on a GPU is more likely to vectorize well on a CPU if it is rearranged into 1 thread with N elements. If the CPU has a vector width of NV in the relevant datatype, an even better arrangement might be N/NV threads, each with NV elements.

Coding with Alpaka tends to be similar to CUDA development, since Alpaka has intentionally preserved many concepts from CUDA. However, Alpaka relies heavily on C++ templating, which can lead to more verbose code, as well as more obscure error messages during debugging. Nevertheless, Alpaka can often achieve performance close to that of CUDA after some amount of tuning. Just as in CUDA, some of this optimization comes from making informed choices about how many thread blocks, threads, and elements to specify.

Kokkos

Like Alpaka, Kokkos serves as a single-source C++ template metaprogramming (TMP) library, with the goal of abstracting away the complexities of different vendor-specific programming paradigms. Kokkos provides multiple backends to its abstraction layer, which are implemented as template libraries atop various programming models, including CUDA, HIP, OpenMP, OpenACC, and SYCL. Template specialization allows Kokkos to generate device-specific code, including optimizations, for heterogeneous hardware architectures.

The principal way in which Kokkos departs from Alpaka is in its emphasis on descriptive rather than prescriptive parallelism. Kokkos requires developers to formulate their algorithms in terms of general parallel programming concepts, which it maps to the hardware through the Kokkos framework. The Kokkos programming model starts with a templated abstraction for multidimensional arrays (Kokkos::View), which allows the library to manage efficient data layouts for CPUs and GPUs. Data are processed through abstract execution patterns (parallel_for, parallel_reduce, and parallel_scan), each of which may be executed under one of three execution policies: RangePolicy for a simple parallel loop, MDRangePolicy for nested parallel loops, and TeamPolicy for a hierarchy of nested parallel loops.

Translating CUDA code to Kokkos can be relatively straightforward due to resemblances between the Kokkos and CUDA memory and execution models. However, Kokkos does enforce certain restrictions on C++ template programming which can add to code complexity. Also, achieving efficient execution on both CPU and GPU may require tuning the team size and vector length in the TeamPolicy, as the Kokkos defaults may be suboptimal.

©  |   Cornell University    |   Center for Advanced Computing    |   Copyright Statement    |   Access Statement
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)