Python has become the most popular—and portable—top-level language for writing programs that are accelerated by GPUs. For artificial intelligence (AI) applications especially, writing code in Python has become almost a requirement. This is due in large part to the extensive, and still growing, collection of compiled-code libraries that are meant to be callable from Python. However, tools also exist for converting Python routines to compiled-code equivalents that will run on GPU platforms, especially CUDA. A sampling of some of the more widely used Python-based approaches appears below.

General-purpose GPU programming

There are several powerful tools to help you carry out general-purpose GPU computing (GPGPU) in Python. Generally they have the best support on NVIDIA's CUDA platform.

  • Numba is the go-to tool for coding GPGPU applications in Python. Developers employ a simple decorator syntax to indicate which Python functions are to be "just-in-time" (JIT) compiled to run on the GPU or the CPU; a brief sketch follows this list. This makes Numba especially useful for accelerating numerical computations with minimal code changes. Here is the current set of platforms supported by Numba:
    • CUDA (NVIDIA GPUs): This is the most mature and feature-rich backend for Numba. It enables the developer to write full GPU kernels in Python, using the decorator @cuda.jit. The CUDA target for Numba is now maintained by NVIDIA.
    • ROCm (AMD GPUs): Numba has only experimental support for AMD GPUs via the ROCm platform. Usage is similar to CUDA, but with the @roc.jit target. This backend is still under development, so it may not support all features or hardware.
    • CPUs via LLVM: Numba can compile Python functions to optimized machine code for CPUs using LLVM. This is done using the @jit or @njit decorators. Numba also supports SIMD vectorization, and it can parallelize code across CPU threads using @njit(parallel=True).
  • JAX, developed by Google, combines NumPy-like syntax with automatic differentiation and GPU/TPU acceleration through the XLA (Accelerated Linear Algebra) compiler. It's particularly popular in cutting-edge research and reinforcement learning (see below). JAX also offers simple primitives for scaling across multiple GPUs. The state of JAX's backends is similar to that of Numba's, with CUDA being the most developed.
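
To make the decorator-based approach concrete, here is a minimal Numba sketch, assuming NumPy, Numba, and (for the GPU path) an NVIDIA GPU with the CUDA toolkit are available; the function names and array sizes are arbitrary choices for this example.

    import numpy as np
    from numba import njit, cuda

    # CPU path: compiled to optimized machine code via LLVM,
    # with automatic parallelization across CPU threads.
    @njit(parallel=True)
    def scale_cpu(x, factor):
        return x * factor

    # GPU path: a CUDA kernel written in Python; each thread scales one element.
    @cuda.jit
    def scale_gpu(x, factor, out):
        i = cuda.grid(1)
        if i < x.size:
            out[i] = x[i] * factor

    x = np.arange(1_000_000, dtype=np.float64)
    out = np.zeros_like(x)

    scale_cpu(x, 2.0)                                   # runs on the CPU
    threads_per_block = 256
    blocks = (x.size + threads_per_block - 1) // threads_per_block
    scale_gpu[blocks, threads_per_block](x, 2.0, out)   # runs on the GPU

A JAX version of the same kind of workflow looks much like ordinary NumPy code; the sketch below assumes a GPU-enabled jaxlib installation (JAX silently falls back to the CPU otherwise), and the loss function is just an illustrative example.

    import jax
    import jax.numpy as jnp

    # A NumPy-like loss function: mean squared error of a linear model.
    def loss(w, x, y):
        pred = jnp.dot(x, w)
        return jnp.mean((pred - y) ** 2)

    # Differentiate and JIT-compile with XLA; runs on the GPU if one is available.
    grad_loss = jax.jit(jax.grad(loss))

    x = jnp.ones((1024, 16))
    y = jnp.zeros(1024)
    w = jnp.zeros(16)
    g = grad_loss(w, x, y)    # gradient with respect to w, computed on the default device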

Other options are even more tightly focused on CUDA, so the resulting code runs very well on NVIDIA hardware but is not directly portable to other GPU platforms. The first two are C-like alternatives to Numba that work only on NVIDIA GPUs:

  • PyCUDA provides a more-or-less direct interface to NVIDIA CUDA. It allows you to write CUDA kernels in C-like syntax within Python (see the sketch after this list). Compared to Numba, PyCUDA is more complex, but it lets you manage GPU memory explicitly and offers more fine-grained control over GPU execution.
  • CUDA Python is NVIDIA's own version of the PyCUDA idea. It has the advantage of being "standardized" by NVIDIA; in contrast, PyCUDA is community-based. CUDA Python interoperates with Numba, PyCUDA, and CuPy, as well as many NVIDIA tools.
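
For concreteness, here is a minimal PyCUDA sketch of the workflow just described, assuming an NVIDIA GPU with the CUDA toolkit and PyCUDA installed; the kernel itself is an arbitrary example that doubles each element of an array.

    import numpy as np
    import pycuda.autoinit                  # creates and manages a CUDA context
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    # Compile a CUDA C kernel at run time.
    mod = SourceModule("""
    __global__ void double_array(float *out, const float *in)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        out[i] = 2.0f * in[i];
    }
    """)
    double_array = mod.get_function("double_array")

    a = np.random.randn(256).astype(np.float32)
    result = np.empty_like(a)

    # drv.In/drv.Out handle the host-to-device and device-to-host copies.
    double_array(drv.Out(result), drv.In(a), block=(256, 1, 1), grid=(1, 1))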

There are also two CUDA-based alternatives to JAX that will work on NVIDIA GPUs as well as host CPUs. They enable GPU acceleration for array operations with minimal changes to existing NumPy code, which makes them suitable for scientific computing:

  • CuPy is an array library that is meant to be a drop-in replacement for NumPy for computing on NVIDIA GPUs (see the sketch after this list). CuPy leverages CUDA libraries such as cuTENSOR and cuDNN for performance gains. Operations run on the GPU when the data is held in CuPy arrays, while ordinary NumPy arrays continue to be processed on the CPU, so code can dispatch based on where the data resides.
  • cuPyNumeric is again NVIDIA's own version of the CuPy idea. cuPyNumeric builds on NVIDIA's Legate framework and the Legion runtime to provide automatic parallelism and acceleration across CPUs and GPUs. According to NVIDIA, this means that with minimal code changes, users can scale their NumPy workflows to run on multiple GPUs, or even supercomputers, to take advantage of distributed parallel computing.
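
A minimal CuPy sketch, assuming a CuPy build matching the locally installed CUDA version, might look like the following; the array size and operations are arbitrary.

    import numpy as np
    import cupy as cp

    x_cpu = np.random.rand(1_000_000)       # ordinary NumPy array in host memory

    x_gpu = cp.asarray(x_cpu)               # copy the data to GPU memory
    y_gpu = cp.sqrt(x_gpu) + cp.sin(x_gpu)  # NumPy-like operations, executed on the GPU
    total = float(y_gpu.sum())              # reductions also run on the GPU

    y_cpu = cp.asnumpy(y_gpu)               # copy the result back to host memory

Apart from the explicit transfers at the boundaries (cp.asarray and cp.asnumpy), the code is identical to its NumPy equivalent.
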
Deep learning and AI frameworks

Python dominates the AI landscape, and nearly all major deep learning frameworks in Python are designed to leverage GPU acceleration. The two most popular frameworks support automatic GPU utilization; they also offer APIs for managing devices and memory explicitly.

  • TensorFlow, developed by Google, is built around a computation graph model (static graphs in TensorFlow 1.x, with eager execution the default since TensorFlow 2). It supports GPU execution out of the box and includes tools like XLA (Accelerated Linear Algebra) and TensorRT for further optimization.
  • PyTorch is known for its dynamic computation graph and intuitive syntax. It allows developers to move tensors and models to the GPU using to('cuda'), as shown in the sketch after this list. It's widely used in academic research as well as in production.
  • JAX belongs in this category, too, since it serves as a framework for machine learning research as well as for high-performance numerical computing. As mentioned, it combines NumPy-like APIs with automatic differentiation and XLA.
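
As a brief illustration of explicit device management in PyTorch, here is a minimal sketch, assuming a CUDA-enabled build of PyTorch; the tensor shapes and the linear model are arbitrary.

    import torch

    # Pick the GPU if a CUDA device is visible, otherwise fall back to the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    x = torch.randn(64, 128).to(device)          # move a tensor to the chosen device
    model = torch.nn.Linear(128, 10).to(device)  # move a model's parameters as well

    y = model(x)                                 # the forward pass runs on that device
    print(y.device)

TensorFlow provides analogous facilities, such as tf.config.list_physical_devices('GPU') for inspecting available GPUs and tf.device contexts for pinning operations to a particular device.
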
Data science and machine learning

  • RAPIDS is a suite of open-source libraries developed by NVIDIA for GPU-accelerated data science. Its tools are designed to work together and integrate with other Python libraries like Dask for distributed computing. RAPIDS includes:
    • cuDF: A GPU DataFrame library similar to pandas (see the sketch after this list).
    • cuML: A collection of GPU-accelerated machine learning algorithms.
    • cuGraph: Graph analytics on GPUs.
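
As a brief illustration of the RAPIDS libraries, here is a minimal cuDF sketch, assuming a working RAPIDS installation and an NVIDIA GPU; the data and column names are made up for the example.

    import cudf

    # Build a small DataFrame in GPU memory.
    df = cudf.DataFrame({
        "key":   ["a", "b", "a", "b", "c"],
        "value": [1.0, 2.0, 3.0, 4.0, 5.0],
    })

    # pandas-like operations, executed on the GPU
    means = df.groupby("key").mean()
    print(means)

    # Convert to a pandas DataFrame when host-side libraries need the data.
    pdf = df.to_pandas()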
 