In the default CPython interpreter, Python source is translated to bytecode, which is then executed by the interpreter. A variety of tools aim to bypass at least some of this interpretation overhead, using compiler technologies to generate either C code that can subsequently be compiled, or machine code directly, in ways that interoperate with the CPython interpreter. These include:

  • Numba: a Python just-in-time (jit) compiler that translates annotated functions or regions of Python code into optimized machine code on the fly, using the LLVM compiler library
  • Cython: an optimizing static compiler for both the Python programming language and an extended superset of Python (also called Cython) that introduces C-like language elements, such as type declarations, to assist with compilation and optimization

While compiled third-party libraries such as NumPy, SciPy, and Pandas encourage the use of aggregated, array-level operations instead of Python loops and explicit indexing, the compilation frameworks described here somewhat ironically reverse that advice. Although the manipulation of NumPy arrays is generally both compact and performant, it can result in the creation of temporary arrays that slow down code. By unrolling loops, exposing the innards of an array-based computation, and unleashing Python-aware compilers on that code, Numba- or Cython-based codes can exceed the performance of a NumPy-only approach. There are no hard-and-fast rules, however, as to when, whether, and to what extent a Numba- or Cython-based code will outperform a NumPy-only code, so empirical benchmarking is crucial. Moreover, in some use cases Numba and Cython are either not applicable or unable to improve performance.

Numba

Numba aims to provide a rather minimal interface, enabling jit compilation without much need for additional annotation. It works best with pure Python code, or with Python+NumPy code, in which loops or nested loops have been unrolled. The jit functionality can be applied as-is, without any restructuring of code, as in the following simple examples, which all perform the same operation: producing a new array as a linear combination of three input arrays (x + 2*y + 3*z). The five functions defined in the code block below are:

  • f1: using NumPy-based array operations
  • f2: using nested loops and array indexing
  • f3: jit-compiled version of f2
  • f4: parallelized version of f3 using prange
  • f5: parallelized version of f1
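
A sketch of that code block, consistent with the descriptions above (treating the inputs as 2-D NumPy arrays of a common shape is an assumption):

    import numpy as np
    from numba import jit, prange

    def f1(x, y, z):
        # NumPy array operations; 2*y, 3*z, and each intermediate sum
        # allocate temporary arrays
        return x + 2*y + 3*z

    def f2(x, y, z):
        # nested loops with explicit indexing over 2-D arrays
        result = np.empty_like(x)
        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                result[i, j] = x[i, j] + 2*y[i, j] + 3*z[i, j]
        return result

    @jit
    def f3(x, y, z):
        # jit-compiled version of f2
        result = np.empty_like(x)
        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                result[i, j] = x[i, j] + 2*y[i, j] + 3*z[i, j]
        return result

    @jit(nopython=True, parallel=True)
    def f4(x, y, z):
        # parallelized version of f3: prange distributes the outer loop
        # across multiple threads
        result = np.empty_like(x)
        for i in prange(x.shape[0]):
            for j in range(x.shape[1]):
                result[i, j] = x[i, j] + 2*y[i, j] + 3*z[i, j]
        return result

    @jit(nopython=True, parallel=True)
    def f5(x, y, z):
        # parallelized version of the array-based f1
        return x + 2*y + 3*z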

The last three functions are annotated with the @jit decorator, indicating that they should be jit-compiled. The last two functions are additionally decorated with the options @jit(nopython=True, parallel=True).

The option nopython=True indicates that Numba should compile the decorated function so that it runs entirely without the involvement of the Python interpreter, and should raise an exception if it is not able to do so. If this option is not provided, Numba will first try to compile in nopython mode, and then fall back to the slower object mode if that is not possible. Performance improvements are most substantial when Numba is able to compile in nopython mode.
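
For instance, Numba cannot type pandas objects, so a function like the following (a deliberately failing illustration, assuming pandas is installed) raises a TypingError when compiled with nopython=True, rather than silently falling back to object mode:

    import pandas as pd
    from numba import jit

    @jit(nopython=True)
    def mean_with_pandas(values):
        # pandas calls cannot be compiled in nopython mode, so calling
        # this function raises a TypingError
        return pd.Series(values).mean()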

The option parallel=True indicates that Numba should compile the decorated function so that it runs in a multithreaded fashion, utilizing the multiple cores available on your machine.

We can compare the timing information for these five equivalent computations; the numbers can vary significantly depending on processor architecture and the current load on a machine.
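
One way to collect such timings is with IPython's %timeit magic, as in the following sketch (the array shape is an arbitrary choice; each jit-compiled function is called once beforehand so that compilation time is excluded from the measurements):

    import numpy as np

    rng = np.random.default_rng(0)
    x, y, z = (rng.random((2000, 2000)) for _ in range(3))

    # trigger compilation of the jit-decorated functions
    f3(x, y, z); f4(x, y, z); f5(x, y, z)

    # then, in IPython or a Jupyter notebook:
    # %timeit f1(x, y, z)
    # %timeit f2(x, y, z)
    # %timeit f3(x, y, z)
    # %timeit f4(x, y, z)
    # %timeit f5(x, y, z)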

By Numba-compiling an unrolled version of the array calculation (f3), we can achieve a substantial speedup as compared to the original, NumPy-based function (f1). This is probably due, at least in part, to the fact that f3 does not involve temporary array creation as f1 does; rather, each element of the result array is filled in once through the nested loop. In addition, the availability of multiple cores allows for some speedup due to parallel execution, even for a parallelized version of the NumPy-based function (f5).

Cython

Cython requires somewhat more effort to use, and to use effectively. Additional tools continue to be developed to lower the barriers to its use, such as the %%cython cell magic available in IPython and Jupyter notebooks (after one has run %load_ext Cython). The Cython superset of the Python language looks like Python with C types added, and in some cases the resulting code can run roughly as fast as pure C.
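
A minimal sketch of that notebook workflow (add_ints is a made-up example function):

    # first cell: load the extension
    %load_ext Cython

    # subsequent cell (%%cython must be the first line of its cell)
    %%cython
    def add_ints(int a, int b):
        # C-typed arguments allow Cython to generate efficient C code
        return a + b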

The following code example is adapted from the Cython tutorial. It defines a function for computing a specified number of prime numbers, and is almost the same as the equivalent pure Python code, but includes a few additional type declarations (for the function input nb_primes, as well as for a few variables used internally).
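
A version along the lines of the tutorial's primes function (the fixed buffer size of 1000 is part of that tutorial's example):

    %%cython
    def primes(int nb_primes):
        cdef int n, i, len_p
        cdef int[1000] p  # fixed-size C array to hold the primes found

        if nb_primes > 1000:
            nb_primes = 1000

        len_p = 0  # number of primes found so far
        n = 2
        while len_p < nb_primes:
            # n is prime if it has no divisor among the primes found so far
            for i in p[:len_p]:
                if n % i == 0:
                    break
            else:
                p[len_p] = n
                len_p += 1
            n += 1

        # copy the C array into a Python list to return it
        return [prime for prime in p[:len_p]]

When run through Cython and timed, this function can be seen to be approximately 20 times faster than the equivalent pure Python code.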

Further information

In a series of blog posts (Numba vs. Cython, Numba vs. Cython: Take 2, and Optimizing Python Code: Numba vs. Cython), writers have considered various implementations of a pairwise distance computation, in which one wants to produce an N-by-N array of pairwise distances between N points in d dimensions, with the point coordinates stored in an N-by-d NumPy array. Readers interested in seeing NumPy, Numba, and Cython compared head-to-head, and in better understanding some of the performance implications of each, might find those posts useful.
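
As a flavor of the computation those posts benchmark, here is a minimal Numba-based sketch (not code from the posts themselves):

    import numpy as np
    from numba import jit

    @jit(nopython=True)
    def pairwise_distance(X):
        # X: an N-by-d array of point coordinates;
        # returns an N-by-N array of Euclidean distances
        N, d = X.shape
        D = np.empty((N, N))
        for i in range(N):
            for j in range(N):
                s = 0.0
                for k in range(d):
                    diff = X[i, k] - X[j, k]
                    s += diff * diff
                D[i, j] = np.sqrt(s)
        return D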

In addition, some extended online videos are available describing Numba and Cython in greater detail.

 