Speeding up Python programs with Numba and NumPy

When Python is not enough

The Python programming language is a great tool for almost any kind of rapid prototyping and quick development. It has many great features: it is high level, its syntax has almost human-level readability, it is cross-platform with a diverse standard library, and it is multi-paradigm, giving programmers the freedom to mix object-oriented, functional, and procedural styles as they see fit. However, sometimes a portion of our system has high performance requirements, and the speed that Python offers is not sufficient. So, how can we boost performance when all of our optimizations are no longer enough, without leaving the realm of Python for compiled languages such as C/C++ or JIT-compiled ones such as Java?

A possible solution, among others, is Numba, a runtime compiler that translates Python code to native instructions, letting us keep the conciseness and expressive power of Python while achieving native code speed.

What is Numba?

Numba is a library that performs JIT compilation, that is, it translates pure Python code to optimized machine code at runtime using the industry-standard LLVM compiler. It is also able to automatically parallelize loops and run them on multiple cores. Numba is cross-platform: it works on different operating systems (Linux, Windows, macOS) and different architectures (x86, x86_64, ppc64le, etc.). It can also run the same code on a GPU (NVIDIA CUDA or AMD ROC) and is compatible with Python 2.7 and 3.4-3.7. Overall, its most impressive feature is its simplicity of use, since we only need a few decorators to leverage the full power of JIT optimizations.
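
To give a taste of that simplicity, here is a minimal sketch (the function and its loop are purely illustrative, not from this post's benchmark):

from numba import jit

@jit(nopython=True)
def count_primes(n):
    # Plain Python loops like these are exactly what Numba compiles well.
    total = 0
    for candidate in range(2, n):
        is_prime = True
        for divisor in range(2, int(candidate ** 0.5) + 1):
            if candidate % divisor == 0:
                is_prime = False
                break
        if is_prime:
            total += 1
    return total

count_primes(10000)  # the first call triggers compilation; later calls reuse the machine code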

Numba modes and the @jit decorator

The most important instruction is the @jit decorator. It is this decorator that tells the compiler which mode to run in and with what configuration. Under the hood, the generated bytecode of our decorated functions, combined with the arguments that we specify in the decorator (such as the types of the input arguments), is analysed, optimized and finally compiled with LLVM, generating native machine instructions specially tailored to the CPU currently in use. This compiled version is then reused on each subsequent function call.

There are two important modes: nopython and object. The nopython mode completely avoids the Python interpreter and translates the full code to native instructions that can run without the help of Python. However, if for some reason that mode is not available (for example, when using unsupported Python features or external libraries), compilation falls back to object mode, which goes through the Python interpreter whenever it is unable to compile some code. Naturally, the nopython mode is the one that offers the best performance gains.
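
To make the two modes concrete, here is a small hedged sketch (the functions are illustrative): the first compiles in nopython mode, while the second explicitly requests object mode via the forceobj flag, which is useful when we knowingly rely on interpreter-level features:

from numba import jit

@jit(nopython=True)  # compilation fails outright if anything cannot be lowered to native code
def add(x, y):
    return x + y

@jit(forceobj=True)  # explicitly run in object mode: the interpreter handles unsupported parts
def describe(obj):
    # Calling repr() on an arbitrary Python object is interpreter territory.
    return 'value: ' + repr(obj)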

High-level architecture of Numba

The Numba translation process can be broken down into a set of important steps, ranging from bytecode analysis to the final machine code generation. The picture below illustrates this process, where the green boxes correspond to the frontend of the Numba compiler and the blue boxes belong to the backend.

The Numba compiler starts by doing an extensive analysis of the bytecode of the desired function(s). This step produces a graph describing the possible flows of execution, called the control flow graph (CFG). Based on this graph, the lifetimes of variables are analysed. With these steps finished, the compiler translates the bytecode into an intermediate representation (IR) on which Numba performs further optimizations and transformations. Afterwards, type inference, one of the most important steps, is performed: the compiler tries to infer the type of every variable. Furthermore, if the parallel setting is enabled, the IR code is transformed into an equivalent parallel version.

If all types are successfully inferred, the Numba IR code is then translated into efficient LLVM IR code that completely avoids the Python runtime. However, if the type inference process fails, the generated LLVM code will be slower, since it still needs to deal with calls to the Python C API. Finally, the LLVM IR code is compiled into native instructions by the LLVM JIT compiler. This optimized machine code is then loaded into memory and reused across multiple calls to the same function(s), making it potentially hundreds of times faster than pure Python.
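
As an illustration of the parallel transformation mentioned above, here is a hedged sketch using numba.prange, which marks a loop as a candidate for the parallel backend (the reduction itself is illustrative):

import numpy as np
from numba import njit, prange

@njit(parallel=True)  # njit is shorthand for jit(nopython=True)
def parallel_sum(arr):
    total = 0.0
    for i in prange(arr.shape[0]):  # prange tells Numba this loop may run across cores
        total += arr[i]             # simple reductions like this are recognized and parallelized
    return total

parallel_sum(np.random.rand(10 ** 6))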

For debugging purposes, Numba also offers a set of flags that can be enabled in order to see the output generated at each of these stages.

import os

# These must be set before numba is first imported,
# since the configuration is read at import time.
os.environ["NUMBA_DUMP_CFG"] = "1"               # dump the control flow graph
os.environ["NUMBA_DUMP_IR"] = "1"                # dump the Numba IR
os.environ["NUMBA_DUMP_ANNOTATION"] = "1"        # dump the type annotations
os.environ["NUMBA_DEBUG_ARRAY_OPT_STATS"] = "1"  # statistics on parallel transformations
os.environ["NUMBA_DUMP_LLVM"] = "1"              # dump the unoptimized LLVM IR
os.environ["NUMBA_DUMP_OPTIMIZED"] = "1"         # dump the LLVM IR after optimization
os.environ["NUMBA_DUMP_ASSEMBLY"] = "1"          # dump the final native assembly

Speeding up numerical computations: An example

The best use case for the Numba library is intensive numerical computation. As an example, let's compute the softmax function on a set of 10^8 (100 million) random numbers. The softmax function, which converts a set of real values into probabilities and is commonly used as the last layer in neural network architectures, is defined as:

\[ \sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \]

The following code shows two different implementations of this function: a pure Python approach (softmax_python) and an optimized version making use of NumPy and Numba (softmax_optimized):

import time
import math
import numpy as np
from numba import jit


@jit("f8(f8[:])", cache=False, nopython=True, nogil=True, parallel=True)
def esum(z):
    return np.sum(np.exp(z))


@jit("f8[:](f8[:])", cache=False, nopython=True, nogil=True, parallel=True)
def softmax_optimized(z):
    num = np.exp(z)
    s = num / esum(z)
    return s


def softmax_python(z):
    s = []

    exp_sum = 0
    for i in range(len(z)):
        exp_sum += math.exp(z[i])

    for i in range(len(z)):
        s += [math.exp(z[i]) / exp_sum]

    return s


def main():
    np.random.seed(0)
    z = np.random.uniform(0, 10, 10 ** 8)  # generate random floats in the range [0, 10)

    start = time.time()
    softmax_python(z.tolist())  # run the pure Python version of softmax
    elapsed = time.time() - start
    print('Ran pure python softmax calculations in {} seconds'.format(elapsed))

    softmax_optimized(z)  # warm-up call to trigger (and cache) the JIT compilation
    start = time.time()
    softmax_optimized(z)  # run the optimized version of softmax
    elapsed = time.time() - start
    print('\nRan optimized softmax calculations in {} seconds'.format(elapsed))


if __name__ == '__main__':
    main()

The result of the above script will be something similar to:

> python softmax.py

Ran pure python softmax calculations in 41.62234306335449 seconds

Ran optimized softmax calculations in 1.7453773021697998 seconds

These results clearly show the performance gains obtained when converting our code to something that Numba understands well.
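
For context, part of that gain comes from replacing Python loops with vectorized NumPy calls even before Numba enters the picture. A hedged NumPy-only baseline (not part of the original benchmark) that can be timed the same way:

import numpy as np

def softmax_numpy(z):
    num = np.exp(z)          # vectorized exponentials, no Python-level loop
    return num / num.sum()   # divide by the scalar sum of the exponentials

Timing this variant against softmax_optimized separates the speedup due to NumPy vectorization from the extra gain due to Numba's compilation and parallelization.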

The softmax_optimized function already carries the Numba decorator that leverages the full power of JIT optimizations. In fact, under the hood the following bytecode is what gets analysed, optimized and compiled to native instructions:

> python
>>> import dis
>>> from softmax import esum, softmax_optimized
>>> dis.dis(softmax_optimized)
 14           0 LOAD_GLOBAL              0 (np)
              2 LOAD_ATTR                1 (exp)
              4 LOAD_FAST                0 (z)
              6 CALL_FUNCTION            1
              8 STORE_FAST               1 (num)

 15          10 LOAD_FAST                1 (num)
             12 LOAD_GLOBAL              2 (esum)
             14 LOAD_FAST                0 (z)
             16 CALL_FUNCTION            1
             18 BINARY_TRUE_DIVIDE
             20 STORE_FAST               2 (s)

 16          22 LOAD_FAST                2 (s)
             24 RETURN_VALUE
>>> dis.dis(esum)
  9           0 LOAD_GLOBAL              0 (np)
              2 LOAD_ATTR                1 (sum)
              4 LOAD_GLOBAL              0 (np)
              6 LOAD_ATTR                2 (exp)
              8 LOAD_FAST                0 (z)
             10 CALL_FUNCTION            1
             12 CALL_FUNCTION            1
             14 RETURN_VALUE

We can provide more information about the expected input and output types through the signature. In the example above, the signature "f8[:](f8[:])" is used to specify that the function accepts an array of double precision floats and returns another array of 64-bit floats.

The signature could also use the explicit type names: "float64[:](float64[:])". The general form is type(type, type, ...), which resembles a typical function where argument names are replaced by their types and the function name is replaced by its return type. Numba accepts many different types, which are described below:

Type name(s)       Type short name   Description
boolean            b1                represented as a byte
uint8, byte        u1                8-bit unsigned byte
uint16             u2                16-bit unsigned integer
uint32             u4                32-bit unsigned integer
uint64             u8                64-bit unsigned integer
int8, char         i1                8-bit signed byte
int16              i2                16-bit signed integer
int32              i4                32-bit signed integer
int64              i8                64-bit signed integer
intc               -                 C int-sized integer
uintc              -                 C int-sized unsigned integer
intp               -                 pointer-sized integer
uintp              -                 pointer-sized unsigned integer
float32            f4                single-precision floating-point number
float64, double    f8                double-precision floating-point number
complex64          c8                single-precision complex number
complex128         c16               double-precision complex number

These annotations are easily extended to array forms using [:], [:, :] or [:, :, :] for 1, 2 and 3 dimensions, respectively.
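
For instance, a hedged sketch of a function compiled for 2-D input (the row-sum operation is illustrative):

import numpy as np
from numba import jit

@jit("f8[:](f8[:, :])", nopython=True)  # takes a 2-D float64 array, returns a 1-D float64 array
def row_sums(m):
    out = np.zeros(m.shape[0])
    for i in range(m.shape[0]):
        out[i] = m[i, :].sum()
    return out

row_sums(np.random.rand(4, 3))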

Final words

Python is a great tool. However, it has some limitations that, with the right strategy, can be overcome, making Python competitive with compiled languages in terms of performance. Although this post focused on the Numba library, there are other good options (which deserve their own dedicated blog post). Numba also offers other great features not explored here, such as integration with NVIDIA CUDA/AMD ROC GPUs, interfacing with C functions, and Ahead-of-Time (AOT) compilation. Finally, the example source code can be found here.