Aron Ahmadia

Accelerating Python Libraries with Numba (Part 2)

Introduction

Welcome. This post is part of a series of Continuum Analytics Open Notebooks showcasing our projects, products, and services.

In this Continuum Open Notebook, you’ll learn more about how Numba works and how it reduces your programming effort, and see that it achieves comparable performance to C and Cython over a range of benchmarks.

If you are reading the blog form of this notebook, you can run the code yourself on our cloud-based Python-in-the-browser app, Wakari. Wakari gives you a full Scientific Python stack, right from your browser, and allows you to write and share your own IPython Notebooks. Sign up for free here.

How Does Numba Work?

Numba is a Continuum Analytics-sponsored open source project. Numba’s job is to make Python + NumPy code as fast as its C and Fortran equivalents without sacrificing any of the power and flexibility of Python. Python can be slower than C and Fortran because it features a generic, dynamic object system. If you look at CPython’s C source code, you will see that every object, even a simple integer constant, lives in a large, generic PyObject structure. The interpreter has to unwind several layers of abstraction each time it operates on a generic object. Let’s consider a simple statement to demonstrate this concept:

c = a+b

We’ll assume that a and b are both floating-point numbers. Adding them together is a single instruction on any modern CPU. This statement in C or Fortran will usually generate just this single floating-point add instruction at compile-time. At run-time, dispatching this instruction will likely only require a single CPU cycle, and it will complete in less than five cycles.

The same statement in Python generates dozens of instructions. Because a and b are dynamically typed, the interpreter must first determine their types, which requires memory lookups. The interpreter then has to check whether the type provides an add method. A new object, c, may need to be created, which requires a memory allocation on the heap. Finally, the floating-point add operation is executed and the result is stored in c. The many additional function calls account for the first order of magnitude of the performance difference between Python and compiled languages such as C and Fortran, but it is the memory allocations and dereferencing that account for the next several orders of magnitude. Because Python does not feature a native just-in-time compiler, every time it sees this statement again (such as in a for loop), it has to repeat all the work it just did.
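You can see this machinery directly with the standard library (a quick aside, independent of the benchmarks below): sys.getsizeof reports the full size of the PyObject wrapping an integer, and dis shows the bytecode the interpreter dispatches for our one-line statement.

```python
import dis
import sys

# A C int is 4 bytes; a Python int is a full PyObject carrying a
# reference count and a type pointer (typically 28 bytes on 64-bit CPython 3).
print(sys.getsizeof(1))

# The disassembly of `c = a + b` shows the generic machinery: load two
# generic objects, dispatch a BINARY add through their types, bind `c`.
dis.dis(compile("c = a + b", "<example>", "exec"))
```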

Numba is our bridge between Python and performance. Numba takes over for the Python interpreter on decorated functions and classes, and intelligently adds type information to as many objects as possible in an expression. When Numba can’t figure out what type an object is, it falls back to the same expensive type queries the Python interpreter uses. Numba then compiles the Python and NumPy functions and classes into performant code. Numba can compile just-in-time with the autojit decorator, or ahead of time with the jit decorator.

Benchmarking

This notebook provides a benchmark comparison between Python, C interfaced through ctypes, Cython, and Numba - all from an IPython notebook in the cloud that you can run yourself! The notebook is self-validating, with integrated tests checking the correctness of each kernel function before timing it. We encourage you to experiment with the code, try out new ideas, or even improve the code performance or the benchmarks themselves. Feel free to reuse any of this code for your own work.

We will start with two variants of a vector sum benchmark, then look at the GrowCut benchmark introduced in the last notebook.

Setup

We start by importing the libraries we need and defining a plotting function. We also install an IPython extension, cmagic, for compiling C code using the same compiler and flags that were used to build Python. By default, we have hidden some of the longer code snippets. Click on the title to view them inline.

%pylab inline
 
# for compiling Cython inline
%load_ext cythonmagic
# for compiling C code inline (you only need to install this once)
%install_ext https://gist.github.com/ahmadia/5600293/raw/f1bb89de5dfa1f2cc67201039b253949529e96a0/cmagic.py
%load_ext cmagic
 
import matplotlib.pyplot as plt
 
# Adjust the SIZEX and SIZEY variables here if you would
# like larger or smaller images.
SIZEX, SIZEY = 10, 10
plt.rcParams["figure.figsize"] = SIZEX, SIZEY
 
from timeit import timeit
from numba import autojit
import math
import ctypes
import numpy as np
import numpy.testing as npt
from matplotlib import ticker
 

def time_and_plot(algo, Ns, dtype, func_names, time_func, xlabel, ylabel, yscale="linear", yscaling=1):
    """Plot Timing Comparison

    Timing Parameters
    -----------------
    algo : str
        Name of the algorithm being timed. Example: "Matrix-Matrix Multiply".
    Ns : indexable, int
        Sizes of arguments to be passed to timed functions.
    dtype : numpy.dtype
        Type of arguments to pass to timed functions.
    func_names : sequence of str
        Names of the functions in the __main__ namespace to be timed.
    time_func : function
        Timing function. See time_func.

    Plot Parameters
    ---------------
    xlabel : str
        See plt.xlabel.
    ylabel : str
        See plt.ylabel.
    yscale : str, optional, defaults to "linear"
        See plt.set_yscale.
    yscaling : integer, optional, defaults to 1
        Ratio to multiply y data values by.
    """

    data = np.empty((len(func_names), len(Ns)), dtype=np.float)
    for k in xrange(len(func_names)):
        for i in xrange(len(Ns)):
            data[k, i] = time_func(func_names[k], Ns[i], dtype)

    plt.clf()
    fig1, ax1 = plt.subplots()
    w, h = fig1.get_size_inches()
    fig1.set_size_inches(w*1.5, h)
    ax1.set_xscale("log")
    ax1.get_xaxis().set_major_formatter(ticker.FormatStrFormatter("%d"))
    ax1.get_xaxis().set_minor_locator(ticker.NullLocator())
    ax1.set_yscale(yscale)
    plt.setp(ax1.get_xticklabels(), fontsize=14)
    plt.setp(ax1.get_yticklabels(), fontsize=14)

    ax1.grid(color="lightgrey", linestyle="--", linewidth=1, alpha=0.5)

    plt.xlabel(xlabel, fontsize=24)
    plt.ylabel(ylabel, fontsize=24)
    plt.xlim(Ns[0]*.9, Ns[-1]*1.1)
    plt.suptitle("%s Performance" % (algo), fontsize=24)

    for k in xrange(len(func_names)):
        plt.plot(Ns, data[k, :]*yscaling, "o-", linewidth=2, markersize=5, label=func_names[k])
    plt.legend(loc="upper right", fontsize=18)


For loop benchmark (Integer)

Our first benchmark is a simple loop calculating a vector sum over $N$ values. NumPy provides this operation natively, so we’ll define that version first.

def numpy_sum(y):
    return y.sum()


We have to write the same loop explicitly in Python.

def python_sum(y):
    N = len(y)
    x = y[0]
    for i in xrange(1, N):
        x += y[i]
    return x


Note that the Python code does not require us to specify what y is, beyond that it must be indexable. Although the Python dynamic types are flexible, they are not performant.

Next, we define the Numba code.

numba_sum = autojit()(python_sum)
numba_sum.func_name = "numba_sum"


With a single line of code, we create a high-performance but equally flexible version of python_sum. When numba_sum is called with a numpy ndarray object, numba_sum will execute at the same speed as C or Cython. Don’t believe me? Let’s time it!

Note: We set the func_name attribute to numba_sum to distinguish it from python_sum, the func_name inherited by default.

Here’s the C code we will compare against. See the notebook for details on how it is interfaced using magic functions.

int _c_int_sum(int N, int *y) {
    int i;
    int x = y[0];
    for (i=1; i<N; i++)
    {
        x += y[i];
    }
    return x;
}


Notice that because C is statically typed, we have to state ahead of time what the contents of y are.

Next up is Cython.

#cython: boundscheck=False
#cython: nonecheck=False
#cython: wraparound=False

def cython_int_sum(int[:] y):
    cdef int i
    cdef int N = y.shape[0]
    cdef int x = y[0]
    for i in xrange(1, N):
        x += y[i]
    return x


Cython is an optimising static compiler for Python. This Cython code will generate a Python extension module. Note that the Cython language is neither C nor Python, but a creole constructed from the two languages. Again, see the notebook for details on how this code is compiled and run using magic functions.

Correctly Measuring Performance

We will use the timeit module to handle our performance comparison. timeit doesn’t have access to any of the variables in our namespace by default, so we attach and retrieve them from the __main__ module.
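In miniature, the pattern works like this (data is a hypothetical stand-in for the arrays we time below):

```python
from timeit import timeit
import __main__

# Attach the variable to __main__ so timeit's stmt string can find it.
__main__.data = list(range(100))

t = timeit(stmt="sum(__main__.data)",
           setup="import __main__",
           number=1000)
print(t / 1000)  # average seconds per call
```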

def time_sum_func(func_name, N=10000, dtype=np.int32, trials=100):
    """Timeit Helper Function (Simple One-D Functions)

    Parameters
    ----------
    func_name : str
        Name of the function in the __main__ namespace to be timed.
    N : int, optional, defaults to 10000
        Size of np.ones array to construct and pass to function.
    dtype : np.dtype
        Type of array to construct and pass to function.
    trials : int, optional, defaults to 100
        This parameter is passed to timeit.

    Returns
    -------
    func_time : float
        Average execution time over all trials.
    """
    import __main__
    __main__.y = np.ones((N), dtype=dtype)
    return (timeit(stmt="func(__main__.y)",
                   setup="import __main__; from __main__ import %s as func" % func_name,
                   number=trials)/trials)


Here are the results of our first call to the timer for three of the benchmarks:

print "Numba (first call) (ms): %g" % (time_sum_func("numba_sum")*1e3)
print "Cython (ms): %g" % (time_sum_func("cython_int_sum")*1e3)
print "C (ms): %g" % (time_sum_func("c_int_sum")*1e3)
 
Numba (first call) (ms): 0.944879
Cython (ms): 0.00568867
C (ms): 0.0138402


The first time we run the Numba code, we notice that it is much slower than C or Cython. This is a feature. Remember that Numba is a just-in-time compiler; this means the code is not compiled until the very last moment before execution (thus the just in time). If we re-run the Numba code a second time:

print "Numba (subsequent) (ms): %g" % (time_sum_func("numba_sum")*1e3)
print "Cython (ms): %g" % (time_sum_func("cython_int_sum")*1e3)
print "C (ms): %g" % (time_sum_func("c_int_sum")*1e3)
 

Numba (subsequent) (ms): 0.0144315
Cython (ms): 0.00579834
C (ms): 0.0180197


We see that execution time is consistently much faster. Numba only pays the cost of compiling once for a given type of function arguments. Numba caches the results of compilation between function calls and recognizes that numba_sum has been called with an integer array previously, saving a recompile!
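The caching behavior can be sketched in pure Python. This toy decorator (an illustration only, not Numba’s actual machinery) pays a one-time “compile” per argument type and reuses the cached specialization afterwards:

```python
# Toy sketch of per-argument-type caching, NOT Numba's real implementation:
# the first call with a new argument type pays a one-time "compile" cost;
# later calls with the same type reuse the cached specialization.

_cache = {}
compile_count = 0


def toy_autojit(func):
    def wrapper(arg):
        global compile_count
        key = (func.__name__, type(arg))
        if key not in _cache:
            compile_count += 1   # stands in for the expensive compile step
            _cache[key] = func   # a real JIT would store machine code here
        return _cache[key](arg)
    return wrapper


@toy_autojit
def total(xs):
    s = 0
    for x in xs:
        s += x
    return s


total([1, 2, 3])      # first call: "compiles" a list specialization
total([4, 5, 6])      # same argument type: cache hit, no recompile
print(compile_count)  # 1
```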

Performance Comparison

Let’s see how NumPy, Python, Cython, Numba, and C actually stack up!

Ns = np.logspace(6, 12, num=15, base=2)
func_names = "cython_int_sum", "numba_sum", "c_int_sum", "numpy_sum", "python_sum"
time_and_plot("Simple Integer Sum", Ns, np.int32, func_names,
              time_sum_func, "Array Size", "Time: ms", yscale="log", yscaling=1e3)


Whoa! We’re not going to have time to run Python on big arrays, so let’s drop it from the rest of the comparison.

Ns = np.logspace(10, 24, num=15, base=2)
func_names = "cython_int_sum", "numba_sum", "c_int_sum", "numpy_sum"
time_and_plot("Simple Integer Sum", Ns, np.int32, func_names, time_sum_func, "Array Size", "Time: s", yscale="log", yscaling=1)


Cython and Numba both do very well for small arrays, although Numba eventually loses some performance for very large arrays. Numba is not quite as fast as C or Cython for very large problems in this case; this will be addressed in an upcoming release.

For loop benchmark (Floating-Point)

This next benchmark demonstrates the true flexibility of Numba. We don’t need to modify the Numba code at all; we simply pass an array of doubles this time instead of integers. In both the C and the Cython code, we have to write new functions with a different type for y.

Whoa! Whoa! Easy with the pitchforks and torches! I have delicate skin!

Yes, we know that this problem could be solved by using typedefs and macros, or templates in C++. But we would still need to have multiple functions, one for each possible case, and this would quickly explode combinatorially for combinations of multiple options. Besides, the whole point of this exercise is to get performance while keeping the developer’s job as simple as possible.

One of Python’s greatest attributes is its support for clean, generic functions. Numba really shines in supporting generic functions while providing performance through autojit.
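To make the point concrete, here is the same generic loop as python_sum applied to several element types; nothing changes but the data:

```python
# One generic function; Python's dynamic dispatch handles each element type.
def generic_sum(y):
    x = y[0]
    for i in range(1, len(y)):
        x += y[i]
    return x


print(generic_sum([1, 2, 3, 4]))     # 10  (integers)
print(generic_sum([1.5, 2.5, 3.0]))  # 7.0 (doubles)
print(generic_sum(["a", "b", "c"]))  # abc (even strings concatenate)
```

Numba’s autojit compiles one specialization per argument type behind the scenes, so the source stays this simple while each call runs at compiled speed.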

double _c_double_sum(int N, double *y) {
    int i;
    double x = y[0];
    for (i=1; i<N; i++)
    {
        x += y[i];
    }
    return x;
}


%%cython

#cython: boundscheck=False
#cython: nonecheck=False
#cython: wraparound=False

def cython_double_sum(double[:] y):
    cdef int i
    cdef int N = y.shape[0]
    cdef double x = y[0]
    for i in xrange(1, N):
        x += y[i]
    return x


Performance Comparison

We measure the performance over a range of vector sizes.

Ns = np.logspace(6, 12, num=15, base=2)
func_names = "cython_double_sum", "numba_sum", "c_double_sum", "numpy_sum", "python_sum"
time_and_plot("Simple Floating-Point Sum", Ns, np.float, func_names,
              time_sum_func, "Vector Size", "Time: ms", yscale="log",
              yscaling=1e3)


Ns = np.logspace(10, 24, num=15, base=2)
func_names = "cython_double_sum", "numba_sum", "c_double_sum", "numpy_sum"
time_and_plot("Simple Floating-Point Sum", Ns, np.float, func_names,
              time_sum_func, "Vector Size", "Time: (s) Log-Scale", yscale="log", yscaling=1)


From a performance perspective, the different versions behave almost identically for large vectors. Both Cython and Numba really shine for smaller array sizes, though, even outperforming NumPy!

GrowCut benchmark

Artificial benchmarks always leave us with a minor sense of dissatisfaction, similar to the feeling we’re left with after eating hot dogs made out of that unidentifiable bright red meat. Let’s go back to a real application kernel and consider the impact of using Numba there.

We revisit the GrowCut kernel introduced in the prior entry in this series.

First we do a pure Python implementation.

def window_floor(idx, radius):
    if radius > idx:
        return 0
    else:
        return idx - radius


def window_ceil(idx, ceil, radius):
    if idx + radius > ceil:
        return ceil
    else:
        return idx + radius


def python_kernel(image, state, state_next, window_radius):
    changes = 0
    sqrt_3 = math.sqrt(3.0)

    height = image.shape[0]
    width = image.shape[1]

    for j in xrange(width):
        for i in xrange(height):

            winning_colony = state[i, j, 0]
            defense_strength = state[i, j, 1]

            for jj in xrange(window_floor(j, window_radius),
                             window_ceil(j+1, width, window_radius)):
                for ii in xrange(window_floor(i, window_radius),
                                 window_ceil(i+1, height, window_radius)):
                    if (ii == i and jj == j):
                        continue

                    d = image[i, j, 0] - image[ii, jj, 0]
                    s = d * d
                    for k in range(1, 3):
                        d = image[i, j, k] - image[ii, jj, k]
                        s += d * d
                    gval = 1.0 - math.sqrt(s)/sqrt_3

                    attack_strength = gval * state[ii, jj, 1]

                    if attack_strength > defense_strength:
                        defense_strength = attack_strength
                        winning_colony = state[ii, jj, 0]
                        changes += 1

            state_next[i, j, 0] = winning_colony
            state_next[i, j, 1] = defense_strength

    return changes


Next, we autojit the pure Python kernels to create accelerated Numba variants.

If we want maximum performance, we need to autojit the two helper functions used to compute the loop bounds. We could have inlined them by hand, but they improve the readability of the code. Currently, Numba does not support inlining (Pull Requests welcome!), which makes it more challenging to use function calls in innermost loops without losing performance.

window_ceil  = autojit()(window_ceil)
window_floor = autojit()(window_floor)
numba_kernel = autojit()(python_kernel)
numba_kernel.func_name = "numba_kernel"

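As an aside, the cost of a call in an innermost loop is easy to demonstrate in pure Python; in this sketch, the hypothetical clamp_lo helper plays the role of window_floor:

```python
from timeit import timeit


# Hypothetical helper mirroring window_floor: clamp an index at zero.
def clamp_lo(idx, radius):
    return 0 if radius > idx else idx - radius


def with_calls(n):
    # Helper invoked in the innermost loop: pays call dispatch every iteration.
    s = 0
    for i in range(n):
        s += clamp_lo(i, 3)
    return s


def inlined(n):
    # Same logic written inline: no per-iteration call overhead.
    s = 0
    for i in range(n):
        s += 0 if 3 > i else i - 3
    return s


assert with_calls(1000) == inlined(1000)
print(timeit(lambda: with_calls(1000), number=200))
print(timeit(lambda: inlined(1000), number=200))
```

On a typical CPython build the hand-inlined variant runs measurably faster; Numba faces the same trade-off when it cannot inline helper calls.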
Here’s the equivalent Cython implementation of the GrowCut kernel.

%%cython

# Note: This code is not in the public domain. This implementation of GrowCut
# is used under the terms of the BSD license, and is available from
# https://github.com/stefanv/growcut_py

#cython: cdivision=True
#cython: boundscheck=False
#cython: nonecheck=False
#cython: wraparound=False

from __future__ import division

import numpy as np
cimport cython
cimport numpy as cnp

cdef extern from "math.h" nogil:
    double sqrt(double)


cdef inline double distance(double[:, :, ::1] image,
                            Py_ssize_t r0, Py_ssize_t c0,
                            Py_ssize_t r1, Py_ssize_t c1) nogil:
    cdef:
        double s = 0, d
        int i

    for i in range(3):
        d = image[r0, c0, i] - image[r1, c1, i]
        s += d * d

    return sqrt(s)


cdef double s3 = sqrt(3)

cdef inline double g(double d) nogil:
    return 1 - (d / s3)


def cython_kernel(double[:, :, ::1] image, double[:, :, ::1] state,
                  double[:, :, ::1] state_next, Py_ssize_t window_radius):

    cdef:
        Py_ssize_t i, j, ii, jj, width, height
        double gc, attack_strength, defense_strength, winning_colony
        int changes

    height, width = image.shape[0], image.shape[1]

    changes = 0

    for j in range(width):
        for i in range(height):

            winning_colony = state[i, j, 0]
            defense_strength = state[i, j, 1]

            for jj in xrange(max(0, j - window_radius), min(j + window_radius + 1, width)):
                for ii in xrange(max(0, i - window_radius), min(i + window_radius + 1, height)):
                    if ii == i and jj == j:
                        continue

                    # p -> current cell, (i, j)
                    # q -> attacker, (ii, jj)

                    gc = g(distance(image, i, j, ii, jj))

                    attack_strength = gc * state[ii, jj, 1]

                    if attack_strength > defense_strength:
                        defense_strength = attack_strength
                        winning_colony = state[ii, jj, 0]
                        changes += 1

            state_next[i, j, 0] = winning_colony
            state_next[i, j, 1] = defense_strength

    return changes


Finally, here’s a pure C version of the GrowCut kernel, ready to interface into Python.

%%c _c_kernel

#include "Python.h"
#include "math.h"

#define image(row, column, channel) (_image[row*width*3 + column*3 + channel])
#define state(row, column, idx) (_state[row*width*2 + column*2 + idx])
#define state_next(row, column, idx) (_state_next[row*width*2 + column*2 + idx])

double distance(double*, Py_ssize_t, Py_ssize_t, Py_ssize_t, Py_ssize_t, Py_ssize_t);
double g(double);
double s3;

Py_ssize_t _window_floor(Py_ssize_t, Py_ssize_t);
Py_ssize_t _window_ceil(Py_ssize_t, Py_ssize_t, Py_ssize_t);

int _c_kernel(double *_image, double *_state, double *_state_next,
              Py_ssize_t window_radius, Py_ssize_t width, Py_ssize_t height)
{
    Py_ssize_t i, j, ii, jj;
    double gc, attack_strength, defense_strength, winning_colony;
    int changes;

    s3 = sqrt(3.0);
    changes = 0;

    for (j=0; j < width; j++) {
        for (i=0; i < height; i++) {

            winning_colony = state(i, j, 0);
            defense_strength = state(i, j, 1);

            for (jj = _window_floor(j, window_radius); jj < _window_ceil(j+1, width, window_radius); jj++) {
                for (ii = _window_floor(i, window_radius); ii < _window_ceil(i+1, height, window_radius); ii++) {
                    if (ii == i && jj == j) {
                        continue;
                    }
                    // p -> current cell, (i, j)
                    // q -> attacker, (ii, jj)

                    gc = g(distance(_image, i, j, ii, jj, width));

                    attack_strength = gc * state(ii, jj, 1);

                    if (attack_strength > defense_strength) {
                        defense_strength = attack_strength;
                        winning_colony = state(ii, jj, 0);
                        changes += 1;
                    }
                }
            }
            state_next(i, j, 0) = winning_colony;
            state_next(i, j, 1) = defense_strength;
        }
    }
    return changes;
}

Py_ssize_t _window_floor(Py_ssize_t idx, Py_ssize_t radius)
{
    if (radius > idx) {
        return 0;
    }
    else {
        return idx - radius;
    }
}

Py_ssize_t _window_ceil(Py_ssize_t idx, Py_ssize_t ceil, Py_ssize_t radius)
{
    if (idx + radius > ceil) {
        return ceil;
    }
    else {
        return idx + radius;
    }
}

double distance(double *_image,
                Py_ssize_t r0, Py_ssize_t c0,
                Py_ssize_t r1, Py_ssize_t c1,
                Py_ssize_t width)
{
    double s = 0, d;
    int i;

    for (i=0; i<3; i++) {
        d = image(r0, c0, i) - image(r1, c1, i);
        s += d * d;
    }
    return sqrt(s);
}

double g(double d)
{
    return 1 - (d / s3);
}


Performance Comparison

We time the verified kernels over a range of square image sizes. Again, very small images in Python, then larger images in Cython, Numba, and C.

def time_kernel(k_name, N, dtype):
    """Timeit Helper Function (GrowCut Functions)

    Parameters
    ----------
    k_name : str
        Name of the GrowCut kernel in the __main__ namespace to be timed.
    N : int
        Size of image arrays to construct and pass to function.
    dtype : np.dtype
        Type of arrays to construct and pass to function.

    Returns
    -------
    func_time : float
        Average execution time over 3 trials
    """

    import __main__
    image = np.zeros((N, N, 3), dtype=dtype)
    state = np.zeros((N, N, 2), dtype=dtype)
    state_next = np.empty_like(state)

    # colony 1 is strength 1 at position 0,0
    # colony 0 is strength 0 at all other positions
    state[0, 0, 0] = 1
    state[0, 0, 1] = 1

    __main__.image = image
    __main__.state = state
    __main__.state_next = state_next

    trials = 3

    return timeit(stmt="kernel(__main__.image, __main__.state, __main__.state_next, 10)",
                  setup="from __main__ import %s as kernel; import __main__" % k_name,
                  number=trials)/trials


Ns = np.logspace(2, 6, num=5, base=2)

kernel_names = 'cython_kernel', 'numba_kernel', 'c_kernel', 'python_kernel'
time_and_plot('GrowCut', Ns, np.double, kernel_names, time_kernel,
              'Image Size: NxN', 'Time: s', yscale='log', yscaling=1)


Ns = np.logspace(3, 10, num=8, base=2)

kernel_names = 'cython_kernel', 'numba_kernel', 'c_kernel'
time_and_plot('GrowCut', Ns, np.float64, kernel_names, time_kernel,
              'Image Size: NxN', 'Time: s', yscale='log', yscaling=1)


We don’t observe a significant performance difference between the C, Cython, and Numba kernels. Of course, only one of the three kernels is written in clean, dynamic Python :)

Conclusion

In this open notebook, we compared Numba against Cython and C. First, we explored some simple benchmarks. Then, we returned to the GrowCut example. Our experiments reveal that Numba performs as well as Cython and native C interfaced directly into Python. At the same time, the Numba code is clearly the easiest to understand and write from a Python programmer’s perspective.

We still have much more ground to cover, including how the professional version of Numba, NumbaPro, can accelerate code on GPUs. NumbaPro is available as part of our Anaconda Accelerate product.

Update:

At the request of several commenters, here is a test script and benchmarks that we ran on PyPy and Anaconda Python (with Numba). The results are not tuned (I am not a PyPy expert!) so we did not post them in the blog. We’d be happy to look deeper into this with the PyPy developers. While PyPy is not currently installed on Wakari, we are looking at a number of ways we can install and support the PyPy community.

Acknowledgements

This notebook features the GrowCut algorithm, which was introduced in “‘GrowCut’ - Interactive Multi-Label N-D Image Segmentation By Cellular Automata” by V. Vezhnevets and V. Konouchine. Credit for the original native Python code and the accelerated Cython code is due to Dr. Nathan Faggian and Dr. Stefan van der Walt. The Cython implementation is used under the terms of the BSD license.

Tags: Numba IPython Notebook Python ImageProcessing