Welcome. This post is part of a series of Continuum
Analytics Open Notebooks showcasing our projects,
products, and services.
In this Continuum Open Notebook, you’ll learn more about how Numba works and
how it reduces your programming effort, and see that it achieves comparable
performance to C and Cython over a range of benchmarks.
If you are reading the blog form of this notebook, you can run the code yourself
on our cloud-based Python-in-the-browser app, Wakari.
Wakari gives you a full Scientific Python stack, right from your browser, and
allows you to write and share your own IPython Notebooks. Sign up for free!
How Does Numba Work?
Numba is a Continuum Analytics-sponsored
open source project. Numba’s job
is to make Python + NumPy code as fast as its C and Fortran
equivalents without sacrificing any of the power and flexibility of Python.
Python can be slower than C and Fortran because it features a generic, dynamic
object system. If you were to look at CPython’s C source code, you would
see that every object, even a simple integer constant, lives in a large, generic
PyObject structure. The Python interpreter has to unwind several layers of
abstraction each time it operates on a generic object. Let’s consider a simple
statement to demonstrate this concept:
c = a+b
We’ll assume that a and b are both floating-point numbers. Adding them
together is a single instruction on any modern CPU. This statement in C or
Fortran will usually generate just this single floating-point add instruction at
compile-time. At run-time, dispatching this instruction will likely only
require a single CPU cycle, and it will complete in less than five cycles.
The same statement in Python will generate dozens of instructions. Because a
and b are dynamically typed, the interpreter must first determine the type of
a and b, which will require lookups to memory of the a and b types.
Then the interpreter has to determine whether the type implements an __add__ method. A
new object, c, may need to be created. The creation of c requires a memory
allocation on the heap. Finally, the floating-point add operation is called,
and the result is stored in c. The many additional function calls are
responsible for the first order of magnitude of difference in performance
between Python and compiled languages such as C and Fortran. But it is the
memory allocations and dereferencing that are responsible for the next several
orders of magnitude of performance difference. Python does not feature a native
just-in-time compiler, so every time it sees this statement again (such as in a
for loop), it has to repeat all the work it just did.
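To make some of this dispatch machinery visible, we can disassemble the statement into CPython bytecode. This is a small illustrative sketch of our own, not a cell from the original notebook:

```python
import dis

def add(a, b):
    c = a + b
    return c

# The single add instruction (BINARY_ADD on older Pythons, BINARY_OP on
# newer ones) hides all of the type lookups, __add__ dispatch, and heap
# allocation described above -- the interpreter redoes that work on
# every execution of the statement.
dis.dis(add)
```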
Numba is our bridge between Python and performance. Numba takes over for the
Python interpreter on decorated functions and classes, and intelligently adds
type information to as many objects as possible in an expression. When Numba
can’t figure out what type an object is, it falls back to the same expensive
type queries the Python interpreter uses. Numba then compiles the Python and
NumPy functions and classes into performant code. Numba can compile just-in-time
with the autojit decorator, or ahead of time with the jit decorator.
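As a sketch of what a decorated function looks like, here is a loop sum under the jit decorator. We use the current Numba API, in which autojit’s call-time type inference has since been folded into jit; the try/except fallback is our own addition so the sketch still runs where Numba is not installed:

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # Fallback no-op decorator so this sketch runs without Numba installed;
    # with Numba present, jit compiles sum1d on first call per type signature.
    def jit(fn):
        return fn

@jit
def sum1d(y):
    total = 0.0
    for i in range(y.shape[0]):
        total += y[i]
    return total

print(sum1d(np.arange(5.0)))  # 10.0
```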
This notebook provides a benchmark comparison between Python, C interfaced
through ctypes, Cython, and Numba - all from an IPython notebook in the cloud
that you can run yourself!
The notebook is self-validating, with integrated
tests checking the correctness of each kernel function before timing it. We
encourage you to experiment with the code, try out new ideas, or even improve
the code performance or the benchmarks themselves. Feel free to reuse any of
this code for your own work.
We start by importing the libraries we need and defining a plotting function.
We also install an IPython extension, cmagic, for compiling C code using the
same compiler and flags that were used to build Python. By default,
we have hidden some of the longer code snippets. Click on the title to
view them inline.
Our first benchmark is a simple loop calculating a vector sum over $N$ values.
This is a native NumPy function, so we’ll define that first.
We have to write the same loop explicitly in Python.
Note that the Python code does not require us to specify what y is, beyond
that it must be indexable. Although Python’s dynamic types are flexible,
they are not performant.
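The hidden code cells amount to something like the following sketch (our reconstruction for readers of the blog form, not the notebook’s exact source):

```python
import numpy as np

def numpy_sum(y):
    # NumPy's vectorized reduction, implemented in C under the hood
    return np.sum(y)

def python_sum(y):
    # The same reduction as an explicit, interpreted Python loop;
    # y only needs to be indexable and have a length
    total = 0.0
    for i in range(len(y)):
        total += y[i]
    return total

y = np.arange(10)
assert python_sum(y) == numpy_sum(y) == 45
```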
Next, we define the Numba code.
With a single line of code, we create a high-performance but equally flexible
version of python_sum. When numba_sum is called with a numpy ndarray
object, numba_sum will execute at the same speed as C or Cython. Don’t
believe me? Let’s time it!
Note: We set the func_name attribute to numba_sum to distinguish it from
python_sum, the func_name inherited by default.
Here’s the C code we will compare against. See the notebook for
details on how it is interfaced using magic functions.
Notice that because C is statically typed, we have to state ahead of time what
the contents of y are.
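The need to declare types up front shows up on the Python side of a ctypes interface as well. As a small stand-in example (calling sqrt from the system C math library rather than the notebook’s compiled kernel, with a hypothetical library-name fallback for platforms where find_library fails):

```python
import ctypes
import ctypes.util

# Load the C math library; the name lookup is platform-dependent
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# Unlike Python, we must declare argument and return types ahead of time,
# or ctypes will silently assume ints and corrupt the result
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

print(libm.sqrt(2.0))  # 1.4142135623730951
```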
Next up is Cython.
Cython is an optimising static compiler for Python. This
Cython code will generate a Python extension
module. Note that the
Cython language is neither C nor Python, but a creole constructed from the two
languages. Again, see the notebook for details on how this code is
compiled and run using magic functions.
Correctly Measuring Performance
We will use the timeit module to handle our performance comparison. timeit
doesn’t have access to any of the variables in our namespace by default, so we
attach and retrieve them from the __main__ module.
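Concretely, the pattern looks something like this (a generic sketch; the kernel name is illustrative, not one of the benchmark functions):

```python
import sys
import timeit

def kernel(n):
    # toy stand-in for one of the benchmark kernels
    return sum(range(n))

# timeit runs its statement in a fresh namespace, so we attach the
# function to __main__ and import it back in the setup string
sys.modules["__main__"].kernel = kernel

t = timeit.Timer("kernel(1000)", setup="from __main__ import kernel")
elapsed = min(t.repeat(repeat=3, number=1000))
print(f"best of 3: {elapsed:.4f} s for 1000 calls")
```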
Here are the results of our first call to the timer for three of the benchmarks:
The first time we run the Numba code, we notice that it is much slower than C or
Cython. This is a feature. Remember that Numba is a just-in-time compiler;
this means the code is not compiled until the very last moment before
execution (hence the name). If we re-run the Numba code a second time:
We see that execution time is consistently much faster. Numba only pays the
cost of compiling once for a given type of function arguments. Numba caches the
results of compilation between function calls and recognizes
that numba_sum has been called with an integer array previously,
saving a recompile!
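This caching behaviour can be sketched with a toy dispatcher that "compiles" once per argument-type signature. This is a loose analogy of our own, not Numba’s implementation; the real cache keys on NumPy dtypes and stores generated machine code:

```python
compile_count = 0

def specialize(fn):
    cache = {}
    def wrapper(*args):
        global compile_count
        sig = tuple(type(a) for a in args)   # e.g. (int, int) vs (float, float)
        if sig not in cache:
            compile_count += 1               # the expensive "compile" step
            cache[sig] = fn                  # a real JIT stores generated code here
        return cache[sig](*args)
    return wrapper

@specialize
def add(a, b):
    return a + b

add(1, 2)
add(3, 4)        # same signature: cache hit, no recompile
add(1.0, 2.0)    # new signature: one more compile
print(compile_count)  # 2
```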
Let’s see how NumPy, Python, Cython, Numba, and C actually stack up!
Whoa! We’re not going to have time to run Python on big arrays, so let’s drop
it from the rest of the comparison.
Cython and Numba both do very well for small arrays, although Numba eventually
loses some performance for very large arrays. In this case, Numba is not quite
as fast as C or Cython for very large problems; this will be addressed in an
upcoming release.
For loop benchmark (Floating-Point)
This next benchmark demonstrates the true flexibility of Numba. We don’t need
to modify the Numba code at all; we simply pass an array of doubles this
time instead of integers. In both the C and the Cython code, we have to write
new functions with a different type for y.
Whoa! Whoa! Easy with the pitchforks and torches! I have delicate skin!
Yes, we know that this problem could be solved by using typedefs and macros,
or templates in C++. But we would still need to have multiple functions, one
for each possible case, and this would quickly explode combinatorially for
combinations of multiple options. Besides, the whole point of this exercise is
to get performance while keeping the developer’s job as simple as possible.
One of Python’s greatest attributes is its support for clean, generic functions.
Numba really shines here: it supports generic functions while still delivering
native performance.
We measure the performance over a range of vector sizes.
From a performance perspective, the different versions behave almost
identically for large vectors. Both Cython and Numba really shine for smaller
array sizes, though, outperforming even NumPy!
Artificial benchmarks always leave us with a minor sense of dissatisfaction,
similar to the feeling we’re left with after eating hot dogs made out of that
unidentifiable bright red meat. Let’s go back to a real application kernel and
consider the impacts of using Numba there.
Next, we autojit the pure Python kernels to create accelerated Numba variants.
If we want maximum performance, we need to autojit the two functions used in
iterating over the loop. We could have removed the functions themselves, but
they help improve the readability of the code. Currently, Numba does not
support inlining (Pull Requests
welcome!), which makes it more challenging to put function calls in innermost
loops.
We don’t observe a significant performance difference among the C, Cython, and
Numba kernels. Of course, only one of the three is written in clean, dynamic
Python. :)
In this open notebook, we compared Numba against Cython and C. First, we
explored some simple benchmarks. Then, we returned to the GrowCut example. Our
experiments reveal that Numba performs as well as Cython and native C interfaced
directly into Python. At the same time, the Numba code is clearly the easiest
to understand and write from a Python programmer’s perspective.
We still have much more ground to cover, including how the professional version
of Numba, NumbaPro, can accelerate code on
GPUs. NumbaPro is available as part of
our Anaconda Accelerate product.
At the request of several commenters, here are a test script and benchmarks that we ran on PyPy and Anaconda Python (with Numba). The results are not tuned (I am not a PyPy expert!), so we did not post them in the blog, and we’d be happy to look deeper into this with the PyPy developers. While PyPy is not currently installed on Wakari, we are looking at a number of ways we can install and support the PyPy community.