Benjamin Zaitlen

The Python and The Compiled Python

In Peter Wang’s SciPy Evangelism 101 talk he discussed “The Truth About Good Programmers.” He says good programmers are

  • Lazy
    • (in a good way)
    • just want things to work
  • Temperamental Tinkerers
    • Obsessed with details
    • (Sometimes)
  • Spoiled kids who just want to have fun
    • And sometimes create Fortune 100 companies or discover new physics

Scientific and financial programmers, I would argue, share many of the above qualities, but are ultimately more interested in the scientific or financial questions they want to answer than in programming itself.

For Python programmers of the world — both scientific and non-scientific — the efficiency and speed deficits of pure Python detract from what should otherwise be a happy marriage of language and programmer.

One commonly suggested solution to close the performance gap is to use statically compiled languages like C or FORTRAN. While these are fast, there is a big tradeoff in agility and ease of use, and iteration cycles are almost always longer. The CPython API is also a popular choice: you get the ease of use of Python and the speed of C. But development time can be considerably longer than using just the statically compiled language. Additionally, it's another layer between concept and code.

Auto-Magical Python

Many scientific programmers are aware of tools like Cython, which extends Python syntax with static typing, and weave, which allows embedding C code as large, opaque strings inside Python. Travis Oliphant discussed and profiled some of these approaches last year on his blog, as they pertained to a specific problem. In general, while these approaches can be effective, they still impose a bit of a learning curve, and there is often no way around having to debug the shim layer between the high and low levels.

This year, we announced the Numba project, a NumPy-aware dynamic/just-in-time compiler for pure Python. Numba uses LLVM to transform and compile pure Python (combined with type information from NumPy) to machine code, and can achieve speeds comparable to hand-coded C, C++, and FORTRAN. Let's use the same setup as Travis's post from June of last year to see how Numba fares. Because Cython is a popular and obvious solution for speeding up Python computations, we'll compare Numba to it, as others have done.

We want to solve the Diffusion Equation in 2D

\( \frac{\partial{c}}{\partial{t}} = D\nabla^{2}c \)

using a Forward-Euler Approximation:

\(u_{i,j}^{n+1} = u_{i,j}^{n} + \mu\left(\frac{u_{i+1,j}^{n} - 2u_{i,j}^{n} + u_{i-1,j}^{n}}{\Delta x^{2}} + \frac{u_{i,j+1}^{n} - 2u_{i,j}^{n} + u_{i,j-1}^{n}}{\Delta y^{2}}\right)\)

where \(\mu = D\Delta t\). The implementations below take \(\Delta x = \Delta y = 1\), so the denominators drop out and \(\mu = 0.1\) in the code.

As Travis also points out, we can use some clever slicing and memory views of NumPy arrays to solve the above equation, so let's add a NumPy-based solution to our speed comparisons.

We want to measure how quickly each method of compiled Python (Cython, Numba, NumPy) completes 1000 iterations on increasingly large 2D arrays. Each method and array size was run 5 times to produce a good measure of the average time.
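The exact timing harness isn't listed in the post, but a minimal sketch of the loop just described might look like the following. The `benchmark` helper and the initial condition are assumptions; it is written against the Numba-style `solver(u, tempU, iterNum)` signature shown below, and the Cython and NumPy drivers would differ slightly.

import time
import numpy as np

def benchmark(solver, sizes, iterNum=1000, repeats=5):
    # Returns one row of [N, mean seconds, std dev] per grid size,
    # mirroring the arrays printed further down.
    results = []
    for n in sizes:
        times = []
        for _ in range(repeats):
            u = np.zeros((n, n))
            u[n // 2, n // 2] = 1000.0     # hypothetical point-source initial condition
            tempU = np.zeros_like(u)
            start = time.time()
            solver(u, tempU, iterNum)      # e.g. diffuseNumba below
            times.append(time.time() - start)
        results.append([n, np.mean(times), np.std(times)])
    return np.array(results)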

from numba import jit, double, int32   # import path for the 2012-era Numba API

mu = 0.1   # D * dt, matching the 0.1 constant in the Cython version

@jit(argtypes=[double[:,:], double[:,:], int32])
def diffuseNumba(u, tempU, iterNum):
    row = u.shape[0]
    col = u.shape[1]
    for n in range(iterNum):
        # Forward-Euler update on the interior points
        for i in range(1, row - 1):
            for j in range(1, col - 1):
                tempU[i, j] = u[i, j] + mu * (u[i + 1, j] - 2*u[i, j] + u[i - 1, j] +
                                              u[i, j + 1] - 2*u[i, j] + u[i, j - 1])
        # Copy the update back and reset the scratch array
        for i in range(1, row - 1):
            for j in range(1, col - 1):
                u[i, j] = tempU[i, j]
                tempU[i, j] = 0.0


import numpy as np
cimport numpy as np

def cy_update(np.ndarray[double, ndim=2] u, np.ndarray[double, ndim=2] tempU):
    cdef unsigned int i, j
    for i in xrange(1, u.shape[0] - 1):
        for j in xrange(1, u.shape[1] - 1):
            tempU[i, j] = u[i, j] + 0.1 * (u[i + 1, j] - 2*u[i, j] + u[i - 1, j] +
                                           u[i, j + 1] - 2*u[i, j] + u[i, j - 1])

    for i in xrange(1, u.shape[0] - 1):
        for j in xrange(1, u.shape[1] - 1):
            u[i, j] = tempU[i, j]
            tempU[i, j] = 0.0


def diffuseNumPy(cen, cenXM, cenXP, cenYM, cenYP):
    # cen and the four offset views cover the interior of the grid
    return cen + mu * (cenXP - 4.0*cen + cenXM + cenYP + cenYM)
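The offset views passed to diffuseNumPy can be built with basic slicing. A minimal sketch follows; the variable names match the function signature, but the driver loop is an assumption about how the benchmark calls it:

import numpy as np

mu = 0.1
u = np.zeros((256, 256))
u[128, 128] = 1000.0       # hypothetical point source in the center

cen   = u[1:-1, 1:-1]      # interior points
cenXM = u[:-2,  1:-1]      # shifted to (i-1, j)
cenXP = u[2:,   1:-1]      # shifted to (i+1, j)
cenYM = u[1:-1, :-2]       # shifted to (i, j-1)
cenYP = u[1:-1, 2:]        # shifted to (i, j+1)

for _ in range(1000):
    # The right-hand side is evaluated into a temporary array
    # before assignment, so updating through the views is safe.
    u[1:-1, 1:-1] = diffuseNumPy(cen, cenXM, cenXP, cenYM, cenYP)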


[Plot: average runtime vs. array size for NumPy, Cython, and Numba]

As you can see from the above plot, NumPy doesn't do too well overall and starts to choke once the NxN arrays pass 2048x2048. While using NumPy is effectively using a compiled library, the temporary array creation required for such a calculation yields diminishing returns as the memory size increases. I do like the simplicity of the NumPy functions, but Cython and Numba, on the other hand, execute several times faster (4-5x at the largest sizes, per the timings below). Let's re-plot without NumPy and zoom in to see any detail we might be missing.

[Plot: average runtime vs. array size for Cython and Numba only]

NumpyTimes (columns: N, mean time in seconds, standard deviation):
array([[ 1.00000000e+00, 3.60116005e-02, 2.28695327e-04],
[ 2.00000000e+00, 3.80225182e-02, 1.33358063e-04],
[ 4.00000000e+00, 6.97188377e-02, 2.11547181e-04],
[ 8.00000000e+00, 7.22037792e-02, 5.63207113e-04],
[ 1.60000000e+01, 8.22724342e-02, 5.73320663e-04],
[ 3.20000000e+01, 9.85260487e-02, 2.79729861e-04],
[ 6.40000000e+01, 1.70972776e-01, 2.31113206e-04],
[ 1.28000000e+02, 4.69114256e-01, 8.13398566e-04],
[ 2.56000000e+02, 3.16633196e+00, 5.24392911e-02],
[ 5.12000000e+02, 1.88155525e+01, 5.36689859e-02],
[ 1.02400000e+03, 8.61007920e+01, 3.23209449e-01],
[ 2.04800000e+03, 3.52033693e+02, 1.54928561e+01]])

CythonTimes (columns: N, mean time in seconds, standard deviation):
array([[ 1.00000000e+00, 1.14998817e-03, 1.53380066e-05],
[ 2.00000000e+00, 1.18560791e-03, 2.12711065e-05],
[ 4.00000000e+00, 1.23648643e-03, 2.88474264e-05],
[ 8.00000000e+00, 1.64899826e-03, 1.33451379e-05],
[ 1.60000000e+01, 3.73063087e-03, 8.43932532e-05],
[ 3.20000000e+01, 1.57917976e-02, 2.74106455e-04],
[ 6.40000000e+01, 6.28121853e-02, 2.49787840e-04],
[ 1.28000000e+02, 2.55983782e-01, 1.26665345e-04],
[ 2.56000000e+02, 1.02570543e+00, 4.54792288e-04],
[ 5.12000000e+02, 4.62424855e+00, 3.30890594e-02],
[ 1.02400000e+03, 2.76604324e+01, 1.18625669e-02],
[ 2.04800000e+03, 7.40540400e+01, 1.76062163e-01],
[ 4.09600000e+03, 2.97999561e+02, 6.30987773e-02]])
 
NumbaTimes (columns: N, mean time in seconds, standard deviation):
array([[ 1.00000000e+00, 3.77769470e-03, 2.92913983e-04],
[ 2.00000000e+00, 3.61261368e-03, 1.33212344e-04],
[ 4.00000000e+00, 3.69253159e-03, 9.86298904e-05],
[ 8.00000000e+00, 4.05659676e-03, 1.03543364e-04],
[ 1.60000000e+01, 6.39796257e-03, 4.54201122e-04],
[ 3.20000000e+01, 1.99831486e-02, 4.31255638e-04],
[ 6.40000000e+01, 7.33105659e-02, 1.27010678e-03],
[ 1.28000000e+02, 2.79729033e-01, 1.29021972e-03],
[ 2.56000000e+02, 1.13603315e+00, 2.54988797e-02],
[ 5.12000000e+02, 4.74292784e+00, 2.67772190e-01],
[ 1.02400000e+03, 2.78597854e+01, 9.87763841e-01],
[ 2.04800000e+03, 8.13986894e+01, 4.81671079e-01],
[ 4.09600000e+03, 3.01944600e+02, 1.90879245e+00]])


Wow, Cython and Numba are neck and neck! So why use Numba over Cython? It's a one-line difference. With Numba I can express my calculation in pure Python, and by simply adding a one-line decorator, the same calculation takes only a tiny fraction of the original time. Cython, while still a perfectly reasonable solution, requires a little more lifting, and for some, the type specification can be a source of pain. Additionally, the code listed above is just the pyx file. We still need to define a setup.py file for compilation (a sketch follows) and another file which imports and executes the compiled library. The additional file management and type specification are non-trivial and require a lot of attention.
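For reference, a minimal setup.py for the era's Cython toolchain might look like this; the module name cy_diffusion is an assumption, and any matching pair of file and module names works:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy

setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[Extension('cy_diffusion', ['cy_diffusion.pyx'],
                           include_dirs=[numpy.get_include()])],
)

After running `python setup.py build_ext --inplace`, a driver script can `from cy_diffusion import cy_update` and handle the 1000-iteration loop itself.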

NumbaPro

We've seen that Numba gives us a simple and elegant method of JIT compiling Python for an optimized runtime. While Continuum supports and maintains Numba's open-source effort, we have also extended Numba's capabilities for enterprise customers: enter NumbaPro. With NumbaPro, users and developers can define array-oriented NumPy UFuncs in Python without having to reach for the CPython API.

UFuncs are the core of NumPy's array-oriented, or 'vectorized,' approach to computation. They are elementwise operations (such as arithmetic operators or sin() and cos()). NumbaPro optimizes NumPy's vectorized functionality at runtime and provides directives for multi-core and GPU-enabled hardware.
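To make "elementwise" concrete, here is a small example using only NumPy's built-in ufuncs (nothing NumbaPro-specific):

import numpy as np

a = np.arange(5, dtype=np.float64)   # [0. 1. 2. 3. 4.]

np.sin(a)        # one-input ufunc, applied to each element
np.add(a, 10.0)  # two-input ufunc with broadcasting
a ** 2 + 1       # arithmetic operators dispatch to ufuncs too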

UFuncs are generally regarded as among the more advanced capabilities of NumPy, so I want to demonstrate a relatively straightforward example. In a previous blog post, I showed an example of identifying objects from SDSS. The raw data is an image file (FITS) with additional metadata. Applying an image filter/mapping function is a good example of UFuncs, and logarithmic scaling is often used to display the full dynamic range of astronomical data. I want to measure the time to apply a log filter to each pixel using NumPy, NumbaPro, and NumbaPro's parallel target.

from numpy import log

def logThreshold(img_mat):
    # logarithmic scaling compresses the image's dynamic range
    return 10*log(img_mat + 10) + 10


[Plot: log-filter timings for NumPy, NumbaPro, and NumbaPro parallel]

From the above you can see that NumbaPro is extremely fast compared to NumPy. To use NumbaPro we can still define the function in pure Python, but we also have to add three directives (also in Python) to instruct NumbaPro to compile the functions.

from numpy import log
from numbapro import Vectorize   # import paths are a best guess at the 2012-era API
from numba import f              # 'f': the float type used in the original post

def logThreshold(img_mat):
    return 10*log(img_mat + 10) + 10

# Three directives: construct the vectorizer, register a type
# signature, and build the compiled ufunc.
bv = Vectorize(logThreshold, backend='bytecode')
bv.add(restype=f, argtypes=[f])
numbapro_log = bv.build_ufunc()


def logThreshold(img_mat):
    return 10*log(img_mat + 10) + 10

# Same three directives; target='parallel' spreads the work across cores.
pv = Vectorize(logThreshold, target='parallel', backend='bytecode')
pv.add(restype=f, argtypes=[f])
numbaproll_log = pv.build_ufunc()


Notice that the difference between the single-core and multi-core versions is simply a keyword argument. Pretty simple, no? Once built, the compiled ufuncs are called like any other NumPy ufunc, as in the sketch below.
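A minimal usage sketch, with `img` standing in for the pixel array; loading it from the FITS file (e.g. via pyfits) is an assumption about the original workflow:

import numpy as np

# Stand-in pixel data; in practice the array would come from the FITS file.
img = (np.random.rand(2048, 2048) * 1000.0).astype(np.float32)

out_np  = 10*np.log(img + 10) + 10   # plain NumPy ufuncs
out_pro = numbapro_log(img)          # NumbaPro single-core ufunc
out_pll = numbaproll_log(img)        # NumbaPro parallel ufunc

np.testing.assert_allclose(out_np, out_pro, rtol=1e-5)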

Conclusion

NumbaPro leverages Numba’s speed capabilities but adds needed flexibility for defining UFuncs, and makes it extremely easy to parallelize code. At Continuum, we are determined to take tools traditionally accessible only to experts, improve them, then build an interface which everyone can access. NumbaPro is an example of how a low-level, cutting edge technology like LLVM can be turned into an incredibly powerful tool that is accessible even to novices.

Benchmarking code can be downloaded from GitHub.

Numba is free, open-source, and currently supported and maintained by Continuum Analytics.
