Stephen Diehl

A Python Compiler for Big Data

Blaze is the next generation of NumPy, Python’s extremely popular array library. At Continuum Analytics we aim to tackle some of the hardest problems in large data analytics with our Python stack of Numba and Blaze. Together they will form the basis of a distributed computation and storage system that can also generate optimized machine code specialized to the data being operated on.

Blaze aims to extend the structural properties of NumPy arrays to a wider variety of table and array-like structures that support commonly requested features such as missing values, type heterogeneity, and labeled arrays.

Unlike NumPy, Blaze is designed to handle out-of-core computations on large datasets that exceed the system memory capacity, as well as on distributed and streaming data. Blaze is able to operate on datasets transparently as if they behaved like in-memory NumPy arrays.
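
As a rough illustration of what out-of-core evaluation means, the sketch below computes a mean while holding only one chunk of the data in memory at a time. The helper names `chunks` and `chunked_mean` are hypothetical and are not part of Blaze; `range()` merely stands in for a dataset streamed from disk or over the network.

```python
from itertools import islice


def chunks(source, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(source)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def chunked_mean(source, size=4):
    """Compute a mean while holding only one chunk in memory at a time."""
    total, count = 0, 0
    for block in chunks(source, size):
        total += sum(block)
        count += len(block)
    return total / count


# range() stands in for a dataset too large to load at once (an assumption
# made purely for illustration).
mean = chunked_mean(range(1, 11), size=4)
```

The caller only ever sees the final value, which is the sense in which a chunked computation can look like an ordinary in-memory one.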

We aim to allow analysts and scientists to productively write robust and efficient code, without getting bogged down in the details of how to distribute computation, or worse, how to transport and convert data between databases, formats, proprietary data warehouses, and other silos.

Graph

The core mode of operation for Blaze is the construction of lazy expression graphs, much in the style of Theano. A graph is constructed over nodes, each corresponding to a source of data or a ByteProvider. The behavior is similar to an ORM in that operations over the objects don’t correspond to immediate computations but instead construct the query or execution plan over the data.

Most importantly, the data in Blaze can be imported from a wide variety of sources including on-disk arrays. Together with IOPro, we aim to be able to import data from CSV, Amazon S3, and SQL Databases as seamlessly as if they were local files.

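# Each open() call refers to a data source: a file, a SQL source,
# or a source opened with an explicit datashape.
a = open('quarter_numbers.hdf')
b = open('sql://measurements')
c = open('mydata3', dshape('10, 10, int32'))

# Arithmetic on these objects builds an expression graph; nothing is computed here.
e = a + b * c
# eval() evaluates the graph and produces the result.
e.eval()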


These operations construct a graph representation of the expression, which can then be evaluated with eval to produce immediate results.
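
To make the deferred-evaluation idea concrete, here is a minimal sketch of a lazy expression graph built with operator overloading. It is not Blaze's implementation: the class names `Node`, `Leaf` and `Op` are hypothetical, and plain Python lists stand in for real data sources.

```python
import operator


class Node:
    """A node in a lazy expression graph (illustrative, not Blaze's API)."""

    def __add__(self, other):
        return Op(operator.add, self, other)

    def __mul__(self, other):
        return Op(operator.mul, self, other)


class Leaf(Node):
    """A data source; an in-memory list stands in for a real provider."""

    def __init__(self, values):
        self.values = list(values)

    def eval(self):
        return self.values


class Op(Node):
    """An elementwise binary operation whose evaluation is deferred."""

    def __init__(self, fn, left, right):
        self.fn, self.left, self.right = fn, left, right

    def eval(self):
        # Work happens only here, by walking the graph recursively.
        lhs, rhs = self.left.eval(), self.right.eval()
        return [self.fn(x, y) for x, y in zip(lhs, rhs)]


a = Leaf([1, 2, 3])
b = Leaf([10, 20, 30])
c = Leaf([2, 2, 2])

e = a + b * c      # builds Op(add, a, Op(mul, b, c)); nothing is computed yet
result = e.eval()  # [21, 42, 63]
```

Because `e` is only a description of the computation, a real system is free to inspect, rewrite or schedule it before any data is touched.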

Types

Blaze introduces a richer grammar for describing the structural and value type properties of data. We call this description the “datashape” of the data points, and it forms a superset of NumPy’s dtype and shape descriptors.
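
As an illustration of how a single datashape string can carry both the shape and the element type, the sketch below splits strings of the form used in this post (such as '10, 10, int32') into the two pieces NumPy keeps separately. The `DataShape` type and `parse_dshape` function are hypothetical; the actual datashape grammar is richer than this and is not reproduced here.

```python
from typing import NamedTuple, Tuple


class DataShape(NamedTuple):
    """Hypothetical parsed form of a datashape string."""
    shape: Tuple[int, ...]   # leading fixed dimensions
    dtype: str               # trailing element type, e.g. "int32"


def parse_dshape(text: str) -> DataShape:
    """Split 'dim, dim, ..., type' into dimensions and an element type.

    Handles only fixed integer dimensions followed by one type name.
    """
    parts = [p.strip() for p in text.split(",")]
    *dims, dtype = parts
    return DataShape(tuple(int(d) for d in dims), dtype)


ds = parse_dshape("10, 10, int32")
```

Here `ds.shape` plays the role of a NumPy shape tuple and `ds.dtype` the role of a dtype name.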

Once a graph is evaluated, Blaze attempts to gather all the type information and metadata available from the user input to inform better computation selection and scheduling. The compiler converts expression graph objects into an intermediate form called ATerm, drawn from the StrategoXT project. This intermediate form is roughly a subset of Python expressions but allows the explicit annotation of type and metadata information directly on the AST. The ATerm IR forms the meeting point where Numba and Blaze come together to perform code generation and graph rewriting, producing more efficient kernels.

Arithmetic(
Add
, Array(){dshape("3, int64")}
, Array(){dshape("3, int64")}
){dshape("3, int64")}


Expressions that are not explicitly typed need to have their types inferred from their usage across the entire graph or determined at runtime. The core libraries of Blaze will be explicitly annotated with type information so that, together with the type signatures of the operators and functions in question, we can use Milner-style type inference to let the end user omit explicit type declarations as much as possible.
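
The sketch below shows the general idea of deriving an annotation for an untyped node from its typed operands. It is deliberately much simpler than Milner-style inference: types are propagated bottom-up from annotated leaves, with no type variables or unification. The classes `Array` and `Add`, the functions `infer` and `render`, and the assumed signature (T, T) -> T for addition are all hypothetical.

```python
class Array:
    """Leaf with an explicit datashape annotation (a plain string here)."""

    def __init__(self, dshape):
        self.dshape = dshape


class Add:
    """Binary node with no annotation; its datashape must be inferred."""

    def __init__(self, left, right):
        self.left, self.right = left, right


def infer(node):
    """Return the datashape of `node`, propagating types from the leaves up."""
    if isinstance(node, Array):
        return node.dshape
    left, right = infer(node.left), infer(node.right)
    # Assumed signature for Add: (T, T) -> T, so both operands must agree.
    if left != right:
        raise TypeError(f"cannot add {left!r} and {right!r}")
    return left


def render(node):
    """Print the graph in a form loosely modeled on the annotated terms above."""
    if isinstance(node, Array):
        return f'Array(){{dshape("{node.dshape}")}}'
    args = ", ".join(render(n) for n in (node.left, node.right))
    return f'Arithmetic(Add, {args}){{dshape("{infer(node)}")}}'


expr = Add(Array("3, int64"), Array("3, int64"))
annotated = render(expr)
```

Only the leaves carry declared types; the annotation on the `Add` node is derived, which is the sense in which the end user can leave it out.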

Runtime

Once an efficient execution plan is generated, it is executed by the Blaze runtime. Because our implementation does not explicitly depend on Python, we are able to overcome many of the shortcomings of the Python runtime: we can run without the GIL and use real threads to dispatch custom Numba kernels that run at near-C speed.

One of the primary complaints about NumPy is the inability to mitigate the effects of temporaries and the roundtrips between Python and NumPy. With Blaze we are able to fuse the entire execution into a single dispatch, which is more efficient than the equivalent sequence of ufunc calls and the allocation of temporaries in Python space.
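
The difference between the two evaluation strategies can be sketched in plain Python for the expression a + b * c. This shows only the structure of the computation, not its performance, and the function names `unfused` and `fused` are hypothetical.

```python
def unfused(a, b, c):
    """Evaluate a + b * c one operation at a time, as chained ufuncs would."""
    tmp = [y * z for y, z in zip(b, c)]       # temporary holding b * c
    return [x + t for x, t in zip(a, tmp)]    # second pass over the data


def fused(a, b, c):
    """Evaluate a + b * c in one pass with no intermediate list."""
    return [x + y * z for x, y, z in zip(a, b, c)]


a, b, c = [1, 2, 3], [10, 20, 30], [2, 2, 2]
assert unfused(a, b, c) == fused(a, b, c) == [21, 42, 63]
```

The fused version makes a single pass and allocates only the output, which is the effect a compiler can achieve once it sees the whole expression graph rather than one operation at a time.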

In addition to faster serial execution, our proprietary products such as NumbaPro will be capable of mapping computations onto a variety of modern hardware such as GPUs, using more sophisticated parallelization techniques to further increase the performance of Blaze computations.

Conclusion

One can think of Blaze and Numba as being two complementary parts of the plan to bring Python into the large data analytics world. Together, Blaze and Numba form a compiler-like infrastructure with Blaze as the type system and symbol table to complement Numba’s code generation.

Tags: Blaze Python