Francesc Alted

BLZ: A data format leveraging the hierarchical memory model

Current computer architectures are quite different from what they were 25 years ago. Many things have changed, but probably the most dramatic difference is the growing gap between CPU and memory speeds, also known as the memory wall. This has had a profound impact on the memory layout of current computers, as can be seen below:

Architecture evolution

As we can see, there is no longer a simple architecture consisting of a single layer of persistent storage (hard disk) and a single layer of non-persistent storage (RAM). Instead, today's computers come with several layers of persistent storage (solid state disk, hard disk and possibly network-attached disks) and several layers of non-persistent storage (L1, L2 and L3 caches and RAM).

In the era of Big Data, it is very important that applications be aware of such a memory hierarchy so that they can efficiently process data exceeding RAM capacities. Doing so is not easy because the different layers have different properties, namely, different access speeds, different capacities, and, most importantly, different access modes.

Blaze tackling Big Data scenarios

Blaze is a library being designed to tackle this problem by handling data resident in the different layers of the memory hierarchy in a transparent way (essentially following the NumPy data access paradigm at the user level). To this end, a new format (internally called BLZ) that is aware of the properties of each memory layer is being implemented for storing data.

In order to better fit the memory hierarchy, the format is itself hierarchically structured. This is achieved by splitting data into chunks of different sizes, depending on the memory layer that is meant to host them most frequently. Each chunk can be compressed, thus reducing the amount of space needed and the bandwidth required to transmit the chunk across the different memory layers.
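The idea behind chunked, compressed storage can be sketched with standard-library tools. This is not BLZ's actual on-disk format: zlib stands in for Blosc here, and the chunk length is illustrative. The point is that reading one element only requires decompressing the chunk that holds it:

```python
# A minimal sketch of chunked, compressed storage.  zlib stands in for Blosc;
# the chunk length and layout are illustrative, not BLZ's real format.
import struct
import zlib

CHUNKLEN = 4096  # elements per chunk (illustrative)
ITEMSIZE = 8     # bytes per float64

def write_chunks(buf, itemsize=ITEMSIZE, chunklen=CHUNKLEN):
    """Split a flat byte buffer into fixed-size chunks, compressing each one."""
    chunk_bytes = chunklen * itemsize
    return [zlib.compress(buf[start:start + chunk_bytes], 5)  # clevel=5
            for start in range(0, len(buf), chunk_bytes)]

def read_element(chunks, index, itemsize=ITEMSIZE, chunklen=CHUNKLEN):
    """Decompress only the chunk holding `index` -- the point of chunking."""
    raw = zlib.decompress(chunks[index // chunklen])
    offset = (index % chunklen) * itemsize
    return struct.unpack('<d', raw[offset:offset + itemsize])[0]

data = struct.pack('<%dd' % 10000, *range(10000))  # 10,000 float64 values
chunks = write_chunks(data)
print(read_element(chunks, 8191))  # -> 8191.0, decompressing one chunk only
```

A random read touches a single compressed chunk instead of the whole buffer, which is what makes the layout friendly to the slower, larger layers of the hierarchy.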

Creating BIG datasets with Blaze

But let’s stop blabbing and create a big array with 8 billion elements with Blaze:

In []: import blaze
 
In []: %time z = blaze.zeros('%d, float64' % 8e9, params=blaze.params(storage='zeros.blz'))
CPU times: user 18.64 s, sys: 2.05 s, total: 20.69 s
Wall time: 21.03 s


The above command has created an array of 64 GB on a machine with only 8 GB of RAM available. Of course, the trick is that the data has gone to persistent storage, not to RAM.

It is also worth noting that the time for generating and storing the whole array is a mere 21 s, which means a speed of about 3 GB/s, which is pretty impressive. This is mainly due to the use of compression (less data to store means higher effective I/O throughput). Of course, to achieve such speeds the compressor has to be extremely fast. Blosc, the compressor used in Blaze, not only compresses very fast, but can also decompress at speeds that can exceed a memcpy() call on modern systems.
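The throughput figure above follows directly from the numbers shown in the %time output:

```python
# Back-of-the-envelope check of the throughput quoted above.
n_elements = 8_000_000_000   # 8 billion float64 values
itemsize = 8                 # bytes per float64
wall_time = 21.03            # seconds, from the %time output

size_gb = n_elements * itemsize / 1e9   # 64.0 GB of logical data
throughput = size_gb / wall_time        # ~3.04 GB/s effective
print(size_gb, round(throughput, 2))
```

Note that this is the *effective* rate for logical data; the bytes actually written to disk are far fewer, since an all-zeros array compresses extremely well.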

But, let’s get back to our dataset and have a look at what has been saved:

$ ls -Fd zeros.blz
zeros.blz/


Okay, so the array has been persisted in the zeros.blz directory, which has been created anew. Let's peek into its contents:

$ ls -F zeros.blz
__attrs__ data/ meta/

$ cat zeros.blz/meta/storage 
{"dtype": "float64", "cparams": {"shuffle": true, "clevel": 5}, "chunklen": 131072, "dflt": 0.0, "expectedlen": 8000000000}
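The storage metadata is plain JSON, so the on-disk chunk geometry can be derived from it directly. Here is a small sketch that parses the file contents shown above; the itemsize of 8 bytes is our assumption, read off the float64 dtype:

```python
# Parse the meta/storage JSON shown above and derive the chunk geometry.
import json

storage = ('{"dtype": "float64", "cparams": {"shuffle": true, "clevel": 5}, '
           '"chunklen": 131072, "dflt": 0.0, "expectedlen": 8000000000}')
meta = json.loads(storage)

itemsize = 8  # bytes per element, from dtype float64 (our assumption)
chunk_bytes = meta["chunklen"] * itemsize               # 1 MiB per chunk
n_chunks = -(-meta["expectedlen"] // meta["chunklen"])  # ceiling division
print(chunk_bytes, n_chunks)  # -> 1048576 61036
```

So each chunk holds 131072 float64 values, i.e. exactly 1 MiB of uncompressed data, and the 8-billion-element array needs 61036 of them.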


Now, let’s have a look at data contents:

$ ls zeros.blz/data/
[clip]
__13433.blp __16868.blp __202.blp __23734.blp __27168.blp __30600.blp __34034.blp __37469.blp __40901.blp __44335.blp __4776.blp __51201.blp __54636.blp __5806.blp __6570.blp


It turns out that Blaze stores data in what we call chunks and super-chunks. Each of the .blp files above is a so-called super-chunk, and they follow the open Bloscpack format. In turn, each of these super-chunks can host many chunks, which follow the Blosc (the compressor) format. Finally, each chunk is composed of several so-called blocks, which are the smallest data buckets that can be compressed and decompressed independently of the others.

BLZ layout

Super-chunks, chunks and blocks have different sizes that adapt to the different layers of the memory hierarchy. For example, a super-chunk typically has a size optimized for a cache that may live on SSDs, while helping reduce the number of inodes and other sources of filesystem overhead. In turn, the chunk size is meant to be efficient for a cache in RAM and for optimizing I/O to persistent storage. Finally, the block size is chosen so that it fits in either the L1 or L2 cache, giving the different Blosc threads exclusive access for compressing and decompressing several blocks in parallel, or allowing a single block to be decompressed on its own when only a few elements of the chunk are required.
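To make the sizing concrete, here is an illustrative calculation, not BLZ's actual defaults: assuming a 256 KiB L2 cache and the 131072-element chunk length from the metadata above, one can work out the block size and the parallelism available per chunk:

```python
# Illustrative block/chunk sizing; the L2 size is an assumption, not a BLZ default.
L2_CACHE = 256 * 1024   # bytes; a common L2 cache size (assumption)
ITEMSIZE = 8            # bytes per float64

block_elems = L2_CACHE // ITEMSIZE          # elements per cache-fitting block
chunklen = 131072                           # from the metadata shown earlier
blocks_per_chunk = chunklen // block_elems  # blocks Blosc threads can work on
print(block_elems, blocks_per_chunk)        # -> 32768 4
```

Under these assumptions each 1 MiB chunk decomposes into 4 cache-sized blocks, so up to 4 Blosc threads could compress or decompress a chunk concurrently, each staying within its own L2-sized working set.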

Operating with Blaze objects

Storing data is only part of the equation; performing calculations on this data (for example, using the NumPy paradigm) is much more powerful. This is exactly what we are working on right now in Blaze. For this we will use compiler technologies (including those used in Numba) to implement computational kernels that are aware of the BLZ data layout, allowing us to perform out-of-core (that is, out-of-RAM) computations in an extremely efficient way. Here is a diagram of how that will work:

BLZ OOC
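The access pattern such a kernel follows can be sketched by hand: stream one chunk at a time through RAM so the working set stays small, never materializing the full dataset. This is a hand-rolled illustration of the pattern, not Blaze's actual kernel machinery:

```python
# Out-of-core reduction sketch: only one chunk is ever resident in RAM.
# Hand-rolled illustration, not Blaze's actual kernel machinery.
import os
import struct
import tempfile

CHUNKLEN = 1024  # elements read per iteration (illustrative)

def chunked_sum(path, chunklen=CHUNKLEN):
    """Sum a file of little-endian float64 values, one chunk at a time."""
    total = 0.0
    chunk_bytes = chunklen * 8
    with open(path, 'rb') as f:
        while True:
            raw = f.read(chunk_bytes)  # only this chunk is in memory
            if not raw:
                break
            total += sum(struct.unpack('<%dd' % (len(raw) // 8), raw))
    return total

# Write 100,000 float64 ones to disk, then sum them without loading them all.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    for _ in range(100):
        f.write(struct.pack('<1000d', *([1.0] * 1000)))
result = chunked_sum(path)
print(result)  # -> 100000.0
os.remove(path)
```

A real BLZ-aware kernel would additionally decompress each chunk (and hand its blocks to parallel threads), but the memory profile is the same: bounded by the chunk size, not the dataset size.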

As soon as these pieces start working together, a very solid foundation will be ready for building optimized out-of-core computational modules. We appreciate any help from the open source community in making this vision a reality.

Our plans for the future

Implementing a new format for persistence is something to take seriously. There are plenty of requirements to fulfill, and performance, although important, is not our main priority. At Continuum we strive to provide a format that is:

  1. Open. A serious data format has to be as open and documented as possible.

  2. Stable. We never want a user to write a dataset with one version and then discover that it cannot be read with future versions of Blaze.

  3. Reliable. We would like to offer safety features so that when, say, a process dies in the middle of a write operation, the data saved up to that point is still recoverable.

  4. Scalable. Initially our priority is to define a format that works well on a single filesystem (such as that used by a single node, or by multiple nodes talking to a filer). However, we have an eye on allowing the format to distribute chunks, or (more likely) super-chunks, across different nodes, with Blaze scheduling code close to this data in the most efficient way.

If you want to know more about the BLZ format you can find more info at: http://blaze.pydata.org/docs/persistence.html

You can find out more about the other open source projects Continuum Analytics is actively developing at: http://www.continuum.io/developer-resources.html

Tags: Blaze Python