IOPro provides a version of pyodbc containing extra methods that load data directly into NumPy ndarrays. With IOPro, queries execute several times faster, and memory usage can be 10x-16x lower than with default pyodbc and pandas methods. In this blog post I will present graphs and figures showing improved execution time and memory usage from the use of these new methods.
First, I will provide a brief introduction of IOPro.pyodbc extensions. After that, I will introduce the test environment. Then, I will continue by showing the results obtained in both execution time and memory usage, ending the blog entry with some conclusions.
Overview of IOPro.pyodbc Extensions
IOPro.pyodbc extends the Cursor object to offer two alternative methods to fetchmany/fetchall:
- fetchdictarray returns a dictionary of ndarrays, with one entry per column. Results appear in-memory column-wise. It can be seen as a Structure of Arrays (SoA), although dictionary of arrays would be more appropriate in this case.
- fetchsarray returns a structured array, with a composite dtype containing one field per column. Results appear in-memory row-wise. It can be seen as an Array of Structures (AoS).
Both functions can be used to fetch any number of rows (by passing the row count as an argument), or default to fetch all the resulting rows. For a more detailed description, you can check the documentation.
I’ve run all the tests with OS X 10.8.4 on a machine with 32 GiB RAM, using a conda environment with Python 2.7.5, NumPy 1.7.1 and pandas 0.11.0. The DBM used is Postgres 220.127.116.11 (13) through Postgress.app, running locally in the same machine. The driver manager is iODBC and the ODBC driver used is psqlodbcw.
The database itself is pretty simple, a table created like this:
The table is populated with 100,000,000 rows (that is 10ˆ8). The size of the query will be limited in the query itself using LIMIT. The query looks like this:
All the tests have been run independently, by using the same script. The functions look very similar. Here is sample code for one of the functions:
The tests run are called:
- fetchall: pyodbc fetchall method
- fetchdictarray: IOPro.pyodbc fetchdictarray method
- fetchsarray: IOPro pyodbc fetchsarray method
- pandas: pandas sql.read_frame, returns a DataFrame
- dictarray_pandas: IOPro.pyodbc fetchdictarray plus conversion using pandas.DataFrame.from_dict, returns a DataFrame
- sarray_pandas: IOPro.pyodbc fetchsarray returning a DataFrame (Conversion is performed by pandas.DataFrame constructor)
Note that pandas.sql.read_frame, at least in the version used, uses fetchall internally.
So, how much faster are the functions? In order to measure this, I ran tests using different query sizes, and timed using time.time. Here is a graph of the results:
IOPro.pyodbc fetchdictarray and fetchsarray are much faster than the fetchall. Take into account that the underlying database operation is still performed, and that the same amount of data is being sent from the database to the script, so all the extra time was due to the inefficient container for the result.
By taking a look at a log log version, it is easy to see how the speed improvement is a constant factor of the number of rows. This is made evident as the number of rows grows:
As the number of rows grows, the lines appear almost parallel. However, notice that at the end of the graph pandas starts to diverge a bit.
The following bar graph makes it easier to grasp the time difference in execution time for the different methods:
In this last graph it is clear that all tests based on IOPro.pyodbc perform way better than fetchall and pandas.sql.read_frame. It is also evident that all of IOPro.pyodbc take about the same time, including the ones that convert the result to a DataFrame. Conversion from IOPro.pyodbc results to a DataFrame is highly efficient.
To put some numbers, and not taking into account the edge cases, the test shows that IOPro.pyodbc executes in around 45% of the time compared to fetchall. Compared to pandas.sql.read_frame, IOPro.pyodbc takes around 35% of the time.
Note that there is a base time that is the actual time taken by the ODBC driver to perform the queries and get the results, so the speed-up will depend on the quality of the ODBC driver.
Memory footprint of results
In many cases execution time is what is seen as most important performance-wise. However, when memory usage becomes an issue, execution time can go through the roof. When extracting huge datasets memory usage can be a real issue.
Using memory_profiler I will try to extract some useful information about the memory footprint of the different tests. Bear in mind that memory profiling is a tricky business and it is harder to get accurate information on memory usage. All tests are run on 100,000,000 rows.
Note that line_profiler reports memory as MB, when in fact it means MiB. I’ve kept the dumps from the line_profiler intact.
Two things of note here:
- Explicitly deleting the cursor returns quite a bit of memory. This is probably due to some buffers in the ODBC driver being released as soon as the cursor is freed.
- Memory increases by about 13 GiB from the start to the end of the function. From the database schema, 800 MiB (a bit less actually) should be expected (integer and float(23) should take 4 byte each as defined by the postgres documentation). That’s nearly a 16x overhead!
The result size here is still quite large, although it’s better than fetchall. It still weights in at a hefty 8.5 GiB, which is roughly a 10x overhead.
In this case, memory usage for the results comes really close to what we expect, as we get around 870MiB process memory footprint increase at exit, which is more acceptable.
I will skip the line_profile for the other IOPro.pyodbc tests as they are largely similar.
Memory usage during execution
By using the line_profiler, the results are limited to memory usage at a line granularity. That is not good enough, as any memory allocated and released inside a function will not be tracked at all. For example, we know that pandas.sql.read_frame uses fetchall internally, but with the data of the line profiler we could deduce that fetchall has a higher peak memory usage. That can’t be true.
In order to get significative numbers for memory usage during the execution, I turn to memory_profiler‘s function memory_usage. This function allows to periodically poll the memory usage of the current process whilst our test is run.
Running the tests under memory_usage I got the following results:
From this graph we can see how close pandas and fetchall behave until fetchall starts releasing memory. At that point, pandas grows again (probably during data conversion) to reach a significant higher memory usage point. This huge memory usage may be the cause why pandas started to degrade performance as the number of rows approached its maximum.
On the other hand, the IOPro.pyodbc methods keep memory usage much lower. Conversion to a pandas DataFrame doesn’t have a cost memory-wise because pandas uses the memory of the Numpy array to build the DataFrame, no copy performed. This is also why the conversion to a DataFrame was so efficient in time.
Just note that compared to fetchsarray, fetchdictarray is slightly more efficient memory-wise. As you can see there is a slight spike in memory usage in fetchsarray tests. This is due to some internal conversions.
Keep in mind that this is memory usage for the whole process, including the buffers the ODBC driver uses internally.
IOPro is much faster and requires much less memory than standard pyodbc/pandas methods, for the same task and dataset size. The data is still returned in a traditional NumPy format, and should be backwards compatible with your existing Pandas or NumPy code. The reduced memory footprint is a major feature, because now you can execute much larger queries than before, or execute the same queries on a computer with much less memory. If you use Python on the server to run analytics, it means you can fit more parallel processes on the same box.
All functions in IOPro.pyodbc perform about the same. However fetchdictarray has a lower memory peak than fetchsarray. Prefer the former to the later unless the memory layout of sarray is more appropriate for your analysis. In general, fetchdictarray results can be considered a Structure of Arrays (in fact, a dictionary of arrays), while fetchsarray results are like an Array of Structures.
Also of note is that cursors may use a significant amount of memory, so if you are memory constrained, make sure that the cursor is released as soon as it is no longer needed.
You can try IOPro for 30 days for free. If you need longer time to evaluate, please contact our sales team or purchase IOPro from our online store. As with all our commercial offerings, IOPro is free for academics and students. Documentation can be found here. Remember, by purchasing our software, you are helping to support all of our Open Source efforts.Tags: odbc IOPro pandas performance dataframe NumPy comments powered by Disqus