Joseph W. Richards, Ph.D.

wiseRF Use Cases and Benchmarks

In a previous post, we announced the partnership between Continuum Analytics and wise.io to bring fast and memory-efficient machine learning to data scientists and programmers. wiseRF, the implementation of the Random Forest algorithm from wise.io, is currently available on Anaconda. As you’ll see, even this entry tier of wiseRF outperforms the other random forest implementations available in the wild.

In this follow-up post, we show how easy it is to use wiseRF within Python to learn a prediction model on data and generate predictions for future data. In a few lines of code, wiseRF can be deployed to answer questions about complex, noisy, and big data.

We also benchmark the performance of wiseRF on two different data sets, demonstrating that it enjoys an order-of-magnitude advantage in speed over the random forest implementation of scikit-learn in training. This allows data scientists to build workflows that:

  • search for and find the optimal prediction model in an order of magnitude less time,
  • re-fit a prediction model more frequently on streaming data to get the most up-to-date insight into their data, and
  • train random forest models on extremely large data sets where other methods fail.

How to use wiseRF

Here, we demonstrate how to use wiseRF to train a classifier on R.A. Fisher’s famous Iris data set and to use that classifier to predict the label (type of Iris) for each new iris from the lengths and widths of the sepal and petal of each flower. The challenge in this problem is to discover the appropriate boundaries between the three different iris species in the 4-dimensional data. The Iris data is a tiny dataset so we use it here to show the basic functionality and the baseline improvements that you can achieve with wiseRF.

In this demo, we are using scikit-learn version 0.12.1 and wiseRF version 1.1. First we load the iris data set from scikit-learn and split it into a random 90% training set and 10% testing set:

import random
import numpy as np
from sklearn.datasets import load_iris

# Load the data. Sklearn has some convenient methods for this.
data = load_iris()
inds = np.arange(len(data.data))

# Make a synthetic 90% training / 10% testing set
test_i = random.sample(xrange(len(inds)), int(0.1*len(inds)))
train_i = np.delete(inds, test_i)
print "%d instances in training set, %d in test set" \
    % (len(train_i), len(test_i))

# The training and testing features (X) and classes (y)
X_train = data.data[train_i,:]
y_train = data.target[train_i]
X_test = data.data[test_i,:]
y_test = data.target[test_i]
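
Fixing the random seed makes the split repeatable, which helps when comparing timings across runs. The following is a minimal sketch using only the standard library, written for Python 3 (the snippets in this post use Python 2 syntax). The helper name `split_indices` and the seed value are illustrative and not part of wiseRF or scikit-learn.

```python
import random

def split_indices(n_samples, test_fraction=0.1, seed=0):
    """Return (train_indices, test_indices) for a random hold-out split."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    n_test = int(test_fraction * n_samples)
    test_idx = sorted(rng.sample(range(n_samples), n_test))
    held_out = set(test_idx)
    train_idx = [i for i in range(n_samples) if i not in held_out]
    return train_idx, test_idx

# The Iris data has 150 rows, so a 10% hold-out gives 135 / 15.
train_i, test_i = split_indices(150)
print("%d instances in training set, %d in test set" % (len(train_i), len(test_i)))
```

The returned index lists can be used to index NumPy arrays in the same way as `train_i` and `test_i` above.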

Now, we can fit a wiseRF random forest model on the training set with a few simple lines of code:

import PyWiseRF
from PyWiseRF import WiseRF
 
# Build a 10-tree classifier and fit it on the training set with WiseRF
rf = WiseRF(n_estimators=10)
rf.fit(X_train, y_train)

Average training time = 1.21 ms. In comparison, scikit-learn random forest takes 6.24 ms on a single core.

Once we have fit the model, we can easily predict on the testing data and evaluate the predictive performance of the wiseRF classifier:

# predict classes for the testing data
ypred_test = rf.predict(X_test)
# evaluate accuracy of the classifier over the testing data
print "Accuracy score: %0.2f" % rf.score(X_test, y_test)

Accuracy score = 1.00, meaning that 100% of the testing classifications are correct.
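
Assuming the score is plain classification accuracy, as the sentence above implies, it is the fraction of test instances whose predicted label matches the true label. Here is a minimal sketch of that calculation in Python 3, using made-up label lists rather than real wiseRF output. The function name `accuracy` is illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty test set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels for a 15-flower test set with one mistake.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
print("Accuracy score: %0.2f" % accuracy(y_true, y_pred))
```

With only 15 test flowers, a single misclassification moves the score by almost 7 percentage points, so small differences in accuracy on this data set should not be over-interpreted.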

To take advantage of multiple cores in training the wiseRF model, simply specify the n_jobs keyword when constructing the WiseRF object. Setting n_jobs = -1 uses all available cores.

# fit a 1000-tree random forest on 1 core
rf = WiseRF(n_estimators=1000, n_jobs=1)
rf.fit(X_train, y_train)

# fit a 1000-tree random forest on multiple cores
rf_multi = WiseRF(n_estimators=1000, n_jobs=10)
rf_multi.fit(X_train, y_train)

  • Compute time on a single core = 122.12 ms (555.41 ms in scikit-learn)
  • Compute time on multiple cores = 59.24 ms (1217.49 ms in scikit-learn with n_jobs = 10)

Note that for multiple cores, we used 4 hyperthreaded cores (8 virtual cores), employing the common practice of using additional threads to increase CPU utilization.
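
To see how much each implementation gains from the extra cores, divide its single-core time by its multi-core time. The sketch below (Python 3) does this for the four timings reported above. The dictionary layout and the function name `parallel_speedup` are illustrative.

```python
# Timings in milliseconds, copied from the 1000-tree Iris runs above.
timings_ms = {
    "wiseRF": {"single": 122.12, "multi": 59.24},
    "scikit-learn": {"single": 555.41, "multi": 1217.49},
}

def parallel_speedup(single_ms, multi_ms):
    """How many times faster the multi-core run is than the single-core run."""
    return single_ms / multi_ms

for name, t in timings_ms.items():
    factor = parallel_speedup(t["single"], t["multi"])
    print("%-12s multi-core speedup: %.2fx" % (name, factor))
```

A ratio below 1 means the multi-core run was slower than the single-core run, which is what the reported scikit-learn numbers show on this tiny data set.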

Benchmarks: wiseRF versus scikit-learn

We use a slightly larger data set to compare the performance of wiseRF to scikit-learn. The MNIST Handwritten Digits data set consists of 70,000 pixelated images of handwritten digits, from 0 through 9, each image measuring 28-by-28 pixels. The classification goal is to predict the true digit from the raw pixel values of an image.

To perform the comparison, we use 63,000 images as training data and a random 7,000 as testing data.

import random
import numpy
from sklearn.datasets import fetch_mldata

mnist = fetch_mldata('MNIST original')

# Define training and testing sets
inds = numpy.arange(len(mnist.data))
test_i = random.sample(xrange(len(inds)), int(0.1*len(inds)))
train_i = numpy.delete(inds, test_i)

X_train = mnist.data[train_i].astype(numpy.double)
y_train = mnist.target[train_i].astype(numpy.double)

X_test = mnist.data[test_i].astype(numpy.double)
y_test = mnist.target[test_i].astype(numpy.double)

We time the whole process of training the random forest on the 63k training digits and predicting (and returning an accuracy score) on the 7k testing digits. We do this both for scikit-learn and wiseRF, first for a single core:

# scikit-learn single core, MNIST data
from time import time
from sklearn.ensemble import RandomForestClassifier
 
t1 = time()
 
rf = RandomForestClassifier(n_estimators=10, n_jobs=1)
rf.fit(X_train, y_train)
 
score = rf.score(X_test, y_test)
 
t2 = time()
dt = t2-t1
print "Accuracy: %0.2f\t%0.2f s" % (score, dt)

scikit-learn: Accuracy = 95%, Total training & prediction time = 121.14 s

# wiseRF single core, MNIST data
t1 = time()
rf = WiseRF(n_estimators=10, n_jobs=1)
rf.fit(X_train, y_train)
 
score = rf.score(X_test, y_test)
 
t2 = time()
dt = t2-t1
print "Accuracy: %0.2f\t%0.2f s" % (score, dt)

wiseRF: Accuracy = 94%, Total training & prediction time = 16.89 s

On a single core, wiseRF enjoys a factor of 7 boost in speed over scikit-learn with a comparable accuracy.

On 4 hyperthreaded cores (8 virtual cores), the performance metrics are:

  • scikit-learn: Accuracy = 95%, Total training & prediction time = 49.51 s
  • wiseRF: Accuracy = 94%, Total training & prediction time = 6.61 s

giving wiseRF a 7.5x advantage in speed over scikit-learn.
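
The fit-then-score timing pattern used in the benchmark snippets above can be wrapped in a small helper so that both implementations are measured in exactly the same way. This is a sketch in Python 3, not the code used to produce the numbers in this post. It assumes only that the model exposes `fit` and `score` methods, as both classifiers above do. `time_fit_and_score` and the stand-in `MajorityClassifier` are hypothetical names introduced for illustration.

```python
from time import perf_counter

def time_fit_and_score(model, X_train, y_train, X_test, y_test):
    """Fit, then score, and return (score, elapsed_seconds) for the whole run."""
    start = perf_counter()
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    return score, perf_counter() - start

class MajorityClassifier(object):
    """Stand-in model for the demo: always predicts the most common training label."""
    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def score(self, X, y):
        return sum(1 for label in y if label == self.label_) / len(y)

# Tiny made-up data, just to exercise the helper.
score, dt = time_fit_and_score(
    MajorityClassifier(), [[0], [1], [2]], [1, 1, 0], [[3], [4]], [1, 0]
)
print("Accuracy: %0.2f\t%0.2f s" % (score, dt))
```

Passing a real random forest object in place of the stand-in would time training and scoring together, as in the measurements reported here.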

Conclusion

For the two problems shown above, wiseRF is roughly 4.5x to 20x faster than scikit-learn’s random forest in the timings reported here, with the factor depending on the number of trees and number of cores used for training. For the MNIST data, wiseRF on a single core is about 3x faster than scikit-learn on 8 virtual cores. Additionally, wiseRF shares the data set amongst all cores, so it has 1/8th of the memory requirement of scikit-learn on an 8-core machine.
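
The speedup factors can be checked directly from the timings reported in this post. Here is a small sketch in Python 3. The list layout and labels are only for illustration.

```python
# (label, scikit-learn time, wiseRF time), values as reported in this post.
reported = [
    ("Iris, 10 trees, 1 core (ms)", 6.24, 1.21),
    ("Iris, 1000 trees, 1 core (ms)", 555.41, 122.12),
    ("Iris, 1000 trees, multi-core (ms)", 1217.49, 59.24),
    ("MNIST, 10 trees, 1 core (s)", 121.14, 16.89),
    ("MNIST, 10 trees, 8 virtual cores (s)", 49.51, 6.61),
]

# Speedup = scikit-learn time divided by wiseRF time for the same setup.
speedups = [(label, sk / wise) for label, sk, wise in reported]
for label, factor in speedups:
    print("%-38s %5.1fx" % (label, factor))

# wiseRF on one core versus scikit-learn on 8 virtual cores (MNIST).
cross = 49.51 / 16.89
print("wiseRF 1 core vs scikit-learn 8 virtual cores: %.1fx" % cross)
```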

In a future post, we will detail the memory efficiency of wiseRF, and demonstrate that it can train on REALLY big data sets where other random forest implementations fail. This power to train on tremendously large data sets gives you the ability to extract and use the information encoded in ALL of your data. To train on larger data sets, wise.io offers a version of wiseRF that is not limited in any way. With WiseRF Oak, you can build classifiers on millions of instances on your ultrabook and scale to 64+ core machines in the cloud. See our website or contact us to find out more.
