knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(blbmethods)
Suppose we have a massive, high-dimensional dataset. Working on the whole dataset at once is not practical, and MapReduce alone does not give us sufficiently accurate results. Hence we turn to the bag of little bootstraps (BLB).
This package provides BLB implementations for:
Linear Models
Random Forests
Logistic Regression
This vignette explains how the BLB algorithm is implemented for each model, documents how to use the provided functions, and presents performance benchmarks for the package's multicore functions and C++ code.
The first BLB implementation of a statistical model provided by this package is for linear models. The idea is to divide a regression dataset into m subsets and, from each subset, fit B bootstrap models. Point estimates are found by reducing (averaging) across the bootstrap models, and confidence intervals by computing bootstrap percentile confidence intervals within each subset and then reducing across subsets.
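As a rough sketch of this idea (illustrative only; blb_subset_coefs and the toy data below are hypothetical and not the package's internal code), each subset is resampled with multinomial weights that emulate drawing the full sample size, a weighted fit is computed B times, and the results are reduced by averaging:

# Illustrative sketch of BLB for a linear model; not the package's implementation.
blb_subset_coefs <- function(sub_data, B, n_full) {
  X <- model.matrix(y ~ x, sub_data)
  y <- sub_data$y
  replicate(B, {
    # multinomial weights emulate resampling n_full rows from this subset
    w <- as.vector(rmultinom(1, n_full, rep(1, nrow(X))))
    lm.wfit(X, y, w)$coefficients
  })
}

set.seed(1)
toy <- data.frame(x = rnorm(100), y = rnorm(100))
subsets <- split(toy, rep(1:2, each = 50))                   # m = 2 subsets
boots <- lapply(subsets, blb_subset_coefs, B = 100, n_full = nrow(toy))
rowMeans(sapply(boots, rowMeans))                            # reduce: average across subsets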
To create a blblm object, you need a data frame containing the variables needed to fit a linear model. For our example, we will use the R dataset trees, regressing Girth on Height and Volume.
head(trees)
There are two additional arguments to consider here. We must decide how many subsets to split the data into and how many bootstrap samples to draw from each subset. Since trees is a small dataset, we split it into two subsets and use 100 bootstrap samples.
model <- blblm(Girth ~ Height + Volume, data = trees, m = 2, B = 100)
Here, model is a blblm object. A blblm object contains two things: a list of the bootstrapped models for each subset and the formula of the model. Using this object, you can obtain many of the BLB estimates associated with linear models using the class functions described below.
There are many functions that can be called on a blblm object.
Calling print on a blblm object will display the formula associated with the blblm object.
print(model)
Calling sigma on a blblm object returns the BLB estimate of sigma.
sigma(model)
Setting the confidence argument to TRUE returns a 95% confidence interval for sigma. You can additionally specify the level of the confidence interval. Confidence intervals are computed by taking the bootstrap percentile confidence interval within each subset and then averaging the bounds across subsets.
sigma(model, confidence = TRUE, level = 0.95)
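To make the "mean of bootstrap percentile confidence intervals" concrete, here is a minimal sketch with made-up bootstrap values (sigma_boot is a hypothetical stand-in for the per-subset bootstrap estimates of sigma, not an object the package exposes):

# Illustrative only: made-up per-subset bootstrap sigmas, m = 2 subsets.
set.seed(1)
sigma_boot <- list(rnorm(100, mean = 2, sd = 0.1),
                   rnorm(100, mean = 2, sd = 0.1))
level <- 0.95
alpha <- 1 - level
# percentile CI within each subset, then average the bounds across subsets
ci_per_subset <- sapply(sigma_boot, quantile, probs = c(alpha / 2, 1 - alpha / 2))
rowMeans(ci_per_subset)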
Calling coef on a blblm object returns the BLB estimate of the regression coefficients.
coef(model)
Calling confint returns a 95% confidence interval for the predictor variables. You can additionally specify the level of the confidence interval. As above, the intervals are the means of the per-subset bootstrap percentile confidence intervals.
confint(model, level = 0.95)
You can specify a specific variable using the parm argument.
confint(model, parm = 'Volume', level = 0.95)
Calling predict on a blblm object gives predictions for a new dataset.
new_dat <- trees[2:4, -1]
predict(model, new_data = new_dat)
You can also request a confidence interval. Again, the intervals are the means of the per-subset bootstrap percentile confidence intervals.
predict(model, new_data = new_dat, confidence = TRUE, level = 0.95)
Additionally, you can use multicore processing for prediction. Specify the number of workers using the nthreads argument.
predict(model, new_data = new_dat, nthreads = 1)
Since this package is designed for large datasets, it includes several features to speed up computation.
a <- runif(3e4)
b <- runif(3e4)
c <- runif(3e4)
dat <- data.frame(a, b, c)
Our package allows the use of multicore processing through the package furrr. Simply give the number of workers you want through the argument nthreads.
blblm(c ~ b + a + a*b, data = dat, m = 10, B = 5000, nthreads = 8)
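For intuition, the parallelism might look roughly like the following furrr sketch (illustrative only; fit_subset is a hypothetical placeholder, and the package manages the worker plan internally when you pass nthreads):

# Illustrative sketch of furrr-based parallelism; not the package's internal code.
library(furrr)
future::plan(future::multisession, workers = 8)

fit_subset <- function(sub_data) {
  # ... fit B bootstrap models on this subset ...
  nrow(sub_data)  # placeholder result
}

subsets <- split(dat, rep(1:10, length.out = nrow(dat)))     # m = 10 subsets
results <- future_map(subsets, fit_subset,
                      .options = furrr_options(seed = TRUE))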
To see the improvement parallel computing brings, we use the fairly large dataset generated above: three variables, each with 3e4 observations. We can benchmark the performance of splitting into ten subsets with 5,000 bootstrap samples each, using a single core versus 8 cores.
bench::mark(
  blblm(c ~ b + a + a * b, data = dat, m = 10, B = 5000, nthreads = 1),
  blblm(c ~ b + a + a * b, data = dat, m = 10, B = 5000, nthreads = 8),
  check = FALSE, memory = FALSE, filter_gc = FALSE, relative = FALSE
)
As we can see, for large amounts of data the multicore version improves computational speed.
We can repeat this experiment with vectors of length 4e4.
a <- runif(4e4)
b <- runif(4e4)
c <- runif(4e4)
dat <- data.frame(a, b, c)
bench::mark(
  blblm(c ~ b + a + a * b, data = dat, m = 10, B = 5000, nthreads = 1),
  blblm(c ~ b + a + a * b, data = dat, m = 10, B = 5000, nthreads = 8),
  check = FALSE, memory = FALSE, filter_gc = FALSE, relative = FALSE
)
Multicore processing again substantially reduces computation time.
We can use multicore processing for prediction as well.
a <- runif(4e4)
b <- runif(4e4)
c <- runif(4e4)
dat <- data.frame(a, b, c)
model <- blblm(c ~ b + a + a * b, data = dat, m = 10, B = 5000, nthreads = 8)
a <- runif(200)
b <- runif(200)
new_dat <- data.frame(a, b)
bench::mark(
  predict(model, new_data = new_dat, nthreads = 1),
  predict(model, new_data = new_dat, nthreads = 8),
  check = FALSE, memory = FALSE, filter_gc = FALSE, relative = FALSE
)
It does take some time to spin up the workers, so using multiple cores is only time efficient for a large number of predictions.
Additionally, to increase computational efficiency, lm.wfit is rewritten in RcppArmadillo as lmW, and lm2 is provided as an alternative to lm1 that calls lmW instead. The lmW code is based on RcppArmadillo's fastLm.R and fastLm.cpp.
The mathematical idea is simply to change the least squares normal equations from $X^TX\hat{\beta}=X^Ty$ to $X^TWX\hat{\beta}=X^TWy$, where $W$ is a diagonal matrix built from our weights vector.
To implement this in RcppArmadillo, since it is costly to convert the weights vector into the matrix $W$, we make use of the % operator (element-wise multiplication) and the .each_col() member function (apply an operation to each column). Since $y$ is a vector, the fastest way to build $Wy$ is $w \% y$, where $w$ is the weights vector. $X^TWX$ involves no vector, so we pair .each_col() with the % operator to form the weighted design matrix $WX$, multiplying each column of $X$ element-wise by $w$.
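As a quick sanity check of the weighted normal equations themselves (plain R with made-up data, not the C++ code), solving $X^TWX\hat{\beta}=X^TWy$ reproduces lm.wfit's coefficients:

# Plain-R check of the weighted normal equations; made-up data.
set.seed(1)
X <- cbind(1, runif(100), runif(100))
y <- runif(100)
w <- as.vector(rmultinom(1, 100, rep(1, 100)))
beta_ne <- solve(crossprod(X, w * X), crossprod(X, w * y))   # X'WX beta = X'Wy
cbind(normal_eq = drop(beta_ne), lm_wfit = lm.wfit(X, y, w)$coefficients)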
We can benchmark the performance of lmW. Here, we generate another dataset of three variables, each of length 1e4, and a weights vector of length 1e4.
# create a dataset
a <- runif(1e4)
b <- runif(1e4)
c <- runif(1e4)
dat <- data.frame(a, b, c)
# create the model matrix and response
m <- model.frame(b ~ a + c + a * c, dat)
X <- model.matrix(b ~ a + c + a * c, m)
y <- model.response(m)
# create weights
w <- as.vector(rmultinom(1, 1e4, rep(1, nrow(X))))
bench::mark(
  lm.wfit(X, y, w),
  blbmethods:::lmW(X, y, w),
  check = FALSE, memory = FALSE, relative = FALSE
)
As we can see, lmW is significantly faster than lm.wfit. Since BLB calls this function many times, using lmW saves a lot of time.
We can benchmark the whole model-fitting function to show the difference in efficiency between the two versions. Here lm2 denotes the version using lmW.
bench::mark(
  blbmethods:::lm1(X, y, 1e4),
  blbmethods:::lm2(X, y, 1e4),
  check = FALSE, memory = FALSE, relative = FALSE
)
It is apparent that lm2 is much faster.
We can try both again with vectors of length 1e5.
# create a dataset
a <- runif(1e5)
b <- runif(1e5)
c <- runif(1e5)
dat <- data.frame(a, b, c)
# create the model matrix and response
m <- model.frame(b ~ a + c + a * c, dat)
X <- model.matrix(b ~ a + c + a * c, m)
y <- model.response(m)
# create weights
w <- as.vector(rmultinom(1, 1e5, rep(1, nrow(X))))
bench::mark(
  lm.wfit(X, y, w),
  blbmethods:::lmW(X, y, w),
  check = FALSE, memory = FALSE, relative = FALSE
)
Again, we also benchmark lm1 and lm2.
bench::mark(
  blbmethods:::lm1(X, y, 1e5),
  blbmethods:::lm2(X, y, 1e5),
  check = FALSE, memory = FALSE, relative = FALSE
)
As we can see, the relative advantage of our RcppArmadillo function lmW grows even larger.
This package also contains a BLB implementation of the Random Forest algorithm for classification. Since Random Forests are already a bootstrap algorithm, we divide a classification dataset into m portions and create a Random Forest model with B trees for each portion. Note, however, that since the Random Forest algorithm randomly samples predictor variables, bootstrap percentile confidence intervals are not reliable here; instead, we use the jackknife-after-bootstrap.
library(kernlab) # for the spam data
data(spam)
All that is needed to create a blbrf object is a classification dataset. The argument m specifies the number of subsets to divide the dataset into and the argument B specifies the number of bootstrap trees to use per subset.
model <- blbrf(type ~ ., data = spam, m = 10, B = 500, nthreads = 1)
Printing a blbrf model displays the model formula and the classification levels.
print(model)
You can use a blbrf model to predict new data samples. Here, the prediction is based on the average of the class probabilities across the subsets.
new_dat <- spam[56:58, -58]
predict(model, new_dat, type = "prediction")
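Conceptually, the reduction looks like the following sketch (made-up probability matrices for two subsets; not the package's internal code):

# Illustrative only: average per-subset class probabilities, then take the most
# probable class for each observation.
prob_subset1 <- matrix(c(0.8, 0.2,
                         0.4, 0.6), nrow = 2, byrow = TRUE,
                       dimnames = list(NULL, c("nonspam", "spam")))
prob_subset2 <- matrix(c(0.7, 0.3,
                         0.3, 0.7), nrow = 2, byrow = TRUE,
                       dimnames = list(NULL, c("nonspam", "spam")))
avg_prob <- (prob_subset1 + prob_subset2) / 2    # average over m subsets
colnames(avg_prob)[max.col(avg_prob)]            # predicted class per observation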
You can specify to return the probability of each class.
predict(model, new_dat, type = "probability")
You can also request a confidence interval. Random Forest confidence intervals use the jackknife-after-bootstrap rather than bootstrap percentile confidence intervals, so a sample size of 20 or greater is needed to calibrate the standard error.
new_dat <- spam[56:76, -58]
predict(model, new_dat, type = "CI", level = 0.95)
Finally, there is multicore processing available for predictions. You can specify the number of cores using the nthreads argument.
new_dat <- spam[56, -58]
predict(model, new_dat, type = "prediction", nthreads = 1)
Since we are using the ranger package, which is largely written in C++, there is no need for us to write any compiled code ourselves. However, we do provide the option of multicore processing.
The ranger package has built-in multicore processing, which we use when building the model.
bench::mark(
  blbrf(type ~ ., data = spam, m = 10, B = 5000, nthreads = 1),
  blbrf(type ~ ., data = spam, m = 10, B = 5000, nthreads = 8),
  check = FALSE, memory = FALSE, filter_gc = FALSE, relative = FALSE
)
We can also use multicore processing for predictions. First, let's create a model with 5,000 trees for each split.
model <- blbrf(type ~ ., data = spam, m = 10, B = 5000, nthreads = 8)
new_dat <- spam[56, -58]
bench::mark(
  predict(model, new_dat, type = "prediction", level = 0.95, nthreads = 1),
  predict(model, new_dat, type = "prediction", level = 0.95, nthreads = 8),
  check = FALSE, memory = FALSE, filter_gc = FALSE, relative = FALSE
)
With just one prediction, multicore is only slightly faster. We can try again with many predictions.
new_dat <- spam[101:800, -58]
bench::mark(
  predict(model, new_dat, type = "prediction", level = 0.95, nthreads = 1),
  predict(model, new_dat, type = "prediction", level = 0.95, nthreads = 8),
  check = FALSE, memory = FALSE, filter_gc = FALSE, relative = FALSE
)
As we can see, for 700 predictions, multicore processing with 8 cores is much faster.
The final method provided by this package is binary logistic regression. The idea is to divide a binary classification dataset into m parts and create B bootstrapped logistic regression models from each part. Estimates are found by reducing the models, and confidence intervals by computing bootstrap percentile confidence intervals within each subset and then reducing across subsets.
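For intuition, one BLB bootstrap replicate for logistic regression can be sketched as a weighted fit (illustrative only, with made-up data; not the package's internal code):

# One illustrative bootstrap replicate on one subset, using multinomial weights.
set.seed(1)
sub <- data.frame(x = rnorm(50), y = rbinom(50, 1, 0.5))     # made-up subset
X <- model.matrix(y ~ x, sub)
# pretend the full dataset has 500 rows; weights emulate resampling that many
w <- as.vector(rmultinom(1, 500, rep(1, nrow(sub))))
glm.fit(X, sub$y, weights = w, family = binomial())$coefficients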
We will use the same spam dataset as in the blbrf example. However, we will fit only 3 variables to avoid the curse-of-dimensionality problems that arise when using linear classifiers such as logistic regression. Again, m denotes the number of splits, B the number of bootstrap samples per split, and nthreads the number of workers used for parallelization.
model <- blbglm(type ~ make + address + all, data = spam, m = 2, B = 500, nthreads = 1)
There are many functions that can be called on a blbglm object.
Calling print displays the formula and the response levels of the blbglm object.
print(model)
You can obtain the coefficients of the model. Note that in the case of logistic regression, the coefficients are defined through $\log\left(\frac{p}{1-p}\right)=\hat{\beta}_0+\hat{\beta}_1X_1+\cdots+\hat{\beta}_nX_n$, where $p=P(Y=1\mid X)$. A coefficient of $\hat{\beta}_1=1$ means that a one-unit increase in $X_1$ multiplies the odds $\frac{p}{1-p}$ by $e\approx 2.72$, holding the other predictors fixed.
coef(model)
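A convenient way to read the coefficients on the odds scale is to exponentiate them; the exponential of a coefficient is the multiplicative change in the odds $\frac{p}{1-p}$ for a one-unit increase in that predictor.

exp(coef(model))  # exponentiated coefficients: odds ratios (the intercept gives the baseline odds)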
You can also obtain confidence intervals for the coefficients of the model. As before, the intervals are the means of the per-subset bootstrap percentile confidence intervals.
confint(model, level = 0.95)
A specific predictor variable can also be specified.
confint(model, parm = "make", level = 0.95)
You can predict using new observations.
new_dat <- spam[41:44, -58]
predict(model, new_dat)
You can specify to return the probabilities. Note that the returned probabilities are given by $p=\frac{1}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1X_1+\cdots+\hat{\beta}_nX_n)}}$, and an observation is classified as class $1$ when $p>0.5$ and class $2$ otherwise.
predict(model, new_dat, type = "probability")
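To make the formula concrete, here is the arithmetic with made-up numbers (not output from the model above):

# Illustrative arithmetic only; beta and x are made up.
beta <- c(intercept = -1.5, make = 0.4, address = 0.2, all = 0.3)
x    <- c(1, 0.5, 1.0, 0.2)                 # 1 for the intercept, then predictor values
p    <- 1 / (1 + exp(-sum(beta * x)))
p                                           # predicted probability; class 1 if p > 0.5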
You can also return a confidence interval for the predicted probabilities. As before, the intervals are the means of the per-subset bootstrap percentile confidence intervals.
predict(model, new_dat, type = "CI", level = 0.95)
Finally, you can enable multicore processing using the nthreads argument.
predict(model, new_dat, type = "probability", nthreads = 1)
Again, multicore processing is offered through the package furrr. Simply specify the number of workers using the nthreads argument.
blbglm(type ~ make + address + all, data = spam, m = 2, B = 5000, nthreads = 8)
We can check the performance using the spam dataset, which has 4,601 rows. We'll compare the performance of using 8 cores vs 1 core when dividing the dataset in half and getting 5,000 bootstrap samples.
bench::mark(
  blbglm(type ~ make + address + all, data = spam, m = 2, B = 5000, nthreads = 1),
  blbglm(type ~ make + address + all, data = spam, m = 2, B = 5000, nthreads = 8),
  check = FALSE, memory = FALSE, filter_gc = FALSE, relative = FALSE
)
As we can see, using 8 cores speeds things up considerably. We can try again, doubling the number of splits.
bench::mark(
  blbglm(type ~ make + address + all, data = spam, m = 4, B = 5000, nthreads = 1),
  blbglm(type ~ make + address + all, data = spam, m = 4, B = 5000, nthreads = 8),
  check = FALSE, memory = FALSE, filter_gc = FALSE, relative = FALSE
)
As we can see, the advantage of 8 cores is even greater.
We can also use multicore processing for predictions.
model <- blbglm(type ~ make + address + all, data = spam, m = 2, B = 5000, nthreads = 8)
new_dat <- spam[1000:1500, -58]
bench::mark(
  predict(model, new_dat, nthreads = 1),
  predict(model, new_dat, nthreads = 8),
  check = FALSE, memory = FALSE, filter_gc = FALSE, relative = FALSE
)
It does take some time to create the cores, so this method is only efficient for large sample sizes.