knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(blbmethods)
Suppose we have a massive, high-dimensional dataset. Working on the whole dataset at once is not practical, and MapReduce alone does not give us sufficiently accurate results. Hence we turn to the bag of little bootstraps (BLB).
This package provides BLB implementations for:
Linear Models
Random Forests
Logistic Regression
This vignette explains how the BLB algorithm is implemented for each model, documents how to use the provided functions, and presents performance benchmarks for the package's multicore functions and C++ code.
The first BLB implementation of a statistical model provided by this package is for linear models. The idea is to divide a regression dataset into m subsets and, from each subset, fit B bootstrap models. Point estimates are found by reducing (averaging) across the bootstrap models, and confidence intervals by computing bootstrap percentile confidence intervals within each subset and then reducing across subsets.
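As a rough sketch of this idea (illustrative only; blb_subset_coefs and the toy data below are hypothetical and not the package's internal code), each subset is resampled with multinomial weights that emulate drawing the full sample size, a weighted fit is computed B times, and the results are reduced by averaging:

# Illustrative sketch of BLB for a linear model; not the package's implementation.
blb_subset_coefs <- function(sub_data, B, n_full) {
  X <- model.matrix(y ~ x, sub_data)
  y <- sub_data$y
  replicate(B, {
    # multinomial weights emulate resampling n_full rows from this subset
    w <- as.vector(rmultinom(1, n_full, rep(1, nrow(X))))
    lm.wfit(X, y, w)$coefficients
  })
}

set.seed(1)
toy <- data.frame(x = rnorm(100), y = rnorm(100))
subsets <- split(toy, rep(1:2, each = 50))                   # m = 2 subsets
boots <- lapply(subsets, blb_subset_coefs, B = 100, n_full = nrow(toy))
rowMeans(sapply(boots, rowMeans))                            # reduce: average across subsets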
To create a blblm object, you need a data frame containing the variables needed to fit a linear model. For our example, we will use the R dataset trees, regressing Girth on Height and Volume.
head(trees)
There are two additional arguments to consider here. We must decide how many subsets to split the data into and how many bootstrap samples to draw from each subset. Since trees is a small dataset, we split it into two subsets and use 100 bootstrap samples.
model <- blblm(Girth ~ Height + Volume, data = trees, m = 2, B = 100)
Here, model is a blblm object. A blblm object contains two things: a list of the bootstrapped models for each subset and the formula of the model. Using this object, you can obtain many of the BLB estimates associated with linear models using the class functions described below.
There are many functions that can be called on a blblm object.
Calling print on a blblm object will display the formula associated with the blblm object.
print(model)
Calling sigma on a blblm object returns the BLB estimate of sigma.
sigma(model)
Setting the confidence argument to TRUE returns a 95% confidence interval for sigma. You can additionally specify the level of the confidence interval. Confidence intervals are computed by taking the bootstrap percentile confidence interval within each subset and then averaging the bounds across subsets.
sigma(model, confidence = TRUE, level = 0.95)
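To make the "mean of bootstrap percentile confidence intervals" concrete, here is a minimal sketch with made-up bootstrap values (sigma_boot is a hypothetical stand-in for the per-subset bootstrap estimates of sigma, not an object the package exposes):

# Illustrative only: made-up per-subset bootstrap sigmas, m = 2 subsets.
set.seed(1)
sigma_boot <- list(rnorm(100, mean = 2, sd = 0.1),
                   rnorm(100, mean = 2, sd = 0.1))
level <- 0.95
alpha <- 1 - level
# percentile CI within each subset, then average the bounds across subsets
ci_per_subset <- sapply(sigma_boot, quantile, probs = c(alpha / 2, 1 - alpha / 2))
rowMeans(ci_per_subset)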
Calling coef on a blblm object returns the BLB estimate of the regression coefficients.
coef(model)
Calling confint returns a 95% confidence interval for the predictor variables. You can additionally specify the level of the confidence interval. As above, the intervals are the means of the per-subset bootstrap percentile confidence intervals.
confint(model, level = 0.95)
You can specify a specific variable using the parm argument.
confint(model, parm = 'Volume', level = 0.95)
Calling predict on a blblm object gives predictions for a new dataset.
new_dat <- trees[2:4, -1]
predict(model, new_data = new_dat)
You can also request a confidence interval. Again, the intervals are the means of the per-subset bootstrap percentile confidence intervals.
predict(model, new_data = new_dat, confidence = TRUE, level = 0.95)
Additionally, you can use multicore processing for prediction. Specify the number of workers using the nthreads argument.
predict(model, new_data = new_dat, nthreads = 1)
Since this package is designed for large datasets, it includes several features to speed up computation.
a <- runif(3e4)
b <- runif(3e4)
c <- runif(3e4)
dat <- data.frame(a, b, c)
Our package allows the use of multicore processing through the package furrr. Simply give the number of workers you want through the argument nthreads.
blblm(c ~ b + a + a*b, data = dat, m = 10, B = 5000, nthreads = 8)
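For intuition, the parallelism might look roughly like the following furrr sketch (illustrative only; fit_subset is a hypothetical placeholder, and the package manages the worker plan internally when you pass nthreads):

# Illustrative sketch of furrr-based parallelism; not the package's internal code.
library(furrr)
future::plan(future::multisession, workers = 8)

fit_subset <- function(sub_data) {
  # ... fit B bootstrap models on this subset ...
  nrow(sub_data)  # placeholder result
}

subsets <- split(dat, rep(1:10, length.out = nrow(dat)))     # m = 10 subsets
results <- future_map(subsets, fit_subset,
                      .options = furrr_options(seed = TRUE))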
To see the improvement parallel computing brings, we use the fairly large dataset generated above: three variables, each with 3e4 observations. We can benchmark the performance of splitting into ten subsets with 5,000 bootstrap samples each, using a single core versus 8 cores.
bench::mark(
  blblm(c ~ b + a + a * b, data = dat, m = 10, B = 5000, nthreads = 1),
  blblm(c ~ b + a + a * b, data = dat, m = 10, B = 5000, nthreads = 8),
  check = FALSE, memory = FALSE, filter_gc = FALSE, relative = FALSE
)
As we can see, for large amounts of data the multicore version improves computational speed.
We can repeat this experiment with vectors of length 4e4.
a <- runif(4e4)
b <- runif(4e4)
c <- runif(4e4)
dat <- data.frame(a, b, c)
bench::mark(
  blblm(c ~ b + a + a * b, data = dat, m = 10, B = 5000, nthreads = 1),
  blblm(c ~ b + a + a * b, data = dat, m = 10, B = 5000, nthreads = 8),
  check = FALSE, memory = FALSE, filter_gc = FALSE, relative = FALSE
)
Multicore processing again substantially reduces computation time.
We can use multicore processing for prediction as well.
a <- runif(4e4)
b <- runif(4e4)
c <- runif(4e4)
dat <- data.frame(a, b, c)
model <- blblm(c ~ b + a + a * b, data = dat, m = 10, B = 5000, nthreads = 8)
a <- runif(200)
b <- runif(200)
new_dat <- data.frame(a, b)
bench::mark(
  predict(model, new_data = new_dat, nthreads = 1),
  predict(model, new_data = new_dat, nthreads = 8),
  check = FALSE, memory = FALSE, filter_gc = FALSE, relative = FALSE
)
It does take some time to spin up the workers, so using multiple cores is only time efficient for a large number of predictions.
Additionally, to increase computational efficiency, lm.wfit is rewritten in RcppArmadillo as lmW, and lm2 is provided as an alternative to lm1 that calls lmW instead. The lmW code is based on RcppArmadillo's fastLm.R and fastLm.cpp.
The mathematical idea is simply to change the least squares normal equations from $X^TX\hat{\beta}=X^Ty$ to $X^TWX\hat{\beta}=X^TWy$, where $W$ is a diagonal matrix built from our weights vector.
To implement this in RcppArmadillo, since it is costly to convert the weights vector into the matrix $W$, we make use of the % operator (element-wise multiplication) and the .each_col() member function (apply an operation to each column). Since $y$ is a vector, the fastest way to build $Wy$ is $w \% y$, where $w$ is the weights vector. $X^TWX$ involves no vector, so we pair .each_col() with the % operator to form the weighted design matrix $WX$, multiplying each column of $X$ element-wise by $w$.
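As a quick sanity check of the weighted normal equations themselves (plain R with made-up data, not the C++ code), solving $X^TWX\hat{\beta}=X^TWy$ reproduces lm.wfit's coefficients:

# Plain-R check of the weighted normal equations; made-up data.
set.seed(1)
X <- cbind(1, runif(100), runif(100))
y <- runif(100)
w <- as.vector(rmultinom(1, 100, rep(1, 100)))
beta_ne <- solve(crossprod(X, w * X), crossprod(X, w * y))   # X'WX beta = X'Wy
cbind(normal_eq = drop(beta_ne), lm_wfit = lm.wfit(X, y, w)$coefficients)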
We can benchmark the performance of lmW. Here, we generate another dataset of three variables, each of length 1e4, and a weights vector of length 1e4.
# create a dataset
a <- runif(1e4)
b <- runif(1e4)
c <- runif(1e4)
dat <- data.frame(a, b, c)
# create the model matrix and response
m <- model.frame(b ~ a + c + a * c, dat)
X <- model.matrix(b ~ a + c + a * c, m)
y <- model.response(m)
# create weights
w <- as.vector(rmultinom(1, 1e4, rep(1, nrow(X))))
bench::mark(
  lm.wfit(X, y, w),
  blbmethods:::lmW(X, y, w),
  check = FALSE, memory = FALSE, relative = FALSE
)
As we can see, lmW is significantly faster than lm.wfit. Since BLB calls this function many times, using lmW saves a lot of time.
We can benchmark the whole model-fitting function to show the difference in efficiency between the two versions. Here lm2 denotes the version using lmW.
bench::mark(
  blbmethods:::lm1(X, y, 1e4),
  blbmethods:::lm2(X, y, 1e4),
  check = FALSE, memory = FALSE, relative = FALSE
)
It is apparent that lm2 is much faster.
We can try both again with vectors of length 1e5.
# create a dataset
a <- runif(1e5)
b <- runif(1e5)
c <- runif(1e5)
dat <- data.frame(a, b, c)
# create the model matrix and response
m <- model.frame(b ~ a + c + a * c, dat)
X <- model.matrix(b ~ a + c + a * c, m)
y <- model.response(m)
# create weights
w <- as.vector(rmultinom(1, 1e5, rep(1, nrow(X))))
bench::mark(
  lm.wfit(X, y, w),
  blbmethods:::lmW(X, y, w),
  check = FALSE, memory = FALSE, relative = FALSE
)
Again, we also benchmark lm1 and lm2.
bench::mark(
  blbmethods:::lm1(X, y, 1e5),
  blbmethods:::lm2(X, y, 1e5),
  check = FALSE, memory = FALSE, relative = FALSE
)
As we can see, the relative advantage of our RcppArmadillo function lmW grows even larger.
This package also contains a BLB implementation of the Random Forest algorithm for classification. Since Random Forests are already a bootstrap algorithm, we divide a classification dataset into m portions and create a Random Forest model with B trees for each portion. Note, however, that since the Random Forest algorithm randomly samples predictor variables, bootstrap percentile confidence intervals are not reliable here; instead, we use the jackknife-after-bootstrap.
library(kernlab) # for the spam data
data(spam)
All that is needed to create a blbrf object is a classification dataset. The argument m specifies the number of subsets to divide the dataset into and the argument B specifies the number of bootstrap trees to use per subset.
model <- blbrf(type ~ ., data = spam, m = 10, B = 500, nthreads = 1)
Printing a blbrf model displays the model formula and the classification levels.
print(model)
You can use a blbrf model to predict new data samples. Here, the prediction is based on the average of the class probabilities across the subsets.
new_dat <- spam[56:58, -58]
predict(model, new_dat, type = "prediction")
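Conceptually, the reduction looks like the following sketch (made-up probability matrices for two subsets; not the package's internal code):

# Illustrative only: average per-subset class probabilities, then take the most
# probable class for each observation.
prob_subset1 <- matrix(c(0.8, 0.2,
                         0.4, 0.6), nrow = 2, byrow = TRUE,
                       dimnames = list(NULL, c("nonspam", "spam")))
prob_subset2 <- matrix(c(0.7, 0.3,
                         0.3, 0.7), nrow = 2, byrow = TRUE,
                       dimnames = list(NULL, c("nonspam", "spam")))
avg_prob <- (prob_subset1 + prob_subset2) / 2    # average over m subsets
colnames(avg_prob)[max.col(avg_prob)]            # predicted class per observation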
You can specify to return the probability of each class.
predict(model, new_dat, type = "probability")
You can also request a confidence interval. Random Forest confidence intervals use the jackknife-after-bootstrap rather than bootstrap percentile confidence intervals, so a sample size of 20 or greater is needed to calibrate the standard error.
new_dat <- spam[56:76, -58]
predict(model, new_dat, type = "CI", level = 0.95)
Finally, there is multicore processing available for predictions. You can specify the number of cores using the nthreads argument.
new_dat <- spam[56, -58]
predict(model, new_dat, type = "prediction", nthreads = 1)
Since we are using the ranger package, which is largely written in C++, there is no need for us to write any compiled code ourselves. However, we do provide the option of multicore processing.
The ranger package has built-in multicore processing, which we use when building the model.
bench::mark(
  blbrf(type ~ ., data = spam, m = 10, B = 5000, nthreads = 1),
  blbrf(type ~ ., data = spam, m = 10, B = 5000, nthreads = 8),
  check = FALSE, memory = FALSE, filter_gc = FALSE, relative = FALSE
)
We can also use multicore processing for predictions. First, let's create a model with 5,000 trees for each split.
model <- blbrf(type ~ ., data = spam, m = 10, B = 5000, nthreads = 8)
new_dat <- spam[56, -58]
bench::mark(
  predict(model, new_dat, type = "prediction", level = 0.95, nthreads = 1),
  predict(model, new_dat, type = "prediction", level = 0.95, nthreads = 8),
  check = FALSE, memory = FALSE, filter_gc = FALSE, relative = FALSE
)
With just one prediction, multicore is only slightly faster. We can try again with many predictions.
new_dat <- spam[101:800, -58]
bench::mark(
  predict(model, new_dat, type = "prediction", level = 0.95, nthreads = 1),
  predict(model, new_dat, type = "prediction", level = 0.95, nthreads = 8),
  check = FALSE, memory = FALSE, filter_gc = FALSE, relative = FALSE
)
As we can see, for 700 predictions, multicore processing with 8 cores is much faster.
The final method provided by this package is binary logistic regression. The idea is to divide a binary classification dataset into m parts and create B bootstrapped logistic regression models from each part. Estimates are found by reducing the models, and confidence intervals by computing bootstrap percentile confidence intervals within each subset and then reducing across subsets.
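For intuition, one BLB bootstrap replicate for logistic regression can be sketched as a weighted fit (illustrative only, with made-up data; not the package's internal code):

# One illustrative bootstrap replicate on one subset, using multinomial weights.
set.seed(1)
sub <- data.frame(x = rnorm(50), y = rbinom(50, 1, 0.5))     # made-up subset
X <- model.matrix(y ~ x, sub)
# pretend the full dataset has 500 rows; weights emulate resampling that many
w <- as.vector(rmultinom(1, 500, rep(1, nrow(sub))))
glm.fit(X, sub$y, weights = w, family = binomial())$coefficients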
We will use the same spam dataset as in the blbrf example. However, we will fit only 3 variables to avoid the curse-of-dimensionality problems that arise when using linear classifiers such as logistic regression. Again, m denotes the number of splits, B the number of bootstrap samples per split, and nthreads the number of workers used for parallelization.
model <- blbglm(type ~ make + address + all, data = spam, m = 2, B = 500, nthreads = 1)
There are many functions that can be called on a blbglm object.
Calling print displays the formula and the response levels of the blbglm object.
print(model)
You can obtain the coefficients of the model. Note that in the case of logistic regression, the coefficients are defined through $\log\left(\frac{p}{1-p}\right)=\hat{\beta}_0+\hat{\beta}_1X_1+\cdots+\hat{\beta}_nX_n$, where $p=P(Y=1\mid X)$. A coefficient of $\hat{\beta}_1=1$ means that a one-unit increase in $X_1$ multiplies the odds $\frac{p}{1-p}$ by $e\approx 2.72$, holding the other predictors fixed.
coef(model)
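A convenient way to read the coefficients on the odds scale is to exponentiate them; the exponential of a coefficient is the multiplicative change in the odds $\frac{p}{1-p}$ for a one-unit increase in that predictor.

exp(coef(model))  # exponentiated coefficients: odds ratios (the intercept gives the baseline odds)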
You can also obtain confidence intervals for the coefficients of the model. As before, the intervals are the means of the per-subset bootstrap percentile confidence intervals.
confint(model, level = 0.95)
A specific predictor variable can also be specified.
confint(model, parm = "make", level = 0.95)
You can predict using new observations.
new_dat <- spam[41:44, -58]
predict(model, new_dat)
You can specify to return the probabilities. Note that the returned probabilities are given by $p=\frac{1}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1X_1+\cdots+\hat{\beta}_nX_n)}}$, and an observation is classified as class $1$ when $p>0.5$ and class $2$ otherwise.
predict(model, new_dat, type = "probability")
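To make the formula concrete, here is the arithmetic with made-up numbers (not output from the model above):

# Illustrative arithmetic only; beta and x are made up.
beta <- c(intercept = -1.5, make = 0.4, address = 0.2, all = 0.3)
x    <- c(1, 0.5, 1.0, 0.2)                 # 1 for the intercept, then predictor values
p    <- 1 / (1 + exp(-sum(beta * x)))
p                                           # predicted probability; class 1 if p > 0.5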
You can also return a confidence interval for the predicted probabilities. As before, the intervals are the means of the per-subset bootstrap percentile confidence intervals.
predict(model, new_dat, type = "CI", level = 0.95)
Finally, you can enable multicore processing using the nthreads argument.
predict(model, new_dat, type = "probability", nthreads = 1)
Again, multicore processing is offered through the package furrr. Simply specify the number of workers using the nthreads argument.
blbglm(type ~ make + address + all, data = spam, m = 2, B = 5000, nthreads = 8)
We can check the performance using the spam dataset, which has 4,601 rows. We'll compare the performance of using 8 cores vs 1 core when dividing the dataset in half and getting 5,000 bootstrap samples.
bench::mark(
  blbglm(type ~ make + address + all, data = spam, m = 2, B = 5000, nthreads = 1),
  blbglm(type ~ make + address + all, data = spam, m = 2, B = 5000, nthreads = 8),
  check = FALSE, memory = FALSE, filter_gc = FALSE, relative = FALSE
)
As we can see, using 8 cores speeds things up considerably. We can try again, doubling the number of splits.
bench::mark(
  blbglm(type ~ make + address + all, data = spam, m = 4, B = 5000, nthreads = 1),
  blbglm(type ~ make + address + all, data = spam, m = 4, B = 5000, nthreads = 8),
  check = FALSE, memory = FALSE, filter_gc = FALSE, relative = FALSE
)
As we can see, the advantage of 8 cores is even greater.
We can also use multicore processing for predictions.
model <- blbglm(type ~ make + address + all, data = spam, m = 2, B = 5000, nthreads = 8)
new_dat <- spam[1000:1500, -58]
bench::mark(
  predict(model, new_dat, nthreads = 1),
  predict(model, new_dat, nthreads = 8),
  check = FALSE, memory = FALSE, filter_gc = FALSE, relative = FALSE
)
It does take some time to create the cores, so this method is only efficient for large sample sizes.