knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(lmBootCompare)

Linear regression is one of the most commonly used tools in the field of statistics.

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

The popularity and versatility of linear models across many disciplines yields a set of unique challenges, one of which being that the theoretical assumptions which make a linear model reasonable may not be met. The sample size of the observed data should either be sufficiently large, or the error distribution should be reasonably normally distributed. Additionally, in the above model formulation, errors are assumed to be independent and identically distributed (i.e. uncorrelated with constant variance). If any of the assumptions listed above are violated, then we may not wish to trust inference based on the ordinary least squares or maximum likelihood estimates of the variance of $\boldsymbol{\beta}$.

Let's take a look at an example. The trans-Alaska Pipeline System moves crude oil from the south to the northern tip of Alaska, and is one of the largest pipeline systems in the world. The National Institutes of Standards and Technology (NIST) collected field measurements of defects in the Alaska pipeline, then remeasured the same defects in the laboratory. Suppose we would like to use the Pipeline data in R to investigate whether field measurements can significantly predict more accurate laboratory measurements. We will ignore the potential batch effect for now.

library(faraway)
data("pipeline")
head(pipeline)
pipeline <- pipeline[,-3]



To determine whether field measurements can significantly predict more accurate laboratory measurements, we fit a simple linear model. According to the summary output, we can be confident that there is a positive association between field measurements and laboratory measurements.

lMod <- lm(Lab~Field,data=pipeline)
sumary(lMod)



We did not, however, check the assumptions which the linear model uses to conduct the above inference. Particularly, based on the plot of residuals vs. fitted values, we cannot be confident that the constant variance of residuals assumption is met (the normality assumption doesn't look great either, though we likely have enough observations to outweigh this).

plot(lMod, c(1,2))



What can we do? How can we be reasonably confident that our inference about $\boldsymbol{\beta}$ is trustworthy? There are many options, including weighted least squares, generalized least squares, and transformation (to name a few). However, these require some familiarity of working with data, and for weighted and generalized least squares a slightly more advanced knowledge of statistics is implied.

The bootstrap allows us to study the variance of any general "plug-in" estimate, including regression coefficients. Instead of searching for a model that is robust to our violated assumptions, we can simply simulate the distribution of $\boldsymbol{\hat{\beta}}$, the estimate of $\boldsymbol{\beta}$. For linear regression models, there are three main ways to generate the bootstrapped sampling distribution of a regression coefficient:


We won't get too deep into the specifics here, as documentation is extensive and cited in the description. The empirical bootstrap and Wild bootstrap are both appropriate under non-constant variance. Empirical bootstrap ignores the residuals entirely, and simulated the joint distribution of $(\mathbf{X},Y)$. Wild bootstrap performs local variance estimation by using an auxiliary multiplier on the model residuals. There are four commonly used multipliers: Mammen's two-point distribution, Mammen's continuous distribution, the Rademacher distribution, and the Standard Normal distribution. However, if the model fit is decent and there is no evidence of non-constant variance, it is most efficient to used model-based resampling. As there are many potential approaches and it may not be immediately clear which is desired, it might be of interest to the user to conduct multiple approaches and compare the results.

The primary purpose of this package is to compare the bootstrapped sampling distributions of linear regression coefficients obtained using the six different methods. There are four main functions in this package to facilitate this.

First we will generate bootsrap samples from the sampling distribution of $\hat{\beta}{Field}$. We generate B=1000 bootstrap replicates of $\hat{\beta}{Field}$ for each of the six bootstrapping methods.

set.seed(123)
bootsamps <- bootstrapSamples( lmodObs = lMod, formula = "Lab ~ Field", data = pipeline, B = 1000 )



Now we will compare the sampling distributions of $\hat{\beta}_{Field}$ generated by each of the six bootstrapping methods. We will start off by comparing the summary statistics for the sampling distributions accross methods:

bootstrapSummary(bootsamps, lMod)


We can see that the summary statistics describing the sampling distributions generated by the residual bootstrap are distinct from the empirical and wild (all multipliers). This is because the residual bootstrap treats the residuals as arising from a common distribution, whereas empirical and wild allow for heteroskedasticity.

We can also visualize the differences in sampling distributions:

coefficientPlots(lMod, bootsamps)



One of the most useful aspects of bootstrapping regression coefficients is the availability of non-parametric confidence intervals. This allows us to check whether our conclusions from the initial inference about $\beta_{Field}$ is trustworthy.

bootstrapConfInt(lMod, bootsamps, 0.95)


As none of the bootstrapped confidence intervals for $\beta_{Field}$ include zero, we can be more confident that the association between field measurements and laboratory measurements is present despite our violated linear model assumptions.



ndelrocco/lmBootCompare documentation built on Dec. 10, 2019, 12:38 p.m.