Use Case: Linear Regression with Outlier Correction and Bootstrap
In reproducibleRchunks: Automated Reproducibility Checks for R Markdown Documents

This vignette demonstrates how to perform a robust linear regression analysis on the Boston housing data. We start by setting global chunk options and loading the required packages.

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(reproducibleRchunks)
library(MASS)

Load Data

We load the Boston housing data from the MASS package and display the first rows to get an overview of the variables.

boston <- MASS::Boston
knitr::kable(head(boston))

Outlier Correction

Next, we remove observations with extremely high or low values of medv using the IQR rule to reduce the influence of outliers.

iqr_medv <- IQR(boston$medv)
q <- quantile(boston$medv, c(0.25, 0.75))
lower <- q[1] - 1.5 * iqr_medv
upper <- q[2] + 1.5 * iqr_medv
boston <- boston[boston$medv >= lower & boston$medv <= upper, ]

Linear Regression

We then fit a simple linear regression predicting median house value from the percentage of lower status population (lstat).

model <- lm(medv ~ lstat, data = boston)
summary(model)

Bootstrap Coefficients

To assess the variability of the coefficients we perform a simple bootstrap. Because analyses based on random numbers lead to different results across runs, we set a random seed to make this step reproducible. See Brandmaier et al. for a detailed discussion of reproducibility and random numbers (https://qcmb.psychopen.eu/index.php/qcmb/article/view/3763/3763.html).

set.seed(123)
boot_coefs <- replicate(1000, {
  idx <- sample(nrow(boston), replace = TRUE)
  coef(lm(medv ~ lstat, data = boston[idx, ]))
})
boot_means <- rowMeans(boot_coefs)
boot_means

Plot

Finally, we visualise the relation between lstat and medv together with the fitted regression line. Since plots do return objects, they are not tested for reproducibility.

plot(boston$lstat, boston$medv,
     xlab = "Lower status (% population)",
     ylab = "Median value of homes",
     main = "Boston Housing Data")
abline(model, col = "red")