This vignette demonstrates how to perform a robust linear regression analysis on the Boston housing data. We start by setting global chunk options and loading the required packages.
knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(reproducibleRchunks) library(MASS)
We load the Boston housing data from the MASS
package and display the
first rows to get an overview of the variables.
boston <- MASS::Boston knitr::kable(head(boston))
Next, we remove observations with extremely high or low values of
medv
using the IQR rule to reduce the influence of outliers.
iqr_medv <- IQR(boston$medv) q <- quantile(boston$medv, c(0.25, 0.75)) lower <- q[1] - 1.5 * iqr_medv upper <- q[2] + 1.5 * iqr_medv boston <- boston[boston$medv >= lower & boston$medv <= upper, ]
We then fit a simple linear regression predicting median house value
from the percentage of lower status population (lstat
).
model <- lm(medv ~ lstat, data = boston) summary(model)
To assess the variability of the coefficients we perform a simple bootstrap. Because analyses based on random numbers lead to different results across runs, we set a random seed to make this step reproducible. See Brandmaier et al. for a detailed discussion of reproducibility and random numbers (https://qcmb.psychopen.eu/index.php/qcmb/article/view/3763/3763.html).
set.seed(123) boot_coefs <- replicate(1000, { idx <- sample(nrow(boston), replace = TRUE) coef(lm(medv ~ lstat, data = boston[idx, ])) }) boot_means <- rowMeans(boot_coefs) boot_means
Finally, we visualise the relation between lstat
and medv
together
with the fitted regression line. Since plots do return objects, they
are not tested for reproducibility.
plot(boston$lstat, boston$medv, xlab = "Lower status (% population)", ylab = "Median value of homes", main = "Boston Housing Data") abline(model, col = "red")
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.