In calebwpalmer/doctr: Smart Modeling Diagnostics

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(doctr)

"All models are wrong, and some are useful" -George Box

the dx() function, medical shorthand for diagnosis, is used to calculate model diagnostics.

We will use the star wars dataset for this vignette, which is loaded with the dplyr package.

library(dplyr)
df_starwars <- dplyr::starwars

Let's see if we can predict a character's height by using their mass:

lm_starwars <- lm(height ~ mass, data = df_starwars)
summary(lm_starwars)

It looks like mass is not a significant predictor of height (p-value = 0.312) and the model was not a great predictor of variability in height (Adj. R^2 < 0.001). Let's take a closer look at our data and see if we have any potential outliers.

doctr provides a number of statistical tests that can be used for model diagnostics. All tests within doctr contain the prefix "dx_" for a unified syntax. Common diagnostic tests used for linear modeling are Standardized Residuals, Student Residuals, Cook's Distance, and the Difference in Fits (DFITS). We can call these individually:

dx_res_stand(lm_starwars)
dx_res_stud(lm_starwars)
dx_cooks(lm_starwars)
dx_dfits(lm_starwars)

We can also use dx(), which will assign the model to each dx function called. If dx is called without specified dx_test(s), it will automatically assign conventional tests for the model type and family.*

automatic selection currently only works for lm() class models.

dx(lm_starwars, dx_cooks(), dx_res_stand())

dx(lm_starwars)

test <- dx(lm_starwars)

It looks like there are several potential outliers, but observation 16 is particularly problematic. Although we are using conventional thresholds, manually defined thresholds can be used as well. Let's see how far we can push Cook's distance and the DFITS:

dx_cooks(lm_starwars, .threshold = 500)
dx_dfits(lm_starwars, 50)

Normally the common thresholds for Cook's distance and DFITS is 1, observation 16 is a pretty significant outlier. Let's take a look at the observation's row:

df_starwars[16,]

Jabba the Hutt was the outlier, as his mass:height ratio is much different than the rest of the sample. Let's remove him from our df and try our model again.

df_sans_jabba <- df_starwars %>% 
  filter(name != "Jabba Desilijic Tiure")

lm_sans_jabba <- lm(height ~ mass, data = df_sans_jabba)
summary(lm_sans_jabba)

That's much better! Mass is now a highly significant predictor of height, and our model now accounts for 58% of variability in height among all Star Wars characters (except for Hutts)

calebwpalmer/doctr documentation built on Sept. 18, 2020, 5:18 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com