```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```r
library(appregr)
library(pander)
```
```r
modelname <- "prostate"
mdf    <- appregr::getmodel(modelname)
df     <- mdf$data
lm.fit <- mdf$fit
```
```r
high.leverage <- appregr::checkleverage(lm.fit, df)
pander(high.leverage, caption = "High Leverage Data Elements")
```
We've used the rule of thumb that points with a leverage greater than $\frac{2p}{n}$, where $p$ is the number of model parameters and $n$ the number of observations, should be looked at.
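For reference, here is a minimal sketch of that rule applied directly to the fit; whether `appregr::checkleverage` uses exactly this computation is an assumption on our part.

```r
# Hat values measure leverage; flag points above 2p/n, where p is the
# number of fitted parameters and n the number of observations.
h <- hatvalues(lm.fit)
p <- length(coef(lm.fit))
n <- nobs(lm.fit)
h[h > 2 * p / n]  # leverages worth a closer look
```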
```r
results <- appregr::checkoutliers(lm.fit = lm.fit, df = df)
pander(results$residualsrange, caption = "Range of Studentized residuals")
pander(results$correctedtval, caption = "Bonferroni corrected t-value")
if (nrow(results$outliers) >= 1) {
  pander(results$outliers, caption = "Outliers")
}
```
Here we look for studentized residuals that fall outside the interval given by the Bonferroni-corrected t-values.
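A sketch of the standard calculation behind this check, at an overall 5% level; we assume `checkoutliers` reports the equivalent quantities.

```r
# Externally studentized residuals are t-distributed with
# n - p - 1 degrees of freedom under the model assumptions.
stud <- rstudent(lm.fit)
n <- nobs(lm.fit)
p <- length(coef(lm.fit))
tcrit <- qt(1 - 0.05 / (2 * n), df = n - p - 1)  # Bonferroni-corrected cutoff
stud[abs(stud) > tcrit]  # candidate outliers
```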
We plot the Cook's distances, and the residuals against leverage with level-set contours of Cook's distance overlaid.
```r
plot(lm.fit, which = 4)  # Cook's distance
plot(lm.fit, which = 5)  # residuals vs. leverage
```
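As a numeric complement to these plots, the distances themselves are easy to rank; the $4/n$ cutoff used below is a common screening heuristic, not something the package prescribes.

```r
# Rank observations by Cook's distance and screen with the 4/n heuristic
cd <- cooks.distance(lm.fit)
head(sort(cd, decreasing = TRUE))
which(cd > 4 / nobs(lm.fit))
```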
Next we plot the residuals against each predictor, adding a loess smooth to expose any non-linear structure.

```r
library(ggplot2)
predictors <- names(lm.fit$coefficients)
predictors <- predictors[2:length(predictors)]  # drop the intercept
for (predictor in predictors) {
  gg <- ggplot(data.frame(x = df[, predictor], y = residuals(lm.fit)),
               aes(x = x, y = y)) +
    geom_point() +
    geom_smooth(method = "loess", se = FALSE) +
    labs(title = paste0(predictor, " versus residuals"),
         subtitle = "residual plot",
         x = predictor,
         y = "residuals",
         caption = modelname)
  print(gg)
}
```
Partial regression plots - also referred to as added-variable plots - attempt to show the effect of adding a variable to a model that already has one or more predictors. For univariate linear regression, a plot of $y$ against $x$ gives a good idea of the relationship between the predictor and the response, but it is hard to visualize the individual predictor-response relationships when there are more than one or two predictors. Partial regression plots attempt to do this.
Let the full model be $y = \beta_0 + \beta_1 x_1 + \ldots + \beta_n x_n + \epsilon$. To form the partial regression for $x_i$, we regress $y$ on all predictors except $x_i$ and calculate the residuals $\epsilon_{y \sim x_{[j \neq i]}}$; we plot those against the residuals $\epsilon_{x_i \sim x_{[j \neq i]}}$ of the regression $x_i \sim x_{[j \neq i]}$.
The plot of $\epsilon_{y \sim x_{[j \neq i]}} \sim \epsilon_{x_i \sim x_{[j \neq i]}}$ has some nice properties: the least squares fit through it is a line with intercept zero and slope equal to the coefficient $\beta_i$, and its residuals are the same as the residuals of the full model.
Many kinds of violations of the linear model assumptions show up in a partial regression plot, such as high-leverage points, non-linearity, and non-constant variance.
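To make the construction concrete, here is a hand-rolled sketch for a single predictor; `appregr::partialregression` below does this for every predictor. The sketch assumes the model formula has the usual `response ~ predictors` shape and that the predictor names match columns of `df`, as elsewhere in this vignette.

```r
# Partial regression for one predictor, built by hand: regress y and x_i
# on the remaining predictors and keep both residual vectors.
yname  <- all.vars(formula(lm.fit))[1]    # response variable name
xi     <- names(coef(lm.fit))[2]          # first predictor, for illustration
others <- setdiff(names(coef(lm.fit))[-1], xi)
e_y <- residuals(lm(reformulate(others, response = yname), data = df))
e_x <- residuals(lm(reformulate(others, response = xi),    data = df))
# The least squares slope through the residual cloud recovers the
# full-model coefficient for xi (up to numerical noise).
coef(lm(e_y ~ e_x))[2]
coef(lm.fit)[xi]
```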
```r
prr <- appregr::partialregression(lm.fit = lm.fit, df = df)
predictors <- names(prr)
for (predictor in predictors) {
  responseresiduals  <- prr[[predictor]]$responseresiduals
  covariateresiduals <- prr[[predictor]]$covariateresiduals
  plot(covariateresiduals, responseresiduals,
       xlab = paste0(predictor, " residuals"),
       ylab = "response residuals",
       main = paste0("Partial regression plot for ", predictor))
}
```