knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(linearRegression)

The linearRegression package described in this vignette performs a basic linear regression analysis. The function linearRegression is most similar to the function lm in the stats package, and the package's anova and summary functions perform almost exactly the same tasks as the anova and summary functions in the stats and base packages respectively. Throughout this tutorial, we will work through examples of analyses on simple linear regression models, multiple linear regression models, and multiple linear regression models containing interaction terms, and we will compare the functions in the linearRegression package to those in the stats and base packages using the bench package.

We begin with a simple linear regression model using the cars data set in R. This data set contains 50 observations on the speed of cars and the distances taken to stop, recorded in the 1920s. We will use the speed of the car as a predictor for the stopping distance.
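Before fitting the model, we can take a quick look at the data with base R's str function:

str(cars)

To fit the model, we use the following code: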

model_cars<-linearRegression(dist~speed, dat=cars)

Note that this mirrors the way a model is fit with the lm function in the stats package (the data argument here is named dat rather than data). To demonstrate the correctness of the linearRegression function, we will also fit the model using lm as follows:

model_cars_lm<-lm(dist~speed, data=cars)

Now model_cars is a list of length 7 containing the coefficients of the model, the residuals, the fitted values, the rank of the fitted model, the residual degrees of freedom, the model formula used, and the model frame used. One thing of potential interest is to see the coefficients and their significance. This can be achieved using the summary function. Among other things, the summary function in the linearRegression package returns a data frame containing each coefficient estimate along with its standard error, t-statistic, and p-value. This can be seen by using the following code:

linearRegression::summary(model_cars)$coefficients

From this we can see that the p-value for testing the null hypothesis of no association between the speed of the car and the stopping distance is very low and would be significant at any reasonable significance level. By applying the summary function from the base package to our model model_cars_lm, we can verify the correctness of our model:

base::summary(model_cars_lm)

Equivalently, we could use the anova function to perform the same test. The anova function in linearRegression takes a model returned by a linearRegression call and outputs a data frame containing the degrees of freedom, sum of squares, mean squares, F-statistic, and p-value for each predictor, as well as the error degrees of freedom, error sum of squares, and mean square error.

linearRegression::anova(model_cars)

Similarly, we could use the anova function in stats:

stats::anova(model_cars_lm)

We get the same p-value in all four calls. Furthermore, the function linearRegression gives output useful in other areas of the analysis, such as model diagnostics. Both the residuals and the fitted values are returned from a call to linearRegression, so we can assess the LINE assumptions (linearity, independence, normality, equal variances) using plotting tools in base R. Alternatively, a more advanced graphing package such as ggplot2 could be used; for the sake of simplicity we use base R in this tutorial. We can assess linearity and constant variance with a plot of the residuals vs. the fitted values, independence with a plot of the residuals vs. their index, and normality with a histogram of the residuals. These are as follows:

resids<-model_cars$residuals
fitted_values<-model_cars$fitted.values
plot(fitted_values, resids, main="Residuals vs. Fitted Values", xlab="Fitted Values", ylab="Residuals")
plot(1:length(resids), resids, main="Residuals vs. Index", xlab="Index", ylab="Residuals")
hist(resids, main="Histogram of Residuals", xlab="Residuals")
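
As an additional normality check, a normal Q-Q plot (using qqnorm and qqline from the stats package) gives a sharper view of any departures from normality than the histogram:

qqnorm(resids, main="Normal Q-Q Plot of Residuals")
qqline(resids)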

We can see that the residuals are slightly right-skewed, the observations are likely independent, and the variance may increase as the fitted values increase. Continuing on, we will further examine the summary function in linearRegression. We have already seen that it can be used to obtain p-values for each of the coefficients; it can also be used to assess model fit. The summary function additionally outputs R squared, adjusted R squared (useful when there are multiple predictors), and the model F-statistic and its p-value. The model F-statistic and p-value can be used to test the null hypothesis that all regression coefficients are 0. For our cars model, these are as follows:

summary_model_cars<-linearRegression::summary(model_cars)
summary_model_cars$R.squared
summary_model_cars$adj.R.squared
summary_model_cars$F.stat
summary_model_cars$model.p.value

Comparing this output to the output of the base R summary function above, we see that we get the same answers. Finally, we can use the output from the linearRegression function for prediction and compare it to predictions from the model fit using lm.

# Build a prediction function from the linearRegression coefficients
b0_cars<-model_cars$coefficients[1]
b1_cars<-model_cars$coefficients[2]
model_predict<-function(X){
  return(b0_cars+b1_cars*X)
}

# Build the analogous prediction function from the lm coefficients
b0_cars_lm<-model_cars_lm$coefficients[1]
b1_cars_lm<-model_cars_lm$coefficients[2]
model_predict_lm<-function(X){
  return(b0_cars_lm+b1_cars_lm*X)
}

# Compare the two prediction functions at 100 random points
X<-rnorm(100)

all.equal(model_predict(X), model_predict_lm(X))
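
As a quick usage example (the speed value of 21 mph is arbitrary, chosen only because it lies within the observed range of the data), the fitted model predicts the following stopping distance:

model_predict(21)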

There are various other output items from the functions linearRegression and summary that may be of interest to package users. Both functions return the formula used in the model, which can be accessed in the return list under the name formula. linearRegression also returns the rank of the design matrix and the data frame containing the model data, under the names rank and model. summary additionally returns the model residuals and the residual standard error, under the names residuals and residual.se.
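
For example, using the component names just described:

names(model_cars)
model_cars$formula
model_cars$rank
summary_model_cars$residual.se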

We will now move on to talk about multiple linear regression and multiple linear regression with interaction terms. To do this, we can consider the mtcars data set in R. We will use the number of cylinders, displacement, gross horsepower, and weight as predictors for the miles per gallon.

fit_mlr<-linearRegression::linearRegression(mpg~cyl+disp+hp+wt, dat=mtcars)

All of the functionality that has been previously described for simple linear regression is also available for multiple linear regression. For example, we can look at the estimated regression coefficients as well as their t-statistics using the summary function.

linearRegression::summary(fit_mlr)$coefficients

Fitting the same model using the lm function, we can see that we get the same results.

fit_mlr_lm<-lm(mpg~cyl+disp+hp+wt, data=mtcars)
base::summary(fit_mlr_lm)

We can also include interaction terms in our model. To include an interaction between variables x and y, the term x:y should be included in the formula in the linearRegression call. To see this, let us consider the above model with some interaction terms included:

fit_mlr_int<-linearRegression::linearRegression(mpg~cyl+disp+hp+wt+cyl:disp+disp:wt+cyl:disp:hp, dat=mtcars)
linearRegression::summary(fit_mlr_int)$coefficients

Comparing to the same call in lm:

fit_mlr_int_lm<-lm(mpg~cyl+disp+hp+wt+cyl:disp+disp:wt+cyl:disp:hp, data=mtcars)
base::summary(fit_mlr_int_lm)

Again we can see the correctness of the function. Before moving on to benchmarking, it should be noted that the linearRegression function cannot take a factor as input, and no formula formats other than those shown in this tutorial are accepted. If a factor is needed in your analysis, you must create your own dummy variables to include in the call to linearRegression.
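
As a minimal sketch of what this looks like, suppose we wanted to treat the number of cylinders in mtcars as categorical (it takes the values 4, 6, and 8). The indicator names cyl6 and cyl8 below are our own, with 4 cylinders as the baseline level:

# Copy mtcars and hand-code 0/1 indicator variables for cyl,
# leaving cyl==4 as the baseline
mtcars_dummy<-mtcars
mtcars_dummy$cyl6<-as.numeric(mtcars_dummy$cyl==6)
mtcars_dummy$cyl8<-as.numeric(mtcars_dummy$cyl==8)

fit_dummy<-linearRegression::linearRegression(mpg~cyl6+cyl8+wt, dat=mtcars_dummy)
linearRegression::summary(fit_dummy)$coefficients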

We will now move on to benchmarking the function linearRegression against lm using the bench package. We will use the multiple linear regression model fit previously to do this.

bench::mark(linearRegression::linearRegression(mpg~cyl+disp+hp+wt, dat=mtcars), lm(mpg~cyl+disp+hp+wt, data=mtcars), check=FALSE)[,3]

This shows that the median run time for linearRegression is roughly three times that of lm on the mtcars data set. This is a small data set, and linearRegression performs somewhat worse relative to lm as the input size increases.

y<-rnorm(1000)
x<-rnorm(1000)

bench::mark(linearRegression::linearRegression(y~x), lm(y~x), check=FALSE)[,3]

Here linearRegression is about 20 times slower than lm. This is, however, not terrible, as lm is one of the most heavily used functions in R and is optimized accordingly. We can also benchmark anova and summary against their stats and base package counterparts.

fit_bm_linReg<-linearRegression::linearRegression(y~x)
fit_bm_lm<-lm(y~x)

bench::mark(linearRegression::anova(fit_bm_linReg), stats::anova(fit_bm_lm), check=FALSE)[,3]

bench::mark(linearRegression::summary(fit_bm_linReg), base::summary(fit_bm_lm), check=FALSE)[,3]

In this case, our anova function actually performs marginally better than the one included in stats, while our summary function is approximately 3 times slower. Note that the comparisons against lm and summary are not entirely fair: the lm function from stats and the summary function from base return more information than our implementations, and hence do more work. A comparison of strictly equivalent work would likely show our functions performing somewhat worse relative to lm and stats than shown here. Since the format in which we return our results differs from that of lm, we had to set check=FALSE above. To see the correctness of our functions more clearly, we set up the following loop. At each iteration, the for loop randomly samples data from the normal distribution and fits linear regression models based on it. We use all.equal to compare the results from linearRegression to the results from lm; because all.equal returns a character vector rather than FALSE when its arguments differ, each comparison is wrapped in isTRUE so that the checks can be summed.

for(i in 1:100){
  y<-rnorm(100)
  x1<-rnorm(100)
  x2<-rnorm(100)

  fit_linReg<-linearRegression::linearRegression(y~x1+x2)
  fit_lm<-lm(y~x1+x2)

  # isTRUE is needed because all.equal returns a character vector,
  # not FALSE, when its arguments differ
  check1<-isTRUE(all.equal(as.numeric(fit_linReg$coefficients), as.numeric(fit_lm$coefficients)))
  check2<-isTRUE(all.equal(as.numeric(fit_linReg$residuals), as.numeric(fit_lm$residuals)))
  check3<-isTRUE(all.equal(fit_linReg$rank, fit_lm$rank))
  check4<-isTRUE(all.equal(as.numeric(fit_linReg$fitted.values), as.numeric(fit_lm$fitted.values)))
  check5<-isTRUE(all.equal(fit_linReg$df.residual, fit_lm$df.residual))

  anova_fit_linReg<-linearRegression::anova(fit_linReg)
  anova_fit_lm<-stats::anova(fit_lm)

  check6<-isTRUE(all.equal(as.numeric(anova_fit_linReg[[1]]), as.numeric(anova_fit_lm[[1]])))
  check7<-isTRUE(all.equal(as.numeric(anova_fit_linReg[[2]]), as.numeric(anova_fit_lm[[2]])))
  check8<-isTRUE(all.equal(as.numeric(anova_fit_linReg[[3]][1:(length(anova_fit_linReg[[3]])-2)]), as.numeric(anova_fit_lm[[3]][1:(length(anova_fit_lm[[3]])-2)])))
  check9<-isTRUE(all.equal(as.numeric(anova_fit_linReg[[4]][1:(length(anova_fit_linReg[[4]])-2)]), as.numeric(anova_fit_lm[[4]][1:(length(anova_fit_lm[[4]])-2)])))

  summary_fit_linReg<-linearRegression::summary(fit_linReg)
  summary_fit_lm<-base::summary(fit_lm)

  check10<-isTRUE(all.equal(summary_fit_linReg$R.squared, summary_fit_lm$r.squared))
  check11<-isTRUE(all.equal(summary_fit_linReg$adj.R.squared, summary_fit_lm$adj.r.squared))
  check12<-isTRUE(all.equal(as.vector(summary_fit_linReg$coefficients[,2]), as.vector(summary_fit_lm$coefficients[,2])))
  check13<-isTRUE(all.equal(as.vector(summary_fit_linReg$coefficients[,3]), as.vector(summary_fit_lm$coefficients[,3])))
  check14<-isTRUE(all.equal(as.vector(summary_fit_linReg$coefficients[,4]), as.vector(summary_fit_lm$coefficients[,4])))
  check15<-isTRUE(all.equal(summary_fit_linReg$F.stat, as.numeric(summary_fit_lm$fstatistic[1])))
  check16<-isTRUE(all.equal(summary_fit_linReg$df, as.vector(summary_fit_lm$fstatistic[2:3])))

  if(check1+check2+check3+check4+check5+check6+check7+check8+check9+check10+check11+check12+check13+check14+check15+check16<16){
    print("Not all equal")
  }
}

The above loop checks all parts of the implemented functions for correctness against the output of the base/stats package lm, anova, and summary functions. "Not all equal" was never printed, meaning our functions were correct on every run. The following for loop performs the same checks as the previous one, but this time on a model with an interaction term.

for(i in 1:100){
  y<-rnorm(100)
  x1<-rnorm(100)
  x2<-rnorm(100)
  x3<-rnorm(100)

  fit_linReg<-linearRegression::linearRegression(y~x1+x2+x3+x1:x2:x3)
  fit_lm<-lm(y~x1+x2+x3+x1:x2:x3)

  # As before, isTRUE guards against all.equal returning a character vector
  check1<-isTRUE(all.equal(as.numeric(fit_linReg$coefficients), as.numeric(fit_lm$coefficients)))
  check2<-isTRUE(all.equal(as.numeric(fit_linReg$residuals), as.numeric(fit_lm$residuals)))
  check3<-isTRUE(all.equal(fit_linReg$rank, fit_lm$rank))
  check4<-isTRUE(all.equal(as.numeric(fit_linReg$fitted.values), as.numeric(fit_lm$fitted.values)))
  check5<-isTRUE(all.equal(fit_linReg$df.residual, fit_lm$df.residual))

  anova_fit_linReg<-linearRegression::anova(fit_linReg)
  anova_fit_lm<-stats::anova(fit_lm)

  check6<-isTRUE(all.equal(as.numeric(anova_fit_linReg[[1]]), as.numeric(anova_fit_lm[[1]])))
  check7<-isTRUE(all.equal(as.numeric(anova_fit_linReg[[2]]), as.numeric(anova_fit_lm[[2]])))
  check8<-isTRUE(all.equal(as.numeric(anova_fit_linReg[[3]][1:(length(anova_fit_linReg[[3]])-2)]), as.numeric(anova_fit_lm[[3]][1:(length(anova_fit_lm[[3]])-2)])))
  check9<-isTRUE(all.equal(as.numeric(anova_fit_linReg[[4]][1:(length(anova_fit_linReg[[4]])-2)]), as.numeric(anova_fit_lm[[4]][1:(length(anova_fit_lm[[4]])-2)])))

  summary_fit_linReg<-linearRegression::summary(fit_linReg)
  summary_fit_lm<-base::summary(fit_lm)

  check10<-isTRUE(all.equal(summary_fit_linReg$R.squared, summary_fit_lm$r.squared))
  check11<-isTRUE(all.equal(summary_fit_linReg$adj.R.squared, summary_fit_lm$adj.r.squared))
  check12<-isTRUE(all.equal(as.vector(summary_fit_linReg$coefficients[,2]), as.vector(summary_fit_lm$coefficients[,2])))
  check13<-isTRUE(all.equal(as.vector(summary_fit_linReg$coefficients[,3]), as.vector(summary_fit_lm$coefficients[,3])))
  check14<-isTRUE(all.equal(as.vector(summary_fit_linReg$coefficients[,4]), as.vector(summary_fit_lm$coefficients[,4])))
  check15<-isTRUE(all.equal(summary_fit_linReg$F.stat, as.numeric(summary_fit_lm$fstatistic[1])))
  check16<-isTRUE(all.equal(summary_fit_linReg$df, as.vector(summary_fit_lm$fstatistic[2:3])))

  if(check1+check2+check3+check4+check5+check6+check7+check8+check9+check10+check11+check12+check13+check14+check15+check16<16){
    print("Not all equal")
  }
}

Again, "Not all equal" was not printed, hence each output from the linearRegression package was correct. These same 16 checks and more were included during unit testing as can be found in the tests directory. As mentioned in the README.md, the unit tets were always correct and they covered 100% of the code in the package. We hence can conclude this tutorial by stating that the linearRegression package is a correct and efficient package capable of performing a basic linear regression analysis which can be used for inference and prediction.


