knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Relevant Libraries

library(tureen625)



Fit the Linear Regression Model

For this tutorial, we will fit a multivariable linear regression model with several continuous predictor variables and one categorical predictor variable. The data we use for demonstration is the mtcars dataset, which is provided by the datasets package (see ?mtcars for more documentation).

The regression model we will fit is shown below:

$$mpg_i = \beta_0 + \beta_1 cyl_i + \beta_2 hp_i + \beta_3 wt_i + \beta_4 vs_i + \epsilon_i$$

where $mpg_i$ is miles per gallon, $cyl_i$ is the number of cylinders, $hp_i$ is gross horsepower, $wt_i$ is weight (in 1000 lbs), $vs_i$ is an engine-shape indicator (0 = V-shaped, 1 = straight), and $\epsilon_i$ is the error term for the $i$-th car.

Setting Up Data

# Load up mtcars dataset
data("mtcars")
dim(mtcars)

For our Ordinary Least Squares (OLS) function, the outcome and the independent variables need to be supplied separately and in matrix form. We demonstrate this process in the following R chunk:

# Create design matrix
# We need to add a vector of 1's for the intercept
X_matrix <- cbind(1, mtcars$cyl, mtcars$hp, mtcars$wt, mtcars$vs)

Y_vec <- as.vector(mtcars$mpg)
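
The matrix set-up matters because the OLS estimates and their standard errors are computed directly from $X$ and $Y$. As a refresher (and as an assumption about what fit_OLS() does internally, since we only call the exported function here), the closed-form solution is

$$\hat{\beta} = (X^\top X)^{-1} X^\top Y, \qquad \widehat{Var}(\hat{\beta}) = \hat{\sigma}^2 (X^\top X)^{-1},$$

where $\hat{\sigma}^2 = \frac{(Y - X\hat{\beta})^\top (Y - X\hat{\beta})}{n - p}$, $n$ is the number of observations, and $p$ is the number of columns in $X$.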

NOTE: You may want to fit a regression model that incorporates a categorical variable with more than two levels. In that case, we recommend using the model.matrix() function to easily create a design matrix. This function converts factor variables into dummy-encoded (reference cell coding) columns (see ?model.matrix for additional help).

For example:

# Make sure to use as.factor() if your variable isn't in that type/class format
X_matrix2 <- stats::model.matrix(~cyl + hp + wt + as.factor(vs), data = mtcars)
X_matrix2 <- as.matrix(X_matrix2)

# Check if all 160 entries are equal in the two design matrices
all.equal(sum(X_matrix ==  X_matrix2), 160)
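
As a hypothetical illustration of the multi-level case, we can treat cyl (which takes the values 4, 6, and 8 in mtcars) as a factor; model.matrix() then creates one dummy column per non-reference level:

# as.factor(cyl) has three levels (4, 6, 8), so model.matrix() creates
# two dummy columns and uses 4 cylinders as the reference level
X_matrix3 <- stats::model.matrix(~as.factor(cyl) + hp + wt + vs, data = mtcars)
head(X_matrix3)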

Fitting the Model

We pass our data objects to our function fit_OLS() to fit the OLS model:

OLS.fit <- tureen625::fit_OLS(design_X = X_matrix, Y = Y_vec)

names(OLS.fit)

Analyze & Visualize the Results

In this section, we show how the object returned by our function can be used to analyze the results. First, we examine the $\hat{\beta}$ estimates and their associated standard errors, $SE(\hat{\beta})$:

OLS.fit$param_estimates
OLS.fit$param_StdErrors

Note that the order of the values in these vector/matrix objects corresponds to the order in which the variables are arranged in the design matrix $X$. Thus, the first value from the two outputs above corresponds to the intercept in the regression model, and the second value corresponds to the cyl variable.
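
If you would like labeled output, one option is to attach the variable names yourself (a small convenience step, not something the package necessarily does for you; we assume param_estimates can be coerced to a plain numeric vector):

# Attach readable names to the estimates; the order matches the design matrix columns
coef_names <- c("(Intercept)", "cyl", "hp", "wt", "vs")
setNames(round(as.vector(OLS.fit$param_estimates), 3), coef_names)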

# View confidence intervals
OLS.fit$conf_int

We can see that only the first and fourth $\hat{\beta}$ estimates have confidence intervals that do not contain zero. Therefore, we can claim that a car's weight (wt) has a significant negative effect on the outcome of interest (mpg), decreasing it by approximately 3.18 miles per gallon for each additional unit of weight (1000 lbs).
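
If you want to see how such an interval arises, it can be reconstructed by hand from the estimates and standard errors. This sketch assumes the returned intervals are the usual 95% t-based intervals of the form estimate ± critical value × SE; if fit_OLS() uses a normal critical value instead, the numbers will differ slightly:

# Reconstruct 95% confidence intervals by hand (assumes t-based intervals
# with n - p degrees of freedom)
crit <- qt(0.975, df = nrow(X_matrix) - ncol(X_matrix))
cbind(lower = as.vector(OLS.fit$param_estimates) - crit * as.vector(OLS.fit$param_StdErrors),
      upper = as.vector(OLS.fit$param_estimates) + crit * as.vector(OLS.fit$param_StdErrors))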

You can also view all of these results by simply running the following code:

round(OLS.fit$results, 3)
library(dotwhisker)
library(ggplot2) # for geom_vline() and xlim() used below

# Visualize the confidence interval
# We need to put our dataset in a dataframe with appropriate names
# Note: You have to use the following names for the columns so dwplot() will plot
# Build the data frame directly so the numeric columns stay numeric
format_data <- data.frame(term = c("cyl", "hp", "wt", "vs"),
                          estimate = round(as.vector(OLS.fit$param_estimates)[-1], 3),
                          conf.low = round(OLS.fit$conf_int[-1, 1], 3),
                          conf.high = round(OLS.fit$conf_int[-1, 2], 3),
                          model = "OLS")


dotwhisker::dwplot(x = format_data, dodge_size = 0.6) + 
    geom_vline(xintercept = 0, color = "black") + xlim(c(-5.0, 5.0))

Correctness Analysis

We compare our results with lm(), the R function most commonly used for fitting linear regression models. First, we compare the parameter estimates, then the standard errors. Finally, we check whether we can draw the same inferences from both functions.

Compare $\hat{\beta}$ estimates

# Fit the model using lm()
lm.fit <- lm(formula = "mpg ~ cyl + hp + wt + vs", data = mtcars)
all.equal(as.vector(lm.fit$coefficients), as.vector(OLS.fit$param_estimates))

Compare $SE(\hat{\beta})$

stdErr.lm <- summary(lm.fit)$coefficients[,"Std. Error"]

all.equal(as.vector(stdErr.lm), as.vector(OLS.fit$param_StdErrors))

Compare residuals

all.equal(as.vector(OLS.fit$residuals), as.vector(lm.fit$residuals))
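
Since residuals are defined as $Y - X\hat{\beta}$, we can also back out the fitted values from our object and compare them to lm()'s fitted values (a consistency check, assuming fit_OLS() defines residuals in the standard way):

# Fitted values implied by our residuals should match lm()'s fitted values
all.equal(as.vector(Y_vec - OLS.fit$residuals), as.vector(lm.fit$fitted.values))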

Efficiency Analysis

Now, we compare the speed of the two functions by using the microbenchmark package.

library(microbenchmark)
microbenchmark::microbenchmark(
    lm(formula = "mpg ~ cyl + hp + wt + vs", data = mtcars),
    tureen625::fit_OLS(design_X = X_matrix, Y = Y_vec)
)

The lm() function provides many other features, such as t-test statistics, p-values, the $R^2$ value, and the adjusted $R^2$, so it performs additional computation beyond estimating the coefficients, which helps explain the difference in speed.
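
For example, several of these additional quantities can be pulled directly from the summary of the lm() fit:

# A few of the extra quantities computed by lm()/summary.lm()
lm.summary <- summary(lm.fit)
lm.summary$r.squared       # R^2
lm.summary$adj.r.squared   # adjusted R^2
lm.summary$coefficients    # estimates, standard errors, t statistics, p-values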

Analysis on Larger Dataset

We simulate a larger dataset and compare the speed to see whether we get different results.

set.seed(12)

sim.B <- rnorm(n = 5, mean = 0, sd = 10) # simulated betas

c1 <- rnorm(10000, mean = 15, sd = 2) # continuous var
c2 <- rnorm(10000, mean = 2, sd = 3) # continuous var
c3 <- rbinom(10000, size = 1, prob = 0.75) # binary indicator
c4 <- rpois(n = 10000, lambda = 2) # count var

sim.X <- cbind(1, c1, c2, c3, c4)

sim.Y <- as.vector(sim.X %*% sim.B + rnorm(10000, 0,3)) # simulate Y's

summary(sim.Y)

sim.data <- data.frame(cbind(Y = sim.Y, sim.X[,-1])) # save a dataframe version for lm()
colnames(sim.data)

Correctness

lm.fit <- lm(formula = "Y ~ c1 + c2 + c3 + c4", data = sim.data)
OLS.fit <- tureen625::fit_OLS(design_X = sim.X, Y = sim.Y)

stdErr.lm <- summary(lm.fit)$coefficients[,"Std. Error"]

all.equal(as.vector(lm.fit$coefficients), as.vector(OLS.fit$param_estimates))
all.equal(as.vector(stdErr.lm), as.vector(OLS.fit$param_StdErrors))
all.equal(as.vector(OLS.fit$residuals), as.vector(lm.fit$residuals))

Efficiency

microbenchmark::microbenchmark(
    lm(formula = "Y ~ c1 + c2 + c3 + c4", data = sim.data),
    tureen625::fit_OLS(design_X = sim.X, Y = sim.Y)
)
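
If you want a more stable comparison, microbenchmark() also lets you increase the number of iterations and fix the reporting unit via its times and unit arguments (the values below are arbitrary choices for illustration):

# Run more iterations and report results in milliseconds
microbenchmark::microbenchmark(
    lm(formula = "Y ~ c1 + c2 + c3 + c4", data = sim.data),
    tureen625::fit_OLS(design_X = sim.X, Y = sim.Y),
    times = 200L,
    unit = "ms"
)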

