knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(tureen625)
Requirements for Submission:
For this tutorial, we will fit a multi-variable linear regression model with some continous predictor variables and one categorical predictor variable. The data we use for demonstration is the mtcars
dataset which is provided by the datasets
library (More documentation on this dataset can be viewed HERE).
The regression model we will fit is shown below:
$$mpg_i = \beta_0 + \beta_1cyl_i + \beta_2hp_i + \beta_3wt_i + \beta_4vs_i$$ where
mpg
: Miles/galloncyl
: Number of cylindershp
: Horsepowerwt
: Weight (1000 lbs)vs
: Engine (0 = V-shape, 1 = manual)# Load up mtcars dataset data("mtcars") dim(mtcars)
For our Ordinary Least Squares (OLS) function, we need the inputs of the outcome and independent variables to be separated and in matrix format. We demonstrate this process in the following R
chunk:
# Create design matrix # We need to add a vector of 1's for the intercept X_matrix <- cbind(1, mtcars$cyl, mtcars$hp, mtcars$wt, mtcars$vs) Y_vec <- as.vector(mtcars$mpg)
NOTE: You may want to fit a regression model that incorporates a categorical variable that has more than 2 classes. In that case, we recommend using the model.matrix()
function to easily create a design matrix. This function will convert a matrix with factor variables into a dummy encoded (reference cell encoding) matrix (see ?model.matrix()
) for additional help.
For example:
# Make sure to use as.factor() if your variable isn't in that type/class format X_matrix2 <- stats::model.matrix(~cyl + hp + wt + as.factor(vs), data = mtcars) X_matrix2 <- as.matrix(X_matrix2) # Check if all 160 entries are equal in the two design matrices all.equal(sum(X_matrix == X_matrix2), 160)
We input our data variables into our function fit_OLS()
to fit our OLS model
OLS.fit <- tureen625::fit_OLS(design_X = X_matrix, Y = Y_vec) names(OLS.fit)
In this section, we show how we can use the returned object from our function to do some analyses of our results. First, we will observe our $\hat{\beta}$ estimates and the standard errors associated with the estimates $SE(\hat{\beta})$
OLS.fit$param_estimates
OLS.fit$param_StdErrors
Note that the order of the values in this vector/matric objects corresponds to the same order in which the variables are arranged in the design matrix $X$. Thus, the first value from the two outputs above corresponds to the Intercept in the regression model and the 2nd value above corresponds to the cyl
variable.
# View confidence intervals OLS.fit$conf_int
We can see that only the first and fourth $\hat{\beta}$ estimates have confidence interval intervals that do not include a zero. Therefore, we can make the claim that the weight (wt
) of a car has a significant negative effect on the outcome of interest (mpg
) by a magnitude of approximately -3.18 miles per gallon.
You can also view all of these results by simplying running the following code
round(OLS.fit$results, 3)
library(dotwhisker) # Visualize the confidence interval # We need to put our dataset in a dataframe with appropriate names # Note: You have to use the following names for the columns so dwplot() will plot format_data <- data.frame(cbind(term = c("cyl", "hp", "wt", "vs"), estimate = round(OLS.fit$param_estimates[-1], 3), conf.low = round(OLS.fit$conf_int[-1,1], 3), conf.high = round(OLS.fit$conf_int[-1,2], 3)), model = "OLS") dotwhisker::dwplot(x = format_data, dodge_size = 0.6) + geom_vline(xintercept = 0, color = "black") + xlim(c(-5.0, 5.0))
We compare our results with the commonly used function in R
for fitting linear regression models called lm()
. First, we compare the parameter estimates then the standard errors. Finally, we compare if we can draw the same inferences from both functions.
# Fit the model using lm() lm.fit <- lm(formula = "mpg ~ cyl + hp + wt + vs", data = mtcars)
all.equal(as.vector(lm.fit$coefficients), as.vector(OLS.fit$param_estimates))
R
function has the same $\beta$ esimates as the lm()
functionstdErr.lm <- summary(lm.fit)$coefficients[,"Std. Error"] all.equal(as.vector(stdErr.lm), OLS.fit$param_StdErrors)
R
function has the same $SE(\hat{\beta})$ values as the lm()
functionall.equal(as.vector(OLS.fit$residuals), as.vector(lm.fit$residuals))
R
function has the same residual values as the lm()
functionNow, we compare the speed of the two functions by using the microbenchmark
package.
library(microbenchmark)
microbenchmark::microbenchmark( lm(formula = "mpg ~ cyl + hp + wt + vs", data = mtcars), tureen625::fit_OLS(design_X = X_matrix, Y = Y_vec) )
lm()
, on average. We believe this is because our function provides estimates for the $\hat{\beta}$, the standard errors, residuals, and the confidence intervals. See below for a couple more features. names(OLS.fit)
The lm()
function provides a lot of other features such as the t-test statistic, p-values, $R^2$ value, Adjusted $R^2$. So, there must be a lot of other computations that are ongoing.
We simulate a larger dataset and compare the speed to see if we get different results
set.seed(12) sim.B <- rnorm(n = 5, mean = 0, sd = 10) # simulated betas c1 <- rnorm(10000, mean = 15, sd = 2) # continuous var c2 <- rnorm(10000, mean = 2, sd = 3) # continuous var c3 <- rbinom(10000, size = c(0,1), prob = 0.75) # binary indicator c4 <- rpois(n = 10000, lambda = 2) # count var sim.X <- cbind(1, c1, c2, c3, c4) sim.Y <- as.vector(sim.X %*% sim.B + rnorm(10000, 0,3)) # simulate Y's summary(sim.Y) sim.data <- data.frame(cbind(Y = sim.Y, sim.X[,-1])) # save a dataframe version for lm() colnames(sim.data)
lm.fit <- lm(formula = "Y ~ c1 + c2 + c3 + c4", data = sim.data) OLS.fit <- tureen625::fit_OLS(design_X = sim.X, Y = sim.Y) stdErr.lm <- summary(lm.fit)$coefficients[,"Std. Error"] all.equal(as.vector(lm.fit$coefficients), as.vector(OLS.fit$param_estimates)) all.equal(as.vector(stdErr.lm), as.vector(OLS.fit$param_StdErrors)) all.equal(as.vector(OLS.fit$residuals), as.vector(lm.fit$residuals))
microbenchmark::microbenchmark( lm(formula = "Y ~ c1 + c2 + c3 + c4", data = sim.data), tureen625::fit_OLS(design_X = sim.X, Y = sim.Y) )
With a dataset that has 10000 observations, we note that both of the functions are very similar in terms of computational speed. Our function is a bit faster, but the difference is ignorable
If you want to try out larger datasets, simply simulate data using the above code as a skeleton and just increase the size from 10000 to something larger
You can also use other datasets from the datasets
library to make comparisons if you really want to
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.