In tahmeed14/tureen625: Ordinary Least Squares

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Relevant Libraries

library(tureen625)

Requirements for Submission:

Comparison(s) against the original R functions on simulated or real datasets
Correction (e.g. via all.equal())
Efficiency (e.g. bench::mark())

Overview
Fit the Linear Regression Model
Analyze the Results
Correction Analysis
Efficiency Analysis
Analysis on Larger Dataset

Fit the Linear Regression Model

For this tutorial, we will fit a multi-variable linear regression model with some continous predictor variables and one categorical predictor variable. The data we use for demonstration is the mtcars dataset which is provided by the datasets library (More documentation on this dataset can be viewed HERE).

The regression model we will fit is shown below:

$$mpg_i = \beta_0 + \beta_1cyl_i + \beta_2hp_i + \beta_3wt_i + \beta_4vs_i$$ where

mpg : Miles/gallon
cyl : Number of cylinders
hp : Horsepower
wt : Weight (1000 lbs)
vs : Engine (0 = V-shape, 1 = manual)

Setting Up Data

# Load up mtcars dataset
data("mtcars")
dim(mtcars)

For our Ordinary Least Squares (OLS) function, we need the inputs of the outcome and independent variables to be separated and in matrix format. We demonstrate this process in the following R chunk:

# Create design matrix
# We need to add a vector of 1's for the intercept
X_matrix <- cbind(1, mtcars$cyl, mtcars$hp, mtcars$wt, mtcars$vs)

Y_vec <- as.vector(mtcars$mpg)

NOTE: You may want to fit a regression model that incorporates a categorical variable that has more than 2 classes. In that case, we recommend using the model.matrix() function to easily create a design matrix. This function will convert a matrix with factor variables into a dummy encoded (reference cell encoding) matrix (see ?model.matrix()) for additional help.

For example:

# Make sure to use as.factor() if your variable isn't in that type/class format
X_matrix2 <- stats::model.matrix(~cyl + hp + wt + as.factor(vs), data = mtcars)
X_matrix2 <- as.matrix(X_matrix2)

# Check if all 160 entries are equal in the two design matrices
all.equal(sum(X_matrix ==  X_matrix2), 160)

Fitting the Model

We input our data variables into our function fit_OLS() to fit our OLS model

OLS.fit <- tureen625::fit_OLS(design_X = X_matrix, Y = Y_vec)

names(OLS.fit)

Analyze & Visualize the Results

In this section, we show how we can use the returned object from our function to do some analyses of our results. First, we will observe our $\hat{\beta}$ estimates and the standard errors associated with the estimates $SE(\hat{\beta})$

OLS.fit$param_estimates

OLS.fit$param_StdErrors

Note that the order of the values in this vector/matric objects corresponds to the same order in which the variables are arranged in the design matrix $X$. Thus, the first value from the two outputs above corresponds to the Intercept in the regression model and the 2nd value above corresponds to the cyl variable.

# View confidence intervals
OLS.fit$conf_int

We can see that only the first and fourth $\hat{\beta}$ estimates have confidence interval intervals that do not include a zero. Therefore, we can make the claim that the weight (wt) of a car has a significant negative effect on the outcome of interest (mpg) by a magnitude of approximately -3.18 miles per gallon.

You can also view all of these results by simplying running the following code

round(OLS.fit$results, 3)

library(dotwhisker)

# Visualize the confidence interval
# We need to put our dataset in a dataframe with appropriate names
# Note: You have to use the following names for the columns so dwplot() will plot
format_data <- data.frame(cbind(term = c("cyl", "hp", "wt", "vs"),
                     estimate = round(OLS.fit$param_estimates[-1], 3),
                     conf.low = round(OLS.fit$conf_int[-1,1], 3),
                     conf.high = round(OLS.fit$conf_int[-1,2], 3)),
                     model = "OLS")


dotwhisker::dwplot(x = format_data, dodge_size = 0.6) + 
    geom_vline(xintercept = 0, color = "black") + xlim(c(-5.0, 5.0))

Correction Analysis

We compare our results with the commonly used function in R for fitting linear regression models called lm(). First, we compare the parameter estimates then the standard errors. Finally, we compare if we can draw the same inferences from both functions.

Compare $\hat{\beta}$ estimates

# Fit the model using lm()
lm.fit <- lm(formula = "mpg ~ cyl + hp + wt + vs", data = mtcars)

all.equal(as.vector(lm.fit$coefficients), as.vector(OLS.fit$param_estimates))

Our implemented R function has the same $\beta$ esimates as the lm() function

Compare $SE(\hat{\beta})$

stdErr.lm <- summary(lm.fit)$coefficients[,"Std. Error"]

all.equal(as.vector(stdErr.lm), OLS.fit$param_StdErrors)

Our implemented R function has the same $SE(\hat{\beta})$ values as the lm() function

Compare residuals

all.equal(as.vector(OLS.fit$residuals), as.vector(lm.fit$residuals))

Our implemented R function has the same residual values as the lm() function

Efficiency Analysis

Now, we compare the speed of the two functions by using the microbenchmark package.

library(microbenchmark)

microbenchmark::microbenchmark(
    lm(formula = "mpg ~ cyl + hp + wt + vs", data = mtcars),
    tureen625::fit_OLS(design_X = X_matrix, Y = Y_vec)
)

We see that our implemented function runs faster than the lm(), on average. We believe this is because our function provides estimates for the $\hat{\beta}$, the standard errors, residuals, and the confidence intervals. See below for a couple more features.

names(OLS.fit)

The lm() function provides a lot of other features such as the t-test statistic, p-values, $R^2$ value, Adjusted $R^2$. So, there must be a lot of other computations that are ongoing.

Analysis on Larger Dataset

We simulate a larger dataset and compare the speed to see if we get different results

set.seed(12)

sim.B <- rnorm(n = 5, mean = 0, sd = 10) # simulated betas

c1 <- rnorm(10000, mean = 15, sd = 2) # continuous var
c2 <- rnorm(10000, mean = 2, sd = 3) # continuous var
c3 <- rbinom(10000, size = c(0,1), prob = 0.75) # binary indicator
c4 <- rpois(n = 10000, lambda = 2) # count var

sim.X <- cbind(1, c1, c2, c3, c4)

sim.Y <- as.vector(sim.X %*% sim.B + rnorm(10000, 0,3)) # simulate Y's

summary(sim.Y)

sim.data <- data.frame(cbind(Y = sim.Y, sim.X[,-1])) # save a dataframe version for lm()
colnames(sim.data)

Correctness

lm.fit <- lm(formula = "Y ~ c1 + c2 + c3 + c4", data = sim.data)
OLS.fit <- tureen625::fit_OLS(design_X = sim.X, Y = sim.Y)

stdErr.lm <- summary(lm.fit)$coefficients[,"Std. Error"]

all.equal(as.vector(lm.fit$coefficients), as.vector(OLS.fit$param_estimates))
all.equal(as.vector(stdErr.lm), as.vector(OLS.fit$param_StdErrors))
all.equal(as.vector(OLS.fit$residuals), as.vector(lm.fit$residuals))

Results are the same when it comes to parameter estimation, std. error calculation, and residuals

Efficiency

microbenchmark::microbenchmark(
    lm(formula = "Y ~ c1 + c2 + c3 + c4", data = sim.data),
    tureen625::fit_OLS(design_X = sim.X, Y = sim.Y)
)

With a dataset that has 10000 observations, we note that both of the functions are very similar in terms of computational speed. Our function is a bit faster, but the difference is ignorable
If you want to try out larger datasets, simply simulate data using the above code as a skeleton and just increase the size from 10000 to something larger
You can also use other datasets from the datasets library to make comparisons if you really want to

tahmeed14/tureen625 documentation built on Dec. 2, 2019, 4:52 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

tahmeed14/tureen625
Ordinary Least Squares

In tahmeed14/tureen625: Ordinary Least Squares

Relevant Libraries

Table of Contents

Fit the Linear Regression Model

Setting Up Data

Fitting the Model

Analyze & Visualize the Results

Correction Analysis

Compare $\hat{\beta}$ estimates

Compare $SE(\hat{\beta})$

Compare residuals

Efficiency Analysis

Analysis on Larger Dataset

Correctness

Efficiency

R Package Documentation

Browse R Packages

We want your feedback!

tahmeed14/tureen625 Ordinary Least Squares

In tahmeed14/tureen625: Ordinary Least Squares

Relevant Libraries

Table of Contents

Fit the Linear Regression Model

Setting Up Data

Fitting the Model

Analyze & Visualize the Results

Correction Analysis

Compare $\hat{\beta}$ estimates

Compare $SE(\hat{\beta})$

Compare residuals

Efficiency Analysis

Analysis on Larger Dataset

Correctness

Efficiency

R Package Documentation

Browse R Packages

We want your feedback!

tahmeed14/tureen625
Ordinary Least Squares