knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
## Run the command below to install the package first if it needs to be installed
# devtools::install_github("rguin26/HW4")
library(HW4)

'get_lin_least_sq_model'

Simple example on how to use the function get_lin_least_sq_model:

x <- sample(100, 30, replace=FALSE)
y <- sample(100, 30, replace=TRUE)

ols_result <- get_lin_least_sq_model(x, y)

The get_lin_least_sq_model function performs linear least squares, using either ordinary least squares (OLS) or weighted least squares (WLS). Also, both simple and linear regression can be performed. It returns a list of objects, including calculations of the estimates for the beta coefficients for each predictor, along with residuals and model evaluation metrics.

Other than the main data parameters, x and y, other parameters that the function takes in include intercept and weighted. intercept = TRUE by default, meaning that an intercept will be used for the calculation of the beta coefficients. The values of the beta coefficients vector correspond to their respective column in x, and start with (intercept) if intercept = TRUE. weighted = FALSE by default, as the function therefore performs OLS. If weighted = TRUE, then WLS is performed. The following shows the different options for the same data:

x <- sample(100, 30, replace=FALSE)
y <- sample(100, 30, replace=TRUE)

ols_result <- get_lin_least_sq_model(x, y)
wls_result <- get_lin_least_sq_model(x, y, weighted = TRUE)
ols_result_no_intercept <- get_lin_least_sq_model(x, y, intercept = FALSE)
wls_result_no_intercept <- get_lin_least_sq_model(x, y, intercept = FALSE, weighted = TRUE)

This function works with numeric data only, for both x and y. Otherwise, an error is thrown, such as in the following:

x <- c(8, "v", "f", 7, "o")
y <- c(8, 9, 8, 7, 5)
x <- c(1, 5, 7, 3, 9)
y <- c(8, "v", "f", 7, "o")

As mentioned before, the get_lin_least_sq_model function performs multiple linear regression as well. Below are some examples which use randomly selected numbers, and x can be either a matrix or dataframe.

x <- matrix(sample(100, 30*5, replace=TRUE), ncol = 5)
y <- sample(100, 30, replace=TRUE)

ols_result <- get_lin_least_sq_model(x, y)
wls_result <- get_lin_least_sq_model(x, y, weighted = TRUE)
ols_result_no_intercept <- get_lin_least_sq_model(x, y, intercept = FALSE)
wls_result_no_intercept <- get_lin_least_sq_model(x, y, intercept = FALSE, weighted = TRUE)
x <- data.frame(matrix(sample(1000, 30*5, replace=FALSE), ncol = 5))
y <- sample(100, 30, replace=TRUE)

ols_result <- get_lin_least_sq_model(x, y)
wls_result <- get_lin_least_sq_model(x, y, weighted = TRUE)
ols_result_no_intercept <- get_lin_least_sq_model(x, y, intercept = FALSE)
wls_result_no_intercept <- get_lin_least_sq_model(x, y, intercept = FALSE, weighted = TRUE)

To view beta coefficients (of any model):

ols_result$beta

To view beta fitted values (of any model):

ols_result_no_intercept$fitted_values

To view residuals (of any model):

wls_result_no_intercept$residuals

And finally, to view model evaluation metrics (of any model):

wls_result$model_eval_metrics

'lin_least_squares_train_test'

The lin_least_squares_train_test function performs similarly to the get_lin_least_sq_model function, only it splits the initial data into training and testing subsets. It randomly selects a specified proportion, 0.8 by default, of instances from x and y, of identical indexes, to use for training the linear least squares model, and then it uses the remaining instances of x and y for testing the model. This function also returns a list of objects, including calculations of the estimates for the beta coefficients for each predictor of the model, trained by the training subset, along with residuals and model evaluation metrics for both training and testing subsets.

Simple example on how to use the function lin_least_squares_train_test:

x <- data.frame(matrix(sample(100000, 200*5, replace=TRUE), ncol = 5))
y <- sample(100, 200, replace=TRUE)

model_stats_1 <- lin_least_squares_train_test(x, y)
model_stats_2 <- lin_least_squares_train_test(x, y, train_set_prop = 0.75)
model_stats_3 <- lin_least_squares_train_test(x, y, weighted = TRUE, train_set_prop = 0.75)

To view beta:

model_stats_1$beta

To view training fitted values:

model_stats_2$training_fitted_values

To view training residuals:

model_stats_2$training_residuals

To view training model evaluation metrics:

model_stats_3$training_model_eval_metrics

To view testing fitted values:

model_stats_1$testing_fitted_values

To view testing residuals:

model_stats_3$testing_residuals

To view testing model evaluation metrics:

model_stats_3$testing_model_eval_metrics

"Comparisons of features between the get_lin_least_sq_model function and default lm function"

Here are some comparisons between the results produced by the get_lin_least_sq_model function and default lm function. This time, the "mtcars" dataset will be used.

To start off, the data is prepared according to the following.

x <- mtcars[, c("cyl", "disp", "hp", "drat", "wt", "qsec", "gear", "carb")]
y <- mtcars$mpg

Now, let's create models/objects based on each case: OLS, WLS, OLS with no intercept, and WLS with no intercept.

OLS:

ols_result <- get_lin_least_sq_model(x, y)
model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb, data = mtcars)

WLS:

wls_result <- get_lin_least_sq_model(x, y, weighted = TRUE)

model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb, data = mtcars)
wts <- 1 / lm(abs(model$residuals) ~ model$fitted.values)$fitted.values^2
model_weighted <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb, data = mtcars, weights = wts)

OLS with no intercept:

ols_result_no_intercept <- get_lin_least_sq_model(x, y, intercept = FALSE)
model_no_intercept <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb - 1, data = mtcars)

WLS with no intercept:

wls_result_no_intercept <- get_lin_least_sq_model(x, y, intercept = FALSE, weighted = TRUE)
model_no_intercept <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb - 1, data = mtcars)
wts <- 1 / lm(abs(model_no_intercept$residuals) ~ model_no_intercept$fitted.values)$fitted.values^2
model_weighted_no_intercept <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb - 1, data = mtcars, weights = wts)

Viewing the beta coefficients.

OLS:

ols_result$beta
model$coefficients

WLS:

wls_result$beta
model_weighted$coefficients

OLS with no intercept:

ols_result_no_intercept$beta
model_no_intercept$coefficients

WLS with no intercept:

wls_result_no_intercept$beta
model_weighted_no_intercept$coefficients

As shown, the beta coefficients are just about equal to each other. However, if you compare them using the == operator, then it does not match the expectations.

ols_result$beta == model$coefficients
wls_result$beta == model_weighted$coefficients
ols_result_no_intercept$beta == model_no_intercept$coefficients
wls_result_no_intercept$beta == model_weighted_no_intercept$coefficients

The beta coefficients from both functions are quite similar, but not exactly equal. They are very similar in value, as seen in the outputs, so they are not significantly different. With a threshold of 0.00001, it is guaranteed that their differences are not significant.

abs(ols_result$beta == model$coefficients) < 0.00001
abs(wls_result$beta == model_weighted$coefficients) < 0.00001
abs(ols_result_no_intercept$beta == model_no_intercept$coefficients) < 0.00001
abs(wls_result_no_intercept$beta == model_weighted_no_intercept$coefficients) < 0.00001

Now, let's evaluate the fitted values and residuals for one of the specific cases instead of all 4. Let's do the WLS model.

WLS fitted values:

unname(wls_result$fitted_values)
unname(model_weighted$fitted.values)
# Direct equality
unname(wls_result$fitted_values == model_weighted$fitted.values)
# Precise comparison to the 0.00001 threshold
unname(abs(wls_result$fitted_values- model_weighted$fitted.values) < 0.00001)

WLS residuals:

unname(wls_result$residuals)
unname(model_weighted$residuals)
# Direct equality
unname(wls_result$residuals == model_weighted$residuals)
# Precise comparison to the 0.00001 threshold
unname(abs(wls_result$residuals - model_weighted$residuals) < 0.00001)

Once again, you notice that the direct equality fails between the results of each function, but are really precise, and their absolute differences are less than 0.00001, an appropriate threshold for minimizing errors. This is consistent for all the other cases.

Just for one more comparison, let's quickly view the error metrics of the WLS object, and compare the R-squared and adjusted R-squared values.

wls_result$model_eval_metrics
summary(model_weighted)

This is perhaps initially a little more difficult to compare, but it is more or less clear how the results produced by the summary of the model_weighted object, produced by the lm function, are quite precise to 4 decimal places.

Efficiency of get_lin_least_sq_model function and default lm function"

Now that we discussed the accuracies of the the different features between the get_lin_least_sq_model and the default lm function, including the beta coefficients, the fitted values, the residuals, plus the R-squared and adjusted R-squared values, let's evaluate how efficient the two different functions are.

The first thing to do is load the "bench" library.

## If `bench` library is not yet installed, run the command below to install it
# install.packages("bench")
library(bench)

Let's continue to work with the same x and y data from the "mtcars" dataset. To start, let's recreate the data just to be safe in case something was changed from before.

x <- mtcars[, c("cyl", "disp", "hp", "drat", "wt", "qsec", "gear", "carb")]
y <- mtcars$mpg

Let's create the models/objects with each function for every case again, and this time, let's time how long it takes to create the object and display the beta coefficients, fitted values, and residuals.

OLS:

# Timing of the default "lm" function
time1 <- bench::system_time({
  model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb, data = mtcars)
  model$coefficients
  unname(model$fitted.values)
  unname(model$residuals)
})

# Timing of the "get_lin_least_sq_model" function
time2 <- bench::system_time({
  ols_result <- get_lin_least_sq_model(x, y)
  ols_result$beta
  unname(ols_result$fitted_values)
  unname(ols_result$residuals)
})

# Printing the times of each function
time1
time2

# Verifying that the real timing of the "get_lin_least_sq_model" function is less than that of the default "lm" function
(time2 < time1)[2]

WLS:

# Timing of the default "lm" function
time1 <- bench::system_time({
  model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb, data = mtcars)
  wts <- 1 / lm(abs(model$residuals) ~ model$fitted.values)$fitted.values^2
  model_weighted <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb, data = mtcars, weights = wts)
  model_weighted$coefficients
  unname(model_weighted$fitted.values)
  unname(model_weighted$residuals)
})

# Timing of the "get_lin_least_sq_model" function
time2 <- bench::system_time({
  wls_result <- get_lin_least_sq_model(x, y, weighted = TRUE)
  wls_result$beta
  unname(wls_result$fitted_values)
  unname(wls_result$residuals)
})

# Printing the times of each function
time1
time2

# Verifying that the real timing of the "get_lin_least_sq_model" function is less than that of the default "lm" function
(time2 < time1)[2]

OLS with no intercept:

# Timing of the default "lm" function
time1 <- bench::system_time({
  model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb - 1, data = mtcars)
  model$coefficients
  unname(model$fitted.values)
  unname(model$residuals)
})

# Timing of the "get_lin_least_sq_model" function
time2 <- bench::system_time({
  ols_result <- get_lin_least_sq_model(x, y, intercept = FALSE)
  ols_result$beta
  unname(ols_result$fitted_values)
  unname(ols_result$residuals)
})

# Printing the times of each function
time1
time2

# Verifying that the real timing of the "get_lin_least_sq_model" function is less than that of the default "lm" function
(time2 < time1)[2]

WLS with no intercept:

# Timing of the default "lm" function
time1 <- bench::system_time({
  model_no_intercept <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb - 1, data = mtcars)
  wts <- 1 / lm(abs(model_no_intercept$residuals) ~ model_no_intercept$fitted.values)$fitted.values^2
  model_weighted_no_intercept <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb - 1, data = mtcars, weights = wts)
  model_weighted_no_intercept$coefficients
  unname(model_weighted_no_intercept$fitted.values)
  unname(model_weighted_no_intercept$residuals)
})

# Timing of the "get_lin_least_sq_model" function
time2 <- bench::system_time({
  wls_result <- get_lin_least_sq_model(x, y, intercept = FALSE, weighted = TRUE)
  wls_result$beta
  unname(wls_result$fitted_values)
  unname(wls_result$residuals)
})

# Printing the times of each function
time1
time2

# Verifying that the real timing of the "get_lin_least_sq_model" function is less than that of the default "lm" function
(time2 < time1)[2]

As seen in all the cases above, the get_lin_least_sq_model is significantly faster than the default lm function. This proves that the get_lin_least_sq_model function from this particular library, "HW4", is really efficient at computing beta coefficients, fitted values, and residuals of any linear least squares method, including OLS, WLS, OLS with no intercept, and WLS with no intercept.

"Evaluating efficiency of C++ ("Rcpp") function for residuals vs R implementation of same function"

One of the helper functions used for both the get_lin_least_sq_model function and the lin_least_squares_train_test function returns the residuals of any fitted values and actual y values passed in to it as parameters. It was initially implemented using C++ (Rcpp), as it is expected that C++ should perform much faster at certain operations or methods than R, but then it was reverted back to R due to running errors.

The following commented-out script could show efficiencies between the two implementations of the function. However, due to running problems encountered while trying to implement the Rcpp functions. It was tested elsewhere, and can always be done, but did not work properly in this vignette. It turned out that the Rcpp implementation of the function is sometimes faster than the R implementation, but not always, and it does not depend on the size of the two vectors. This is not exactly what was expected, but it was certainly worth evaluating how long it takes each implementation of a function to perform a specific task.

# ## If the `Rcpp` library is not yet installed, uncomment and run the command below to install it
# # install.packages("Rcpp")
# library(Rcpp)
# 
# ## R implementation:
# get_residuals_r <- function(fitted_values, y) {
#   if (length(fitted_values) != length(y)) {
#     stop("vectors must be of same length")
#   }
#   return(y - fitted_values)
# }
# 
# ## C++ (Rcpp) implementation:
# cppFunction('
#   NumericVector get_residuals_cpp(const NumericVector fitted_values, const NumericVector y){
#     if (fitted_values.length() != y.length()) stop ("vectors must be of same length");
#     return y - fitted_values;
#   }
# ')
# 
# ## To evaluate the efficiency of both implementations of this particular
# ## function, let's start out with two simple vectors of fitted values and y, and
# ## then print their results to first verify that they are similar.
# 
# fitted_values <- sample(500, 10)
# y <- sample(100, 10)
# 
# print(get_residuals_r(fitted_values, y))
# print(get_residuals_cpp(fitted_values, y))
# 
# ## We just verified that the two implementations of the function produce similar
# ## results. Now, let's run some more examples of fitted values and y, of
# ## different vector sizes in every case, to see if the Rcpp function performs
# ## any faster than the standard R function when generating and returning a
# ## vector of residuals.
# 
# ## If `bench` library is not yet installed, uncomment and run the command below to install it
# # install.packages("bench")
# library(bench)
# 
# #############
# ## Example 1:
# ## Creating two random vectors of the same size
# vec_size <- 10
# fitted_values <- sample(500, vec_size)
# y <- sample(100, vec_size)
# 
# ## Timing of the R-implementation of the "get_residuals" function
# time1 <- bench::system_time({
#   vec1 <- get_residuals_r(fitted_values, y)
# })
# 
# ## Timing of the Rcpp-implementation of the "get_residuals" function
# time2 <- bench::system_time({
#   vec2 <- get_residuals_cpp(fitted_values, y)
# })
# 
# ## Printing the times of each function
# time1
# time2
# 
# ## Verifying that the real timing of the C++ implementation is less than that of the default R implementation
# (time2 < time1)[2]
# 
# #############
# ## Example 2:
# ## Creating two random vectors of the same size
# vec_size <- 50
# fitted_values <- sample(500, vec_size)
# y <- sample(100, vec_size)
# 
# ## Timing of the R-implementation of the "get_residuals" function
# time1 <- bench::system_time({
#   vec1 <- get_residuals_r(fitted_values, y)
# })
# 
# ## Timing of the Rcpp-implementation of the "get_residuals" function
# time2 <- bench::system_time({
#   vec2 <- get_residuals_cpp(fitted_values, y)
# })
# 
# ## Printing the times of each function
# time1
# time2
# 
# ## Verifying that the real timing of the C++ implementation is less than that of the default R implementation
# (time2 < time1)[2]
# 
# #############
# ## Example 3:
# ## Creating two random vectors of the same size
# vec_size <- 100
# fitted_values <- sample(500, vec_size)
# y <- sample(100, vec_size)
# 
# ## Timing of the R-implementation of the "get_residuals" function
# time1 <- bench::system_time({
#   vec1 <- get_residuals_r(fitted_values, y)
# })
# 
# ## Timing of the Rcpp-implementation of the "get_residuals" function
# time2 <- bench::system_time({
#   vec2 <- get_residuals_cpp(fitted_values, y)
# })
# 
# ## Printing the times of each function
# time1
# time2
# 
# ## Verifying that the real timing of the C++ implementation is less than that of the default R implementation
# (time2 < time1)[2]
# 
# #############
# ## Example 4:
# ## Creating two random vectors of the same size
# vec_size <- 500
# fitted_values <- sample(10000, vec_size)
# y <- sample(1000, vec_size)
# 
# ## Timing of the R-implementation of the "get_residuals" function
# time1 <- bench::system_time({
#   vec1 <- get_residuals_r(fitted_values, y)
# })
# 
# ## Timing of the Rcpp-implementation of the "get_residuals" function
# time2 <- bench::system_time({
#   vec2 <- get_residuals_cpp(fitted_values, y)
# })
# 
# ## Printing the times of each function
# time1
# time2
# 
# ## Verifying that the real timing of the C++ implementation is less than that of the default R implementation
# (time2 < time1)[2]
# 
# #############
# ## Example 5:
# ## Creating two random vectors of the same size
# vec_size <- 1000
# fitted_values <- sample(10000, vec_size)
# y <- sample(12000, vec_size)
# 
# ## Timing of the R-implementation of the "get_residuals" function
# time1 <- bench::system_time({
#   vec1 <- get_residuals_r(fitted_values, y)
# })
# 
# ## Timing of the Rcpp-implementation of the "get_residuals" function
# time2 <- bench::system_time({
#   vec2 <- get_residuals_cpp(fitted_values, y)
# })
# 
# ## Printing the times of each function
# time1
# time2
# 
# ## Verifying that the real timing of the C++ implementation is less than that of the default R implementation
# (time2 < time1)[2]


rguin26/HW4 documentation built on Dec. 22, 2021, 3:09 p.m.