knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
## Run the command below to install the package first if it needs to be installed # devtools::install_github("rguin26/HW4") library(HW4)
Simple example on how to use the function get_lin_least_sq_model
:
x <- sample(100, 30, replace=FALSE) y <- sample(100, 30, replace=TRUE) ols_result <- get_lin_least_sq_model(x, y)
The get_lin_least_sq_model
function performs linear least squares, using either ordinary least squares (OLS) or weighted least squares (WLS). Also, both simple and linear regression can be performed. It returns a list of objects, including calculations of the estimates for the beta coefficients for each predictor, along with residuals and model evaluation metrics.
Other than the main data parameters, x and y, other parameters that the function takes in include intercept
and weighted
. intercept = TRUE
by default, meaning that an intercept will be used for the calculation of the beta coefficients. The values of the beta coefficients vector correspond to their respective column in x, and start with (intercept) if intercept = TRUE
. weighted = FALSE
by default, as the function therefore performs OLS. If weighted = TRUE
, then WLS is performed. The following shows the different options for the same data:
x <- sample(100, 30, replace=FALSE) y <- sample(100, 30, replace=TRUE) ols_result <- get_lin_least_sq_model(x, y) wls_result <- get_lin_least_sq_model(x, y, weighted = TRUE) ols_result_no_intercept <- get_lin_least_sq_model(x, y, intercept = FALSE) wls_result_no_intercept <- get_lin_least_sq_model(x, y, intercept = FALSE, weighted = TRUE)
This function works with numeric data only, for both x and y. Otherwise, an error is thrown, such as in the following:
x <- c(8, "v", "f", 7, "o") y <- c(8, 9, 8, 7, 5)
x <- c(1, 5, 7, 3, 9) y <- c(8, "v", "f", 7, "o")
As mentioned before, the get_lin_least_sq_model
function performs multiple linear regression as well. Below are some examples which use randomly selected numbers, and x can be either a matrix or dataframe.
x <- matrix(sample(100, 30*5, replace=TRUE), ncol = 5) y <- sample(100, 30, replace=TRUE) ols_result <- get_lin_least_sq_model(x, y) wls_result <- get_lin_least_sq_model(x, y, weighted = TRUE) ols_result_no_intercept <- get_lin_least_sq_model(x, y, intercept = FALSE) wls_result_no_intercept <- get_lin_least_sq_model(x, y, intercept = FALSE, weighted = TRUE)
x <- data.frame(matrix(sample(1000, 30*5, replace=FALSE), ncol = 5)) y <- sample(100, 30, replace=TRUE) ols_result <- get_lin_least_sq_model(x, y) wls_result <- get_lin_least_sq_model(x, y, weighted = TRUE) ols_result_no_intercept <- get_lin_least_sq_model(x, y, intercept = FALSE) wls_result_no_intercept <- get_lin_least_sq_model(x, y, intercept = FALSE, weighted = TRUE)
To view beta coefficients (of any model):
ols_result$beta
To view beta fitted values (of any model):
ols_result_no_intercept$fitted_values
To view residuals (of any model):
wls_result_no_intercept$residuals
And finally, to view model evaluation metrics (of any model):
wls_result$model_eval_metrics
The lin_least_squares_train_test
function performs similarly to the get_lin_least_sq_model
function, only it splits the initial data into training and testing subsets. It randomly selects a specified proportion, 0.8 by default, of instances from x and y, of identical indexes, to use for training the linear least squares model, and then it uses the remaining instances of x and y for testing the model. This function also returns a list of objects, including calculations of the estimates for the beta coefficients for each predictor of the model, trained by the training subset, along with residuals and model evaluation metrics for both training and testing subsets.
Simple example on how to use the function lin_least_squares_train_test
:
x <- data.frame(matrix(sample(100000, 200*5, replace=TRUE), ncol = 5)) y <- sample(100, 200, replace=TRUE) model_stats_1 <- lin_least_squares_train_test(x, y) model_stats_2 <- lin_least_squares_train_test(x, y, train_set_prop = 0.75) model_stats_3 <- lin_least_squares_train_test(x, y, weighted = TRUE, train_set_prop = 0.75)
To view beta:
model_stats_1$beta
To view training fitted values:
model_stats_2$training_fitted_values
To view training residuals:
model_stats_2$training_residuals
To view training model evaluation metrics:
model_stats_3$training_model_eval_metrics
To view testing fitted values:
model_stats_1$testing_fitted_values
To view testing residuals:
model_stats_3$testing_residuals
To view testing model evaluation metrics:
model_stats_3$testing_model_eval_metrics
get_lin_least_sq_model
function and default lm
function"Here are some comparisons between the results produced by the get_lin_least_sq_model
function and default lm
function. This time, the "mtcars" dataset will be used.
To start off, the data is prepared according to the following.
x <- mtcars[, c("cyl", "disp", "hp", "drat", "wt", "qsec", "gear", "carb")] y <- mtcars$mpg
Now, let's create models/objects based on each case: OLS, WLS, OLS with no intercept, and WLS with no intercept.
OLS:
ols_result <- get_lin_least_sq_model(x, y) model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb, data = mtcars)
WLS:
wls_result <- get_lin_least_sq_model(x, y, weighted = TRUE) model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb, data = mtcars) wts <- 1 / lm(abs(model$residuals) ~ model$fitted.values)$fitted.values^2 model_weighted <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb, data = mtcars, weights = wts)
OLS with no intercept:
ols_result_no_intercept <- get_lin_least_sq_model(x, y, intercept = FALSE) model_no_intercept <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb - 1, data = mtcars)
WLS with no intercept:
wls_result_no_intercept <- get_lin_least_sq_model(x, y, intercept = FALSE, weighted = TRUE) model_no_intercept <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb - 1, data = mtcars) wts <- 1 / lm(abs(model_no_intercept$residuals) ~ model_no_intercept$fitted.values)$fitted.values^2 model_weighted_no_intercept <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb - 1, data = mtcars, weights = wts)
Viewing the beta coefficients.
OLS:
ols_result$beta model$coefficients
WLS:
wls_result$beta model_weighted$coefficients
OLS with no intercept:
ols_result_no_intercept$beta model_no_intercept$coefficients
WLS with no intercept:
wls_result_no_intercept$beta model_weighted_no_intercept$coefficients
As shown, the beta coefficients are just about equal to each other. However, if you compare them using the ==
operator, then it does not match the expectations.
ols_result$beta == model$coefficients wls_result$beta == model_weighted$coefficients ols_result_no_intercept$beta == model_no_intercept$coefficients wls_result_no_intercept$beta == model_weighted_no_intercept$coefficients
The beta coefficients from both functions are quite similar, but not exactly equal. They are very similar in value, as seen in the outputs, so they are not significantly different. With a threshold of 0.00001, it is guaranteed that their differences are not significant.
abs(ols_result$beta == model$coefficients) < 0.00001 abs(wls_result$beta == model_weighted$coefficients) < 0.00001 abs(ols_result_no_intercept$beta == model_no_intercept$coefficients) < 0.00001 abs(wls_result_no_intercept$beta == model_weighted_no_intercept$coefficients) < 0.00001
Now, let's evaluate the fitted values and residuals for one of the specific cases instead of all 4. Let's do the WLS model.
WLS fitted values:
unname(wls_result$fitted_values) unname(model_weighted$fitted.values)
# Direct equality unname(wls_result$fitted_values == model_weighted$fitted.values) # Precise comparison to the 0.00001 threshold unname(abs(wls_result$fitted_values- model_weighted$fitted.values) < 0.00001)
WLS residuals:
unname(wls_result$residuals) unname(model_weighted$residuals)
# Direct equality unname(wls_result$residuals == model_weighted$residuals) # Precise comparison to the 0.00001 threshold unname(abs(wls_result$residuals - model_weighted$residuals) < 0.00001)
Once again, you notice that the direct equality fails between the results of each function, but are really precise, and their absolute differences are less than 0.00001, an appropriate threshold for minimizing errors. This is consistent for all the other cases.
Just for one more comparison, let's quickly view the error metrics of the WLS object, and compare the R-squared and adjusted R-squared values.
wls_result$model_eval_metrics summary(model_weighted)
This is perhaps initially a little more difficult to compare, but it is more or less clear how the results produced by the summary of the model_weighted
object, produced by the lm
function, are quite precise to 4 decimal places.
get_lin_least_sq_model
function and default lm
function"Now that we discussed the accuracies of the the different features between the get_lin_least_sq_model
and the default lm
function, including the beta coefficients, the fitted values, the residuals, plus the R-squared and adjusted R-squared values, let's evaluate how efficient the two different functions are.
The first thing to do is load the "bench" library.
## If `bench` library is not yet installed, run the command below to install it # install.packages("bench") library(bench)
Let's continue to work with the same x and y data from the "mtcars" dataset. To start, let's recreate the data just to be safe in case something was changed from before.
x <- mtcars[, c("cyl", "disp", "hp", "drat", "wt", "qsec", "gear", "carb")] y <- mtcars$mpg
Let's create the models/objects with each function for every case again, and this time, let's time how long it takes to create the object and display the beta coefficients, fitted values, and residuals.
OLS:
# Timing of the default "lm" function time1 <- bench::system_time({ model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb, data = mtcars) model$coefficients unname(model$fitted.values) unname(model$residuals) }) # Timing of the "get_lin_least_sq_model" function time2 <- bench::system_time({ ols_result <- get_lin_least_sq_model(x, y) ols_result$beta unname(ols_result$fitted_values) unname(ols_result$residuals) }) # Printing the times of each function time1 time2 # Verifying that the real timing of the "get_lin_least_sq_model" function is less than that of the default "lm" function (time2 < time1)[2]
WLS:
# Timing of the default "lm" function time1 <- bench::system_time({ model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb, data = mtcars) wts <- 1 / lm(abs(model$residuals) ~ model$fitted.values)$fitted.values^2 model_weighted <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb, data = mtcars, weights = wts) model_weighted$coefficients unname(model_weighted$fitted.values) unname(model_weighted$residuals) }) # Timing of the "get_lin_least_sq_model" function time2 <- bench::system_time({ wls_result <- get_lin_least_sq_model(x, y, weighted = TRUE) wls_result$beta unname(wls_result$fitted_values) unname(wls_result$residuals) }) # Printing the times of each function time1 time2 # Verifying that the real timing of the "get_lin_least_sq_model" function is less than that of the default "lm" function (time2 < time1)[2]
OLS with no intercept:
# Timing of the default "lm" function time1 <- bench::system_time({ model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb - 1, data = mtcars) model$coefficients unname(model$fitted.values) unname(model$residuals) }) # Timing of the "get_lin_least_sq_model" function time2 <- bench::system_time({ ols_result <- get_lin_least_sq_model(x, y, intercept = FALSE) ols_result$beta unname(ols_result$fitted_values) unname(ols_result$residuals) }) # Printing the times of each function time1 time2 # Verifying that the real timing of the "get_lin_least_sq_model" function is less than that of the default "lm" function (time2 < time1)[2]
WLS with no intercept:
# Timing of the default "lm" function time1 <- bench::system_time({ model_no_intercept <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb - 1, data = mtcars) wts <- 1 / lm(abs(model_no_intercept$residuals) ~ model_no_intercept$fitted.values)$fitted.values^2 model_weighted_no_intercept <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + gear + carb - 1, data = mtcars, weights = wts) model_weighted_no_intercept$coefficients unname(model_weighted_no_intercept$fitted.values) unname(model_weighted_no_intercept$residuals) }) # Timing of the "get_lin_least_sq_model" function time2 <- bench::system_time({ wls_result <- get_lin_least_sq_model(x, y, intercept = FALSE, weighted = TRUE) wls_result$beta unname(wls_result$fitted_values) unname(wls_result$residuals) }) # Printing the times of each function time1 time2 # Verifying that the real timing of the "get_lin_least_sq_model" function is less than that of the default "lm" function (time2 < time1)[2]
As seen in all the cases above, the get_lin_least_sq_model
is significantly faster than the default lm
function. This proves that the get_lin_least_sq_model
function from this particular library, "HW4", is really efficient at computing beta coefficients, fitted values, and residuals of any linear least squares method, including OLS, WLS, OLS with no intercept, and WLS with no intercept.
One of the helper functions used for both the get_lin_least_sq_model
function and the lin_least_squares_train_test
function returns the residuals of any fitted values and actual y values passed in to it as parameters. It was initially implemented using C++ (Rcpp), as it is expected that C++ should perform much faster at certain operations or methods than R, but then it was reverted back to R due to running errors.
The following commented-out script could show efficiencies between the two implementations of the function. However, due to running problems encountered while trying to implement the Rcpp functions. It was tested elsewhere, and can always be done, but did not work properly in this vignette. It turned out that the Rcpp implementation of the function is sometimes faster than the R implementation, but not always, and it does not depend on the size of the two vectors. This is not exactly what was expected, but it was certainly worth evaluating how long it takes each implementation of a function to perform a specific task.
# ## If the `Rcpp` library is not yet installed, uncomment and run the command below to install it # # install.packages("Rcpp") # library(Rcpp) # # ## R implementation: # get_residuals_r <- function(fitted_values, y) { # if (length(fitted_values) != length(y)) { # stop("vectors must be of same length") # } # return(y - fitted_values) # } # # ## C++ (Rcpp) implementation: # cppFunction(' # NumericVector get_residuals_cpp(const NumericVector fitted_values, const NumericVector y){ # if (fitted_values.length() != y.length()) stop ("vectors must be of same length"); # return y - fitted_values; # } # ') # # ## To evaluate the efficiency of both implementations of this particular # ## function, let's start out with two simple vectors of fitted values and y, and # ## then print their results to first verify that they are similar. # # fitted_values <- sample(500, 10) # y <- sample(100, 10) # # print(get_residuals_r(fitted_values, y)) # print(get_residuals_cpp(fitted_values, y)) # # ## We just verified that the two implementations of the function produce similar # ## results. Now, let's run some more examples of fitted values and y, of # ## different vector sizes in every case, to see if the Rcpp function performs # ## any faster than the standard R function when generating and returning a # ## vector of residuals. # # ## If `bench` library is not yet installed, uncomment and run the command below to install it # # install.packages("bench") # library(bench) # # ############# # ## Example 1: # ## Creating two random vectors of the same size # vec_size <- 10 # fitted_values <- sample(500, vec_size) # y <- sample(100, vec_size) # # ## Timing of the R-implementation of the "get_residuals" function # time1 <- bench::system_time({ # vec1 <- get_residuals_r(fitted_values, y) # }) # # ## Timing of the Rcpp-implementation of the "get_residuals" function # time2 <- bench::system_time({ # vec2 <- get_residuals_cpp(fitted_values, y) # }) # # ## Printing the times of each function # time1 # time2 # # ## Verifying that the real timing of the C++ implementation is less than that of the default R implementation # (time2 < time1)[2] # # ############# # ## Example 2: # ## Creating two random vectors of the same size # vec_size <- 50 # fitted_values <- sample(500, vec_size) # y <- sample(100, vec_size) # # ## Timing of the R-implementation of the "get_residuals" function # time1 <- bench::system_time({ # vec1 <- get_residuals_r(fitted_values, y) # }) # # ## Timing of the Rcpp-implementation of the "get_residuals" function # time2 <- bench::system_time({ # vec2 <- get_residuals_cpp(fitted_values, y) # }) # # ## Printing the times of each function # time1 # time2 # # ## Verifying that the real timing of the C++ implementation is less than that of the default R implementation # (time2 < time1)[2] # # ############# # ## Example 3: # ## Creating two random vectors of the same size # vec_size <- 100 # fitted_values <- sample(500, vec_size) # y <- sample(100, vec_size) # # ## Timing of the R-implementation of the "get_residuals" function # time1 <- bench::system_time({ # vec1 <- get_residuals_r(fitted_values, y) # }) # # ## Timing of the Rcpp-implementation of the "get_residuals" function # time2 <- bench::system_time({ # vec2 <- get_residuals_cpp(fitted_values, y) # }) # # ## Printing the times of each function # time1 # time2 # # ## Verifying that the real timing of the C++ implementation is less than that of the default R implementation # (time2 < time1)[2] # # ############# # ## Example 4: # ## Creating two random vectors of the same size # vec_size <- 500 # fitted_values <- sample(10000, vec_size) # y <- sample(1000, vec_size) # # ## Timing of the R-implementation of the "get_residuals" function # time1 <- bench::system_time({ # vec1 <- get_residuals_r(fitted_values, y) # }) # # ## Timing of the Rcpp-implementation of the "get_residuals" function # time2 <- bench::system_time({ # vec2 <- get_residuals_cpp(fitted_values, y) # }) # # ## Printing the times of each function # time1 # time2 # # ## Verifying that the real timing of the C++ implementation is less than that of the default R implementation # (time2 < time1)[2] # # ############# # ## Example 5: # ## Creating two random vectors of the same size # vec_size <- 1000 # fitted_values <- sample(10000, vec_size) # y <- sample(12000, vec_size) # # ## Timing of the R-implementation of the "get_residuals" function # time1 <- bench::system_time({ # vec1 <- get_residuals_r(fitted_values, y) # }) # # ## Timing of the Rcpp-implementation of the "get_residuals" function # time2 <- bench::system_time({ # vec2 <- get_residuals_cpp(fitted_values, y) # }) # # ## Printing the times of each function # time1 # time2 # # ## Verifying that the real timing of the C++ implementation is less than that of the default R implementation # (time2 < time1)[2]
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.