```r
knitr::opts_chunk$set(echo = TRUE)
library(bench)
library(ggbeeswarm)
library(gmp)
library(tidyr)
library(lmpackage)
```
The "lmpackage" package simulates the output of the `lm()` function, returning its essential outputs such as the coefficients and fitted values. In the package, `linear_regression()` is the main function that returns the result of a linear regression, and `model_fit()` is a helper function for model fitting. In addition, a function named `simulate_data()` generates test data. This vignette checks the correctness of the lmpackage output against `lm()` and compares the efficiency of the two, using random test data generated by `simulate_data()`.
To check the correctness of our package against `lm()`, we randomly generated a simulated data set of 500 × 4 observations using our data-generating function `simulate_data()`. The data frame contains four randomly generated columns named V1, V2, V3, and V4. The V1 column is the outcome; to guarantee a linear relationship, V1 is generated as a linear combination of V2, V3, and V4 plus a random error. The data also contain random `(weights)` and `(offsets)` columns for the later tests.
```r
datasize1 = 500
datasize2 = 3
data_range = matrix(c(0, 0, 0, 1, 1, 1), 3, 2)
offsets_range = c(0, 1)
weights_range = c(0, 1)
coefficients = c(1, 2, 3)
data = simulate_data(datasize1, datasize2, error_range = c(1, 2), data_range,
                     offsets_range, weights_range, coefficients)
m = lm(V1 ~ V2 + V3 + V4, data)
m1 = linear_regression(V1 ~ V2 + V3 + V4, data)
```
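Before running the comparisons, it can help to confirm that the simulated data has the expected shape. A quick sketch, assuming `simulate_data()` returns a plain data frame with the columns described above:

```r
# Inspect the structure of the simulated data; the `(weights)` and
# `(offsets)` column names are assumed to match the description above.
str(data)
head(data[, c("V1", "V2", "V3", "V4")])
```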
We are mainly interested in the core outputs of `lm()`, such as the coefficients and fitted values. First, we checked the basic model via the `all.equal()` function. The fitted.values, coefficients, residuals, effects, rank, and df.residual components are all equal across the two models, suggesting our implementation is correct. Second, we created m0.1 and m1.1 as models without an intercept; the results are also equal.
```r
all.equal(m1$fitted.values, unname(m$fitted.values))
all.equal(m1$coefficients, unname(m$coefficients))
all.equal(m1$residuals, unname(m$residuals))
all.equal(m1$effects, unname(m$effects))
all.equal(m1$rank, unname(m$rank))
all.equal(m1$df.residual, unname(m$df.residual))

m0.1 = lm(V1 ~ V2 + V3 + V4 - 1, data, x = TRUE, y = TRUE)
m1.1 = linear_regression(V1 ~ V2 + V3 + V4 - 1, data, x = TRUE, y = TRUE)
all.equal(m1.1$fitted.values, unname(m0.1$fitted.values))
all.equal(m1.1$coefficients, unname(m0.1$coefficients))
all.equal(m1.1$residuals, unname(m0.1$residuals))
all.equal(m1.1$effects, unname(m0.1$effects))
all.equal(m1.1$rank, unname(m0.1$rank))
all.equal(m1.1$df.residual, unname(m0.1$df.residual))
```
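Note that `all.equal()` compares numeric values up to a small tolerance (by default about 1.5e-8, the square root of the machine epsilon), which is what we want here: two numerically computed fits can differ by floating-point rounding even when both are correct. A minimal illustration:

```r
# Differences below the default tolerance are treated as equal;
# larger differences are reported rather than returning FALSE.
all.equal(1, 1 + 1e-10)   # TRUE: below tolerance
all.equal(1, 1 + 1e-4)    # reports the mean relative difference
identical(1, 1 + 1e-10)   # FALSE: identical() allows no rounding error
```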
Third, we consider adding offsets and weights to the model. Without loss of generality, we create a random vector of length 500 and compare the outputs of the two functions. As the output shows, the regression results are again all equal.
```r
data$offset = runif(500, 0, 1)
m0.2 = lm(V1 ~ V2 + V3 + V4, data = data,
          weights = data$`(weights)`, offset = data$offset)
m1.2 = linear_regression(V1 ~ V2 + V3 + V4, data = data,
                         weights = data$`(weights)`, offset = data$offset)
all.equal(unname(m1.2$fitted.values), unname(m0.2$fitted.values))
all.equal(m1.2$coefficients, unname(m0.2$coefficients))
all.equal(unname(m1.2$residuals), unname(m0.2$residuals))
all.equal(m1.2$effects, unname(m0.2$effects))
all.equal(m1.2$rank, unname(m0.2$rank))
all.equal(m1.2$df.residual, unname(m0.2$df.residual))
```
We use `bench::mark()` to compare the running time of the two methods. From the figure, our lmpackage is more efficient than the `lm()` function; the output shows median times of about 1 ms and 3 ms, respectively. This may be because `lm()` does more work (such as building the full model frame and a richer return object) than `linear_regression()` does here.
```r
formula = V1 ~ V2 + V3 + V4
result = bench::mark(
  lm(formula, data)$rank,
  linear_regression(formula, data)$rank
)
print(result)
plot(result)
```
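The median timings can also be read directly off the `bench::mark()` result, which is a tibble whose `median` column holds bench_time values (column names as documented in the bench package):

```r
# Median run times in seconds, and how many times slower lm() is
# than linear_regression() on this data.
as.numeric(result$median)
as.numeric(result$median[1]) / as.numeric(result$median[2])
```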