knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4
)
library(lm.hw4)
library(bench)

The lm.hw4 package aims to estimate and help interpret the linear relations between predictor variables and response variables:

This document introduces you to lm.hw4's basic set of tools.

Data: simulation data

To explore the usage of lm.hw4 package, we use simulation data.

n = 10; p = 3; q = 2;
x = matrix(rnorm(n * p), n, p) # no intercept
y1 = rnorm(n)
y2 = matrix(rnorm(n * q), n, q)

Fit model with lm_fit()

lm_fit() is used to fit linear regression model between a response vector/matrix Y and a predictor matrix X. The function return a list of response values.

Before fitting the model, you have to decide whether a coefficient term should be added to the model. If yes, add 'add.intercept = TRUE' to the function statement.

There are two methods supported, "inv" and "qr".

In the following example, we fit the linear regression model without adding an intercept term using method "qr".

lm_fit(x, y1, add.intercept = FALSE, method = "qr")

Summarize the properties of fitted models with lm_summary.

lm_summary is used to calculate properties of a fitted model. This function only support cases when the response variable is a vector or a single column matrix and the method used to fit the model is "qr". The function return a list of response values.

If you want to calculate the correlation matrix of the coefficients, add 'correlation = TRUE' to the function statement.

z = lm_fit(x, y1, add.intercept = FALSE, method = "qr")
lm_summary(z, correlation = TRUE)

Computes analysis of variance (or deviance) tables for fitted model objects with lm_anova.

lm_anova is used to computes analysis of variance tables for fitted model objects. This function only support cases when the response variable is a vector or a single column matrix and the method used to fit the model is "qr". The function return a list of response values.

lm_anova(z)

Comparisons against the original R functions

Data

# Simulation Data
n = 1000; p = 10;
x = matrix(rnorm(n * p), n, p) # no intercept
y = rnorm(n)

Correctness

Since the methods for estimating the parameters are different, there might be a slightly difference (around 1e-15) between the results of lm.hw4 functions and the original R functions.

Correctness is tested through the unit tests. In unit tests, 3 cases are included: simulation data of small sample size, simulation data of big sample size and real data. Each test case compares the output values of lm.hw4 functions and the original R function. All test cases are passed.

Efficiency

result = bench::mark(lm_fit(x,y, method = "inv"), lm_fit(x,y, method = "qr"), lm.fit(x,y), check = FALSE, filter_gc = FALSE, min_time = 3)
print(result)
plot(result)

From the above results, we can see that the original R function is the most effective method because it use "qr" method and rewrite "qr()" function with Fortan. Compare lm_fit() function using two methods, the implementations using "qr" method are 1 times faster than the implementations using "inv" method because calculating a inverse matrix with high dimensions is time consuming. To improve the effectness of the lm_fit() function, the "qr()" function should be rewrite in C++.

z1 = lm_fit(x,y, method = "qr")
z2 = lm(y~0+x)
result = bench::mark(lm_summary(z1),summary.lm(z2), check = FALSE, filter_gc = FALSE, min_time = 3)
print(result)
plot(result)

From the above results, we can see that lm_summary() function is faster than the original R function.

z1 = lm_fit(x,y, method = "qr")
z2 = lm(y~0+x)
result = bench::mark(lm_anova(z1),anova(z2), check = FALSE, filter_gc = FALSE, min_time = 3)
print(result)
plot(result)

From the above results, we can see that lm_summary() function is faster than the original R function anova. The reason is that the original R function anova is not only designed for linear regression cases, which means that it has to transform and manipulate the linear regression result.



leyaozh/lm.hw4 documentation built on Dec. 3, 2019, 7:18 a.m.