In leyaozh/lm.hw4: Linear Regression

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4
)

library(lm.hw4)
library(bench)

The lm.hw4 package aims to estimate and help interpret the linear relations between predictor variables and response variables:

It estimates the parameters of the linear regression model.
It summarizes the properties of fitted models.
It computes analysis of variance (or deviance) tables for fitted model objects.

This document introduces you to lm.hw4's basic set of tools.

Data: simulation data

To explore the usage of lm.hw4 package, we use simulation data.

n = 10; p = 3; q = 2;
x = matrix(rnorm(n * p), n, p) # no intercept
y1 = rnorm(n)
y2 = matrix(rnorm(n * q), n, q)

Fit model with `lm_fit()`

lm_fit() is used to fit linear regression model between a response vector/matrix Y and a predictor matrix X. The function return a list of response values.

Before fitting the model, you have to decide whether a coefficient term should be added to the model. If yes, add 'add.intercept = TRUE' to the function statement.

There are two methods supported, "inv" and "qr".

In the following example, we fit the linear regression model without adding an intercept term using method "qr".

lm_fit(x, y1, add.intercept = FALSE, method = "qr")

Summarize the properties of fitted models with `lm_summary`.

lm_summary is used to calculate properties of a fitted model. This function only support cases when the response variable is a vector or a single column matrix and the method used to fit the model is "qr". The function return a list of response values.

If you want to calculate the correlation matrix of the coefficients, add 'correlation = TRUE' to the function statement.

z = lm_fit(x, y1, add.intercept = FALSE, method = "qr")
lm_summary(z, correlation = TRUE)

Computes analysis of variance (or deviance) tables for fitted model objects with `lm_anova`.

lm_anova is used to computes analysis of variance tables for fitted model objects. This function only support cases when the response variable is a vector or a single column matrix and the method used to fit the model is "qr". The function return a list of response values.

lm_anova(z)

Comparisons against the original R functions

Data

# Simulation Data
n = 1000; p = 10;
x = matrix(rnorm(n * p), n, p) # no intercept
y = rnorm(n)

Correctness

Since the methods for estimating the parameters are different, there might be a slightly difference (around 1e-15) between the results of lm.hw4 functions and the original R functions.

Correctness is tested through the unit tests. In unit tests, 3 cases are included: simulation data of small sample size, simulation data of big sample size and real data. Each test case compares the output values of lm.hw4 functions and the original R function. All test cases are passed.

Efficiency

lm_fit v.s. lm.fit

result = bench::mark(lm_fit(x,y, method = "inv"), lm_fit(x,y, method = "qr"), lm.fit(x,y), check = FALSE, filter_gc = FALSE, min_time = 3)
print(result)
plot(result)

From the above results, we can see that the original R function is the most effective method because it use "qr" method and rewrite "qr()" function with Fortan. Compare lm_fit() function using two methods, the implementations using "qr" method are 1 times faster than the implementations using "inv" method because calculating a inverse matrix with high dimensions is time consuming. To improve the effectness of the lm_fit() function, the "qr()" function should be rewrite in C++.

lm_summary v.s. summary.lm

z1 = lm_fit(x,y, method = "qr")
z2 = lm(y~0+x)
result = bench::mark(lm_summary(z1),summary.lm(z2), check = FALSE, filter_gc = FALSE, min_time = 3)
print(result)
plot(result)

From the above results, we can see that lm_summary() function is faster than the original R function.

lm_anova v.s. anova

z1 = lm_fit(x,y, method = "qr")
z2 = lm(y~0+x)
result = bench::mark(lm_anova(z1),anova(z2), check = FALSE, filter_gc = FALSE, min_time = 3)
print(result)
plot(result)

From the above results, we can see that lm_summary() function is faster than the original R function anova. The reason is that the original R function anova is not only designed for linear regression cases, which means that it has to transform and manipulate the linear regression result.

leyaozh/lm.hw4 documentation built on Dec. 3, 2019, 7:18 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com