knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

elnglm is a self-developed package that fits the generalized linear model (GLM) with an elastic net penalty and predicts the response based on the fitted GLM. The package is written in R and accelerated with C++ code through the Rcpp and RcppArmadillo packages. After the manual section, a report comparing the computational costs of the R and Armadillo engines of this package is attached.

How to Use elnglm

Installation

The elnglm package is available on the GitHub repository https://github.com/PingYangChen/elnglm. The user can install elnglm in R with the following commands, using the devtools package.

install.packages("devtools")
devtools::install_github("PingYangChen/elnglm")

Then, load elnglm with the standard command in R.

library(elnglm)

Simulation Data Generator

Use glmDataGen to generate simulation data with a continuous response (family = "gaussian"), a binary response (family = "binomial"), or a multi-categorical response (family = "multinomial").

For the continuous and binary types of response variable, input the real-valued intercept trueb0 and the real-valued coefficient vector trueb. The length of trueb should equal the dimension of the predictor matrix, d. The continuous response is generated under a normal error assumption, that is, $y_i = \beta_0 + x_i^\top\boldsymbol\beta + \varepsilon_i$ with $\varepsilon_i \sim N(0, s^2)$, and hence the user needs to specify the standard deviation of the error term, s.

set.seed(99)
trueb0 <- 1
trueact <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
trueb <- runif(10, -1, 1)*10
trueb[which(trueact == 0)] <- 0 
# Continuous response
df <- glmDataGen(n = 500, d = 10, family = "gaussian", trueb0, trueb, s = 1, seed = 100)
# Binary response
dfb <- glmDataGen(n = 500, d = 10, family = "binomial", trueb0, trueb, seed = 100)
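The returned object is a list containing the predictor matrix x and the response vector y, as used throughout this vignette. A quick sanity check on the generated data:

# Check dimensions of the generated data
dim(df$x)     # 500 10
length(df$y)  # 500
# Class counts of the binary response
table(dfb$y)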

For a multi-categorical response variable, the function simulates data from a multinomial logistic model. Suppose there are $K$ categories; the user then needs to input $K$ sets of regression coefficients: the intercept term is a real-valued vector of length $K$ and the regression coefficients form a $d\times K$ matrix. See details in the "Theoretical Development" section.

set.seed(99)
trueb0m <- c(1, 1, 1)
trueactm <- cbind(
  c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0),
  c(0, 0, 0, 1, 1, 1, 1, 0, 0, 0),
  c(0, 0, 0, 0, 0, 1, 1, 1, 1, 0)
)
truebm <- matrix(runif(10*3, -1, 1)*10, 10, 3)
for (m in 1:3) { truebm[which(trueactm[,m] == 0),m] <- 0 }
#
dfm <- glmDataGen(500, 10, family = "multinomial", trueb0m, truebm, seed = 100)
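The same check applies to the multinomial data; the tabulation below assumes the response y is returned as category labels:

# Category counts of the multinomial response (assuming y holds category labels)
table(dfm$y)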

Modelling and Prediction

Use glmPenaltyFit to fit the generalized linear model (GLM) with elastic net penalty. The function supports GLMs for a continuous response (family = "gaussian"), a binary response (family = "binomial"), and a multi-categorical response (family = "multinomial"). The function fits GLMs for several penalty parameters at the same time. The vector of penalty parameters can be generated automatically as a sequence of length lambdaLength, decreasing from the theoretically obtained maximal value, $\lambda_{max}$, down to minLambdaRatio times $\lambda_{max}$. Alternatively, the function accepts a user-defined lambda vector via lambdaVec.

mdl <- glmPenaltyFit(y = df$y, x = df$x, family = "gaussian", lambdaLength = 100, 
                     maxit = 1e5, tol = 1e-7, alpha = 0.5, minLambdaRatio = 1e-3, 
                     ver = "arma")
colorlist <- sapply(1:nrow(mdl$b), function(k) {
  val <- runif(3); rgb(val[1], val[2], val[3])
})
plot(mdl$lambda, mdl$b[1,], type = "l", col = colorlist[1], ylim = range(mdl$b),
     xlab = "lambda", ylab = "Coefficient Value")
for (i in 2:nrow(mdl$b)) {
  points(mdl$lambda, mdl$b[i,], type = "l", col = colorlist[i])
}
legend("topright", legend = sprintf("b%d", 1:nrow(mdl$b)), ncol = 3,
       col = colorlist, lty = 1, cex = 0.6)
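A user-defined penalty sequence can be supplied as well. The sketch below assumes the lambdaVec argument mentioned above replaces the automatically generated sequence:

# Fit along a user-defined penalty sequence (lambdaVec is assumed to
# override the automatic lambdaLength/minLambdaRatio sequence)
myLambda <- exp(seq(log(10), log(1e-2), length.out = 50))
mdlUser <- glmPenaltyFit(y = df$y, x = df$x, family = "gaussian",
                         lambdaVec = myLambda, maxit = 1e5, tol = 1e-7,
                         alpha = 0.5, ver = "arma")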

The function glmPenaltyCV supports tuning the penalty parameter via k-fold cross-validation.

mdlcv <- glmPenaltyCV(y = df$y, x = df$x, family = "gaussian", lambdaLength = 100, 
                      maxit = 1e5, tol = 1e-7, alpha = 0.5, minLambdaRatio = 1e-3,
                      nfolds = 10, ver = "arma")
plot(mdlcv$lambda, mdlcv$cvscore, type = "l", xlab = "lambda", ylab = "RMSE Value")
# Intercept of the linear model
mdlcv$b0[mdlcv$lambdaBestId]
# Regression coefficients of the linear model
mdlcv$b[,mdlcv$lambdaBestId]
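The CV-selected penalty can also be marked on the cross-validation curve using the lambdaBestId index returned above:

# Mark the CV-selected penalty on the curve
plot(mdlcv$lambda, mdlcv$cvscore, type = "l", xlab = "lambda", ylab = "RMSE Value")
abline(v = mdlcv$lambda[mdlcv$lambdaBestId], lty = 2, col = "red")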

The function glmPenaltyPred predicts the response for the new predictor.

# Predict for new data
xnew <- matrix(rnorm(10), 1, 10)
yp <- glmPenaltyPred(mdlcv, xnew)
dim(yp) # 1 100
print(yp[,mdlcv$lambdaBestId])
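Because the true coefficients are known in this simulation, the CV-selected prediction can be checked against the true mean response:

# Compare the CV-selected prediction with the true mean response
c(predicted = yp[, mdlcv$lambdaBestId],
  truth = trueb0 + as.numeric(xnew %*% trueb))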

Below, the GLM with elastic net penalty is fitted to the binary response.

mdlcv_b <- glmPenaltyCV(y = dfb$y, x = dfb$x, family = "binomial", lambdaLength = 100,
                        maxit = 1e5, tol = 1e-7, alpha = 0.5, minLambdaRatio = 1e-3, 
                        nfolds = 10, ver = "arma")
plot(mdlcv_b$lambda, exp(-mdlcv_b$cvscore), type = "l", xlab = "lambda", ylab = "Accuracy Value")
# Intercept of the logistic model
mdlcv_b$b0[mdlcv_b$lambdaBestId]
# Regression coefficients of the logistic model
mdlcv_b$b[,mdlcv_b$lambdaBestId]
# Predict for new data
xnew <- matrix(rnorm(10), 1, 10)
#
yp <- glmPenaltyPred(mdlcv_b, xnew, type = "response")
print(yp[,mdlcv_b$lambdaBestId])
#
yp <- glmPenaltyPred(mdlcv_b, xnew, type = "probability")
print(yp[,mdlcv_b$lambdaBestId])
#
yp <- glmPenaltyPred(mdlcv_b, xnew, type = "link")
print(yp[,mdlcv_b$lambdaBestId])
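The three prediction types are connected through the logistic link. The check below assumes that type = "link" returns the linear predictor $\beta_0 + x^\top\boldsymbol\beta$ and that type = "probability" returns the fitted probability of class 1:

# Verify the logistic relation between the "link" and "probability" outputs
eta <- glmPenaltyPred(mdlcv_b, xnew, type = "link")[, mdlcv_b$lambdaBestId]
prb <- glmPenaltyPred(mdlcv_b, xnew, type = "probability")[, mdlcv_b$lambdaBestId]
all.equal(as.numeric(prb), as.numeric(1/(1 + exp(-eta))))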

Below, the GLM with elastic net penalty is fitted to the multi-categorical response.

mdlcv_m <- glmPenaltyCV(y = dfm$y, x = dfm$x, family = "multinomial", lambdaLength = 100, 
                        maxit = 1e5, tol = 1e-7, alpha = 0.5, minLambdaRatio = 1e-3, 
                        nfolds = 10, ver = "arma")
# Intercept of the multinomial model
mdlcv_m$b0[,mdlcv_m$lambdaBestId]
# Regression coefficients of the multinomial model
mdlcv_m$b[,,mdlcv_m$lambdaBestId]
# Predict for new data
xnew <- matrix(rnorm(10), 1, 10)
#
yp <- glmPenaltyPred(mdlcv_m, xnew, type = "response")
print(yp[,,mdlcv_m$lambdaBestId])
#
yp <- glmPenaltyPred(mdlcv_m, xnew, type = "probability")
print(yp[,,mdlcv_m$lambdaBestId])
#
yp <- glmPenaltyPred(mdlcv_m, xnew, type = "link")
print(yp[,,mdlcv_m$lambdaBestId])
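Under the multinomial logistic model described in the "Theoretical Development" section, the fitted class probabilities sum to one; a quick check, assuming type = "probability" returns the $K$ class probabilities:

# Class probabilities across the K categories should sum to one
prob <- glmPenaltyPred(mdlcv_m, xnew, type = "probability")[, , mdlcv_m$lambdaBestId]
sum(prob)  # expected: 1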

Performance Test

A simple test compares this self-developed elnglm package, using both its R and Armadillo engines, with the glmnet package.

library(glmnet)
# Set for true values of the regression coefficients
trueb0 <- 1
trueb <- c(6, -6, 2, -0.9, 0, 0, 0, 0, 0, 0)
nRep <- 50
testResult <- list(
  "elnglm_r" = list(
    "estb" = matrix(0, nRep, length(trueb)), 
    "cputime" = rep(0, nRep)
  ),
  "elnglm_arma" = list(
    "estb" = matrix(0, nRep, length(trueb)), 
    "cputime" = rep(0, nRep)
  ),
  "glmnet" = list(
    "estb" = matrix(0, nRep, length(trueb)), 
    "cputime" = rep(0, nRep)
  )
)

for (iRep in 1:nRep) {
  # Generate simulation data
  df <- glmDataGen(500, 10, family = "gaussian", trueb0, trueb, s = 1, 
                   seed = iRep)
  # Run cv for self-developed elnglm using R engine
  cpu0 <- system.time({
    mdlSelf_r <- glmPenaltyCV(y = df$y, x = df$x, family = "gaussian", 
                              lambdaLength = 100, maxit = 1e5, tol = 1e-7,
                              alpha = 0.5, minLambdaRatio = 1e-3, nfolds = 5, 
                              ver = "r")
  })
  testResult$elnglm_r$estb[iRep,] <- mdlSelf_r$b[,mdlSelf_r$lambdaBestId]
  testResult$elnglm_r$cputime[iRep] <- cpu0["elapsed"]

  # Run cv for self-developed elnglm using armadillo engine
  cpu1 <- system.time({
    mdlSelf_arma <- glmPenaltyCV(y = df$y, x = df$x, family = "gaussian", 
                                 lambdaLength = 100, maxit = 1e5, tol = 1e-7,
                                 alpha = 0.5, minLambdaRatio = 1e-3, nfolds = 5, 
                                 ver = "arma")
  })
  testResult$elnglm_arma$estb[iRep,] <- mdlSelf_arma$b[,mdlSelf_arma$lambdaBestId]
  testResult$elnglm_arma$cputime[iRep] <- cpu1["elapsed"]

  # Run cv for the glmnet package
  cpu2 <- system.time({
    mdlPkg <- cv.glmnet(y = df$y, x = df$x, family = "gaussian", 
                        lambda = mdlSelf_arma$lambda, maxit = 1e5, thresh = 1e-7,
                        alpha = 0.5, type.measure = "mse", nfolds = 5)
  })
  testResult$glmnet$estb[iRep,] <- mdlPkg$glmnet.fit$beta[,which.min(mdlPkg$cvm)]
  testResult$glmnet$cputime[iRep] <- cpu2["elapsed"]
}
trueact <- abs(trueb) > 0
#
elnglm_r_act <- abs(testResult$elnglm_r$estb) > 0
elnglm_arma_act <- abs(testResult$elnglm_arma$estb) > 0
glmnet_act <- abs(testResult$glmnet$estb) > 0
#
sumTable <- rbind(
  c(
    sapply(1:length(trueact), function(j) 100*mean(elnglm_r_act[,j] == trueact[j])),
    mean(testResult$elnglm_r$cputime)
  ),
  c(
    sapply(1:length(trueact), function(j) 100*mean(elnglm_arma_act[,j] == trueact[j])),
    mean(testResult$elnglm_arma$cputime)
  ),
  c(
    sapply(1:length(trueact), function(j) 100*mean(glmnet_act[,j] == trueact[j])),
    mean(testResult$glmnet$cputime)
  )
)
dimnames(sumTable) <- list(
  c("elnglm_r", "elnglm_arma", "glmnet"), 
  c(sprintf("x%02d", 1:length(trueact)), "CPU Time (s)")
)
#
# Accuracy values, %, of identifying active/inactive status of the covariates
# and computing time, in seconds, of the functions.
print(sumTable)

Theoretical Development

Let $(\mathbf{y}, \mathbf{X})$ be the data of $n$ observations, where $\mathbf{y}\in\mathbb{R}^n$ is the vector of the response variable and $\mathbf{X}\in\mathbb{R}^{n\times p}$ is the matrix of $p$ covariates. The generalized linear model is a common approach to describe the association between the covariates and the response variable. For a given $\mathbf{x}$ and a proper link function $g(\cdot)$, e.g., the identity link for a continuous response and the logit link for a binary response, the expected response has the form of a linear combination of the covariates: $$ g\left[E(Y\mid\mathbf{x})\right] = \beta_0 + \sum_{j=1}^p x_j\beta_j $$ where $\beta_0$ is the intercept term and $\boldsymbol\beta=(\beta_1, \ldots, \beta_p)$ are the regression coefficients.

(TBW)

Suppose $Y$ is a continuous response variable. The elastic net estimates of $\beta_0$ and $\boldsymbol\beta$ minimize the objective function $$ \underset{\beta_0, \boldsymbol\beta}{\min}\ \frac{1}{2n}\sum_{i=1}^n \left( y_i - \beta_0 - x_i^\top\boldsymbol\beta \right)^2 + \lambda \sum_{j=1}^p\left[ \frac{1-\alpha}{2} \beta_j^2 + \alpha |\beta_j| \right] $$ where $\alpha$ is the elastic net mixing parameter between the ridge regression penalty ($\alpha = 0$) and the lasso penalty ($\alpha = 1$), and $\lambda$ is the shrinkage parameter.
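Such objectives are commonly minimized by cyclic coordinate descent (Friedman, Hastie and Tibshirani, 2010); whether elnglm uses exactly this scheme internally is an assumption here, but the update clarifies the roles of $\alpha$ and $\lambda$. With standardized covariates, the coordinate-wise update for $\beta_j$ is $$ \hat\beta_j \leftarrow \frac{S\left(\frac{1}{n}\sum_{i=1}^n x_{ij}\left(y_i - \tilde{y}_i^{(j)}\right),\ \lambda\alpha\right)}{1 + \lambda(1-\alpha)}, \qquad S(z, \gamma) = \mathrm{sign}(z)\,(|z| - \gamma)_+, $$ where $\tilde{y}_i^{(j)}$ is the fitted value computed without covariate $j$ and $S$ is the soft-thresholding operator. Requiring every $\hat\beta_j$ to be zero in this update gives the smallest penalty at which all coefficients vanish, $\lambda_{max} = \frac{1}{n\alpha}\max_j \big|\sum_{i=1}^n x_{ij}(y_i - \bar{y})\big|$, presumably the theoretically obtained maximal value $\lambda_{max}$ referred to in the manual section.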

For binary responses, $Y\in\{0,1\}$, the linear logistic regression is considered. That is, $$ p(x)=\Pr(Y = 1\mid x) = \frac{1}{1 + e^{-(\beta_0 + x^\top\boldsymbol\beta)}}, $$ or equivalently, $$ \log{\frac{p(x)}{1-p(x)}}=\beta_0 + x^\top\boldsymbol\beta. $$

The elastic-net logistic regression estimates of $\beta_0$ and $\boldsymbol\beta$ maximize the penalized log-likelihood function $$ \underset{\beta_0, \boldsymbol\beta}{\max}\ \frac{1}{n}\sum_{i=1}^n \Big[ I\{y_i=0\}\log\{1-p(x_i)\} + I\{y_i=1\}\log\,p(x_i) \Big] - \lambda \sum_{j=1}^p\left[ \frac{1-\alpha}{2} \beta_j^2 + \alpha |\beta_j| \right]. $$ When the response variable has multiple categories, the multinomial logistic model is used. Suppose there are $K$ categories, $\{0, 1, \ldots, K-1\}$; the probability that $Y$ belongs to the $k$-th category is $$ p_k(x)=\Pr(Y = k\mid x) = \frac{e^{\beta_{0k} + x^\top\boldsymbol\beta_k}}{\sum_{l=0}^{K-1} e^{\beta_{0l} + x^\top\boldsymbol\beta_l}}, $$ where $\beta_{0k}$ and $\boldsymbol\beta_k=(\beta_{1k}, \beta_{2k}, \ldots, \beta_{pk})$ are the coefficients of the $k$-th logistic model.

The elastic-net estimates of the multinomial regression coefficients maximize the penalized log-likelihood function $$ \underset{\beta_{0k}, \boldsymbol\beta_k,\ k=0,\ldots,K-1}{\max}\ \frac{1}{n}\sum_{i=1}^n \Big[ \sum_{k=0}^{K-1} I\{y_i = k\}\log\,p_k(x_i) \Big] - \lambda \sum_{j=1}^p\sum_{k=0}^{K-1}\left[ \frac{1-\alpha}{2} \beta_{jk}^2 + \alpha |\beta_{jk}| \right]. $$


