```r
knitr::opts_chunk$set(
  eval = FALSE,
  collapse = TRUE,
  comment = "#>"
)
```
To install the latest development version of the BHAM package from GitHub, type the following command in the R console:
```r
if (!require(devtools)) {
  install.packages("devtools")
}
devtools::install_github("boyiguo1/BHAM", build_vignettes = FALSE)
```
You can also set `build_vignettes = TRUE`, but this will slow down the installation drastically; the vignettes can always be accessed online at boyiguo1.github.io/BHAM/articles.
In this section, we give users a general overview of the package. We show how to 1) prepare the high-dimensional design matrix for fitting the proposed models, 2) fit generalized additive models and Cox proportional hazards additive models, 3) tune the models and assess their performance, and 4) visualize the bi-level variable selection.
```r
library(BHAM)
```
We use the spline functions and ancillary functions from the package mgcv to create the spline data.
```r
library(mgcv)
```
We provide a simulation function, sim_Bai, that follows the data-generating equation from Bai (citation). The outcome can follow a Gaussian, Poisson, or binomial distribution. Please see (function link) here. Note that the simulation function outputs "raw" data: to obtain the final design matrix for analysis, you need to set up the spline functions of your preference via a data frame (see example in ) and pass it to the function construct_smooth_data.
```r
set.seed(1)

library(dplyr)  # for the %>% pipe used below

## Simulate some data
n_train <- 500
n_test  <- 1000
p       <- 10

# Training data
train_dat <- sim_Bai(n_train, p)
dat       <- train_dat$dat %>% data.frame()
theta     <- train_dat$theta

# Test data
test_tmp   <- sim_Bai(n_test, p)
test_dat   <- test_tmp$dat %>% data.frame()
test_theta <- test_tmp$theta
```
Our models are primarily designed for high-dimensional applications, so we are mindful of how the regression equation is formulated. Our solution is a data frame that contains the necessary components of each spline function, in the format of Var, Func, Args; it is later passed to a parser function that assembles the regression equation for you. We believe this automated process, implemented in construct_smooth_data, eases the task of setting up the regression formula when the number of spline components grows to tens or hundreds.
```r
# Low-dimensional setting
mgcv_df <- dplyr::tribble(
  ~Var, ~Func, ~Args,
  "X1", "s", "bs='cr', k=5",
  "X2", "s", NA,
  "X3", "s", ""
)

# High-dimensional setting
mgcv_df <- data.frame(
  Var  = setdiff(names(dat), "y"),
  Func = "s",
  Args = "bs='cr', k=7"
)
```
We use smoothCon from the package mgcv to construct the smooth terms. Hence, construct_smooth_data has the potential to work with user-defined spline functions, as long as they follow the mgcv standard.
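For instance, a hypothetical specification that mixes P-splines and thin-plate regression splines, both standard mgcv bases, should be parsed the same way (custom_df and its settings below are illustrative, not part of the original example):

```r
# Hypothetical specification: any basis smoothCon() understands,
# e.g. P-splines (bs='ps') or thin-plate regression splines (bs='tp')
custom_df <- data.frame(
  Var  = c("X1", "X2"),
  Func = "s",
  Args = c("bs='ps', k=9", "bs='tp', k=5")
)
custom_sm <- construct_smooth_data(custom_df, dat)
```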
In construct_smooth_data, we take multiple reparameterization steps via smoothCon: 1) absorbing the linear constraints, 2) eigendecomposition of the smoothing matrix $S$ to isolate the null and penalized spaces, and 3) scaling the design matrix so that each column, i.e. each basis of the design matrix, has unit variance.
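The sketch below illustrates these three steps using mgcv directly on a toy variable; it is a minimal illustration of the idea, and BHAM's internal implementation may differ in detail:

```r
library(mgcv)

# Toy data and a single cubic regression spline smooth
toy <- data.frame(x = runif(100))
sm  <- smoothCon(s(x, bs = "cr", k = 5), data = toy,
                 absorb.cons = TRUE)[[1]]   # 1) absorb linear constraints

# 2) eigendecompose the penalty S: positive eigenvalues span the
#    penalized space, (near-)zero eigenvalues span the null space
eS  <- eigen(sm$S[[1]], symmetric = TRUE)
pen <- eS$values > max(eS$values) * 1e-8
X_pen  <- sm$X %*% eS$vectors[,  pen]
X_null <- sm$X %*% eS$vectors[, !pen, drop = FALSE]

# 3) scale each basis column to unit variance
X_pen <- scale(X_pen)
```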
```r
train_sm_dat      <- construct_smooth_data(mgcv_df, dat)
train_smooth_data <- train_sm_dat$data
```
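The constructed design matrix expands each predictor into named basis columns; judging from the grep patterns used later in this vignette, the null-space and penalized-space bases carry .null and .pen suffixes:

```r
# Inspect the expanded basis columns (naming inferred from the
# variable-selection calls at the end of this vignette)
head(names(train_smooth_data), 10)
dim(train_smooth_data)
```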
The reparameterization of the design matrix complicates setting up the design matrix for the test data. Hence, we provide make_predict_dat, a wrapper around PredictMat from mgcv.
```r
train_smooth <- train_sm_dat$Smooth
test_sm_dat  <- make_predict_dat(train_sm_dat$Smooth, dat = test_dat)
```
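A quick sanity check, assuming make_predict_dat returns a data frame whose basis columns match the training design matrix:

```r
# The test design matrix should align column-for-column with training
stopifnot(identical(names(test_sm_dat), names(train_smooth_data)))
dim(test_sm_dat)  # n_test rows, same basis columns
```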
The model fitting functions follow familiar interfaces: bgam takes a formula and a data frame, similar to glm, while bamlasso takes a design matrix and an outcome vector, similar to glmnet.
```r
# For comparison, a lasso fit on the same design matrix:
# lasso_mdl <- cv.glmnet(train_smooth_data %>% as.matrix(), dat$y,
#                        family = "binomial")
# lasso_mdl <- glmnet(train_smooth_data %>% as.matrix(), dat$y,
#                     family = "binomial", lambda = lasso_mdl$lambda.min)

# Spike-and-slab GAM with a mixture double-exponential prior
mdl1 <- bgam(y ~ . - y,
             data   = data.frame(train_smooth_data, y = dat$y),
             family = "binomial",
             prior  = mde(),
             group  = make_group(names(train_smooth_data)))

# Variants: fit on a scaled design matrix, or use a mixture-t prior
# with an unpenalized null space
# mdl1_scale <- bgam(y ~ . - y,
#                    data = data.frame(scale(train_smooth_data), y = dat$y),
#                    family = "binomial", prior = mde(),
#                    group = make_group(names(train_smooth_data)))
# mdl3 <- bgam(y ~ . - y,
#              data = data.frame(train_smooth_data, y = dat$y),
#              family = "binomial", prior = mt(df = Inf),
#              group = make_group(names(train_smooth_data), penalize_null = FALSE))

# Spike-and-slab lasso GAM
mdl2 <- bamlasso(x = train_smooth_data, y = dat$y,
                 family = "binomial",
                 group  = make_group(names(train_smooth_data)))
```
To tune the model, we specify a sequence of candidate values for the spike scale parameter s0 and select among them via cross-validation:

```r
s0_seq <- seq(0.005, 0.1, 0.01)
```
Cross-validation and tuning for the bgam model:

```r
set.seed(1)
tmp <- cv.bgam(mdl1)

# Validation
# set.seed(1)
# tmp2 <- cv.bh(mdl1)

res <- tune.bgam(mdl1, s0 = s0_seq)

# Equivalent manual tuning loop:
# res <- map_dfr(s0_seq, .f = function(.s0, .mdl) {
#   set.seed(1)
#   tmp <- cv.bgam(.mdl, s0 = .s0)
#   tmp$measures %>% t() %>% data.frame()
# }, .mdl = mdl1) %>%
#   data.frame(s0 = s0_seq, .)
```
The same applies to the bamlasso model:

```r
set.seed(1)
tmp <- cv.bgam(mdl2)

# set.seed(1)
# tmp2 <- cv.bh(mdl2)

res <- tune.bgam(mdl2, s0 = s0_seq)
```
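From the tuning results one can then pick the best spike scale. The sketch below assumes tune.bgam returns one row of cross-validated measures per candidate, with an s0 column and a deviance column as produced by cv.bh; inspect res to confirm the column names:

```r
# Hypothetical selection of the spike scale minimizing CV deviance
# (column names assumed, not guaranteed by the package)
best_s0 <- res$s0[which.min(res$deviance)]
best_s0
```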
Many of the functionalities in the R package BhGLM can be directly applied to bgam and bamlasso models, for example, cross-validation:
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools") library(devtools) if (!requireNamespace("BhGLM", quietly = TRUE)) install_github("boyiguo1/BhGLM@spline") library(BhGLM) if (!requireNamespace("pROC", quietly = TRUE)) install.packages("pROC") cv.bh(mdl1)
Finally, we can examine the effective degrees of freedom of individual smooth components, here for the penalized bases of selected predictors and the null-space basis of x1:

```r
df.adj(mdl1, names(train_smooth_data)[grep("x2[.]pen",  names(train_smooth_data))])
df.adj(mdl1, names(train_smooth_data)[grep("x20[.]pen", names(train_smooth_data))])

calculate_EDF(mdl1, names(train_smooth_data)[grep("x2[.]pen",  names(train_smooth_data))])
calculate_EDF(mdl1, names(train_smooth_data)[grep("x20[.]pen", names(train_smooth_data))])
calculate_EDF(mdl1, names(train_smooth_data)[grep("x1[.]null", names(train_smooth_data))])
```