bess: Best subset selection

In BeSS: Best Subset Selection in Linear, Logistic and CoxPH Models

Description

Best subset selection for generalized linear models and Cox's proportional hazards model.

Usage

  bess(x, y, family = c("gaussian", "binomial", "cox"),
       method = "gsection", s.min = 1,
       s.max, s.list, K.max = 20,
       max.steps = 15, glm.max = 1e6,
       cox.max = 20, factor = NULL,
       epsilon = 1e-4, weights = rep(1, nrow(x)))

Arguments

 x
     Input matrix, of dimension n x p; each row is an observation vector.

 y
     Response variable, of length n. For family = "binomial", y should be a factor with two levels. For family = "cox", y should be a two-column matrix with columns named 'time' and 'status'.

 family
     One of the GLM or Cox models. Either "gaussian", "binomial", or "cox", depending on the response.

 method
     Method to be used to select the optimal model size. For method = "sequential", the best subset selection problem is solved for each s in 1, 2, …, s_{max}. At each model size s, the bess function is run with a warm start from the last solution with model size s-1. For method = "gsection", the best subset selection problem is solved over a range of non-continuous model sizes.

 s.min
     The minimum value of model sizes. Only used for method = "gsection". Default is 1.

 s.max
     The maximum value of model sizes. Only used for method = "gsection". Default is \min{p, n/\log(n)}.

 s.list
     A sequence of values representing the model sizes. Only used for method = "sequential". Default is (1, \min{p, n/\log(n)}).

 K.max
     The maximum number of iterations used for method = "gsection". Default is 20.

 max.steps
     The maximum number of iterations in the bess function. In linear regression, only a few steps are needed to guarantee convergence. Default is 15.

 glm.max
     The maximum number of iterations for solving the maximum likelihood problem on the active set at each step of the primal dual active set algorithm. Only used in logistic regression, for family = "binomial". Default is 1e6.

 cox.max
     The maximum number of iterations for solving the maximum partial likelihood problem on the active set at each step of the primal dual active set algorithm. Only used in Cox's model, for family = "cox". Default is 20.

 factor
     Which variable to be factored. Should be NULL or a numeric vector.

 epsilon
     The tolerance for the early stopping rule in the method "sequential". The early stopping rule is defined as \|Y-Xβ\|/n ≤ ε.

 weights
     Observation weights. Default is 1 for each observation.
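The default bound on the model size and the early stopping rule can be illustrated with a small base R sketch. Both helpers below are hypothetical and are not part of BeSS; rounding the bound down with floor() and reading the norm as Euclidean are assumptions of this sketch, since the text above does not specify either.

```r
# Hypothetical helpers (not BeSS functions) illustrating two argument rules.

# Default upper bound on the model size: min(p, n / log(n)).
# Rounding down with floor() is an assumption of this sketch.
default_s_max <- function(n, p) floor(min(p, n / log(n)))

# Early stopping check ||y - x beta|| / n <= epsilon, reading ||.|| as the
# Euclidean norm (an assumption; the source does not name the norm).
stop_early <- function(x, y, beta, epsilon = 1e-4) {
  sqrt(sum((y - x %*% beta)^2)) / nrow(x) <= epsilon
}

default_s_max(500, 20)    # 20: p is the smaller bound
default_s_max(500, 1000)  # 80: n / log(n) is about 80.46
```

With n = 500, the bound is driven by p in low dimensions and by n/log(n) once p exceeds roughly 80.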

Details

The best subset selection problem with model size s is

\min_β -2 logL(β) \;\;{\rm s.t.}\;\; \|β\|_0 ≤ s.

In the GLM case, logL(β) is the log-likelihood function; in the Cox model, logL(β) is the log partial likelihood function.
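To make the constrained problem concrete, the following base R sketch solves it by exhaustive search in the Gaussian case, where minimizing -2 logL(β) amounts to minimizing the residual sum of squares. This is only an illustration for small p and is not the algorithm used by bess; the function name best_subset_brute and the simulated data are hypothetical.

```r
# Exhaustive best subset search for a Gaussian linear model.
# Illustration only: NOT the bess algorithm, and feasible only for small p.
best_subset_brute <- function(x, y, s) {
  p <- ncol(x)
  best <- list(rss = Inf, subset = integer(0))
  for (k in seq_len(s)) {                      # all sizes with ||beta||_0 <= s
    for (idx in combn(p, k, simplify = FALSE)) {
      # Least squares fit with an intercept on the candidate subset
      fit <- lm.fit(cbind(1, x[, idx, drop = FALSE]), y)
      rss <- sum(fit$residuals^2)
      if (rss < best$rss) best <- list(rss = rss, subset = idx)
    }
  }
  best
}

# Simulated data: only columns 2 and 5 carry signal
set.seed(1)
n <- 100; p <- 8
x <- matrix(rnorm(n * p), n, p)
y <- 2 * x[, 2] - 3 * x[, 5] + rnorm(n, sd = 0.1)
res <- best_subset_brute(x, y, s = 2)
res$subset
```

The number of candidate subsets grows combinatorially in p, which is why a dedicated algorithm is needed beyond toy problems.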

For each candidate model size, the best subset selection problem is solved by the primal dual active set (PDAS) algorithm; see Wen et al. (2017) for details. This algorithm uses an active set updating strategy based on primal and dual variables, and fits the sub-model by exploiting the fact that their support sets are non-overlapping and complementary. For method = "sequential", the PDAS algorithm is run for a list of sequential model sizes, using the estimate from the previous model size as a warm start. For method = "gsection", a golden section search is used to efficiently determine the optimal model size.
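The active set update described above can be sketched for the least squares case in base R. This is a simplified illustration under stated assumptions, not the BeSS source: the function pdas_ls is hypothetical, the columns of x and the response are assumed to be centered so that no intercept is needed, and the ranking quantity delta follows the usual "sacrifice" form for a quadratic loss.

```r
# Sketch of a primal dual active set (PDAS) iteration for least squares with
# s nonzero coefficients. Simplified illustration, not the BeSS implementation.
pdas_ls <- function(x, y, s, max.steps = 15) {
  n <- nrow(x); p <- ncol(x)
  h <- colSums(x^2) / n                 # per-coordinate curvature of the loss
  beta <- numeric(p)                    # primal variable, starts at zero
  d <- drop(crossprod(x, y)) / n        # dual variable at beta = 0
  active <- integer(0)
  for (step in seq_len(max.steps)) {
    # Loss increase ("sacrifice") from forcing coordinate j to zero
    delta <- 0.5 * h * (beta + d / h)^2
    new_active <- sort(order(delta, decreasing = TRUE)[seq_len(s)])
    if (identical(new_active, active)) break   # support is stable
    active <- new_active
    # Primal update: least squares on the active set, zero elsewhere
    beta <- numeric(p)
    beta[active] <- qr.solve(x[, active, drop = FALSE], y)
    # Dual update: scaled gradient on the inactive set, zero on the active set
    d <- drop(crossprod(x, y - x %*% beta)) / n
    d[active] <- 0
  }
  list(beta = beta, active = active, steps = step)
}

# Simulated data: the first three columns carry signal
set.seed(1)
n <- 200; p <- 30
x <- scale(matrix(rnorm(n * p), n, p))
y <- drop(x %*% c(3, -2, 1.5, rep(0, p - 3))) + rnorm(n, sd = 0.5)
y <- y - mean(y)
fit <- pdas_ls(x, y, s = 3)
fit$active
```

The primal and dual variables have complementary supports by construction: beta is nonzero only on the active set and d only on the inactive set.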

Value

A list with class attribute 'bess' and named components:

 family
     Type of the model: "bess_gaussian" for linear model, "bess_binomial" for logistic model and "bess_cox" for Cox model.

 beta
     The best fitting coefficients of size s = 0, 1, …, p with the smallest loss function.

 lambda
     The lambda value in the Lagrangian form of the best subset selection problem with model size of s.

 bestmodel
     The best fitted model, the class of which is "lm", "glm" or "coxph".

 deviance
     The value of -2 \times logL.

 nulldeviance
     The value of -2 \times logL for null model.

 AIC
     The value of -2 \times logL + 2 \|β\|_0.

 BIC
     The value of -2 \times logL + log(n) \|β\|_0.

 EBIC
     The value of -2 \times logL + (log(n) + 2 \times log(p)) \|β\|_0.

 factor
     Which variable to be factored. Should be NULL or a numeric vector.
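The three information criteria differ only in the penalty applied per nonzero coefficient, which a short base R sketch makes explicit. The helper info_criteria is hypothetical and not part of BeSS; it takes the deviance (-2 \times logL) and the number of nonzero coefficients k as given, and whether an intercept counts toward k is left open here.

```r
# Hypothetical helper (not a BeSS function): information criteria computed
# from a deviance value (-2 * logL), the number of nonzero coefficients k,
# the sample size n and the number of candidate variables p.
info_criteria <- function(deviance, k, n, p) {
  c(AIC  = deviance + 2 * k,
    BIC  = deviance + log(n) * k,
    EBIC = deviance + (log(n) + 2 * log(p)) * k)
}

ic <- info_criteria(deviance = 100, k = 3, n = 500, p = 20)
ic
```

Because the EBIC penalty adds 2 \times log(p) per coefficient, it favors smaller models than BIC whenever p > 1, and the gap widens in high dimensional settings.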

Author(s)

Canhong Wen, Aijun Zhang, Shijie Quan, and Xueqin Wang.

References

Wen, C., Zhang, A., Quan, S. and Wang, X. (2017). BeSS: an R package for best subset selection in linear, logistic and CoxPH models. arXiv: 1709.06254.

See Also

bess.one, plot.bess, predict.bess.

Examples

  library(BeSS)

  #--------------linear model--------------#
  # Generate simulated data
  n <- 500
  p <- 20
  K <- 10
  sigma <- 1
  rho <- 0.2
  data <- gen.data(n, p, family = "gaussian", K, rho, sigma)

  # Best subset selection
  fit1 <- bess(data$x, data$y, family = "gaussian")
  print(fit1)
  #coef(fit1, sparse = TRUE)  # The estimated coefficients
  bestmodel <- fit1$bestmodel
  #summary(bestmodel)

  # Plot solution path and the loss function
  plot(fit1, type = "both", breaks = TRUE)

  ## Not run:
  #--------------logistic model--------------#
  # Generate simulated data
  data <- gen.data(n, p, family = "binomial", 5, rho, sigma)

  # Best subset selection
  fit2 <- bess(data$x, data$y, s.list = 1:10, method = "sequential",
               family = "binomial", epsilon = 0)
  print(fit2)
  #coef(fit2, sparse = TRUE)
  bestmodel <- fit2$bestmodel
  #summary(bestmodel)

  # Plot solution path and the loss function
  plot(fit2, type = "both", breaks = TRUE, K = 5)

  #--------------cox model--------------#
  # Generate simulated data
  data <- gen.data(n, p, 5, rho, sigma, c = 10, family = "cox", scal = 10)

  # Best subset selection
  fit3 <- bess(data$x, data$y, s.list = 1:10, method = "sequential",
               family = "cox")
  print(fit3)
  #coef(fit3, sparse = TRUE)
  bestmodel <- fit3$bestmodel
  #summary(bestmodel)

  # Plot solution path and the loss function
  plot(fit3, type = "both", breaks = TRUE, K = 5)

  #----------------------High dimensional linear models--------------------#
  p <- 1000
  data <- gen.data(n, p, family = "gaussian", K, rho, sigma)

  # Best subset selection
  fit <- bess(data$x, data$y, method = "sequential", family = "gaussian",
              epsilon = 1e-12)

  # Plot solution path
  plot(fit, type = "both", breaks = TRUE, K = 10)

  #---------------prostate---------------#
  data("prostate")
  x = prostate[, -9]
  y = prostate[, 9]

  fit.group = bess(x, y, s.list = 1:ncol(x), factor = c("gleason"))

  #---------------SAheart---------------#
  data("SAheart")
  y = SAheart[, 5]
  x = SAheart[, -5]
  # Recode ldl into three ordered levels before treating it as a factor
  x$ldl[x$ldl < 5] = 1
  x$ldl[x$ldl >= 5 & x$ldl < 10] = 2
  x$ldl[x$ldl >= 10] = 3

  fit.group = bess(x, y, s.list = 1:ncol(x), factor = c("ldl"), family = "binomial")

  ## End(Not run)