bess: Best subset selection

Description Usage Arguments Details Value Author(s) References See Also Examples

View source: R/bess.R

Description

Best subset selection for generalized linear model and Cox's proportional model.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
bess(x, y, family = c("gaussian", "binomial", "cox"),
     method = "gsection", s.min = 1,
     s.max,
     s.list,
     K.max = 20,
     max.steps = 15,
     glm.max = 1e6,
     cox.max = 20,
	 factor = NULL,
     epsilon = 1e-4,
	 weights=rep(1,nrow(x)))

Arguments

x

Input matrix,of dimension n x p; each row is an observation vector.

y

Response variable,of length n. For family="binomial" should be a factor with two levels. For family="cox", y should be a two-column matrix with columns named 'time' and 'status'.

family

One of the GLM or Cox models. Either "gaussian", "binomial", or "cox", depending on the response.

method

Methods tobe used to select the optimal model size. For method = "sequential", we solve the best subset selection problem for each s in 1,2,…,s_{max}. At each model size s, we run the bess function with a warm start from the last solution with model size s-1. For method = "gsection", we solve the best subset selection problem with a range non-coninuous model sizes.

s.min

The minimum value of model sizes. Only used for method = "gsection". Default is 1.

s.max

The maximum value of model sizes. Only used for method = "gsection". Default is \min{p, n/\log(n)}.

s.list

A list of sequential value representing the model sizes. Only used for method = "sequential".Default is (1,\min{p, n/\log(n)}).

K.max

The maximum iterations used for method = "gsection"

max.steps

The maximum number of iterations in bess function. In linear regression, only a few steps can gurantee the convergence. Default is 15.

glm.max

The maximum number of iterations for solving the maximum likelihood problem on the active set at each step in the primal dual active set algorithm.Only used in the logistic regression for family="binomial". Default is 1e6.

cox.max

The maximum number of iterations for solving the maximum partial likelihood problem on the active set at each step in the primal dual active set algorithm. Only used in Cox's model for family="cox". Default is 20.

factor

Which variable to be factored. Should be NULL or a numeric vector.

epsilon

The tolerance for an early stoping rule in the method "sequential". The early stopping rule is defined as \|Y-Xβ\|/n ≤q ε.

weights

Observation weights. Default is 1 for each observation

Details

The best subset selection problem with model size s is

\min_β -2 logL(β) \;\;{\rm s.t.}\;\; \|β\|_0 ≤q s.

In the GLM case, logL(β) is the log-likelihood function; In the Cox model, logL(β) is the log parital likelihood function.

For each candiate model size, the best subset selection problem is solved by the primal dual active set(PDAS) algorithm, see Wen et al(2017) for details. This algorithm utilizes an active set updating strategy via primal and dual vairables and fits the sub-model by exploiting the fact that their support set are non-overlap and complementary. For the case of method = "sequential", we run the PDAS algorithm for a list of sequential model sizes and use the estimate from last iteration as a warm start. For the case of method = "gsection", a golden section search technique is adopted to efficiently determine the optimal model size.

Value

A list with class attribute 'bess' and named components:

family

Types of the model: "bess_gaussian" for linear model,"bess_binomial" for logistic model and "bess_cox" for Cox model.

beta

The best fitting coefficients of size s=0,1,…,p with the smallest loss function.

lambda

The lambda value in the Lagrangian form of the best subset selection problem with model size of s.

bestmodel

The best fitted model, the class of which is "lm", "glm" or "coxph"

deviance

The value of -2\times logL.

nulldeviance

The value of -2\times logL for null model.

AIC

The value of -2\times logL + 2 \|β\|_0.

BIC

The value of -2\times logL+ log(n) \|β\|_0.

EBIC

The value of -2\times logL+ (log(n)+2\times log(p)) \|β\|_0.

factor

Which variable to be factored. Should be NULL or a numeric vector.

Author(s)

Canhong Wen, Aijun Zhang, Shijie Quan, and Xueqin Wang.

References

Wen, C., Zhang, A., Quan, S. and Wang, X. (2020). BeSS: An R Package for Best Subset Selection in Linear, Logistic and Cox Proportional Hazards Models, Journal of Statistical Software, Vol. 94(4). doi:10.18637/jss.v094.i04.

See Also

bess.one, plot.bess, predict.bess.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
#--------------linear model--------------#
# Generate simulated data
n <- 500
p <- 20
K <-10
sigma <- 1
rho <- 0.2
data <- gen.data(n, p, family = "gaussian", K, rho, sigma)

# Best subset selection
fit1 <- bess(data$x, data$y, family = "gaussian")
print(fit1)
#coef(fit1, sparse=TRUE)  # The estimated coefficients
bestmodel <- fit1$bestmodel
#summary(bestmodel)

# Plot solution path and the loss function
plot(fit1, type = "both", breaks = TRUE)

## Not run:
#--------------logistic model--------------#

# Generate simulated data
data <- gen.data(n, p, family="binomial", 5, rho, sigma)

# Best subset selection
fit2 <- bess(data$x, data$y, s.list = 1:10, method = "sequential",
             family = "binomial", epsilon = 0)
print(fit2)
#coef(fit2, sparse = TRUE)
bestmodel <- fit2$bestmodel
#summary(bestmodel)

# Plot solution path and the loss function
plot(fit2, type = "both", breaks = TRUE, K = 5)

#--------------cox model--------------#

# Generate simulated data
data <- gen.data(n, p, 5, rho, sigma, c = 10, family = "cox", scal = 10)

# Best subset selection
fit3 <- bess(data$x, data$y, s.list = 1:10, method = "sequential",
             family = "cox")
print(fit3)
#coef(fit3, sparse = TRUE)
bestmodel <- fit3$bestmodel
#summary(bestmodel)

# Plot solution path and the loss function
plot(fit3, type = "both", breaks = TRUE, K = 5)


#----------------------High dimensional linear models--------------------#

p <- 1000
data <- gen.data(n, p, family = "gaussian", K, rho, sigma)

# Best subset selection
fit <- bess(data$x, data$y, method="sequential", family = "gaussian", epsilon = 1e-12)

# Plot solution path
plot(fit, type = "both", breaks = TRUE, K = 10)


data("prostate")
x = prostate[,-9]
y = prostate[,9]

fit.group = bess(x, y, s.list = 1:ncol(x), factor = c("gleason"))


#---------------SAheart---------------#
data("SAheart")
y = SAheart[,5]
x = SAheart[,-5]
x$ldl[x$ldl<5] = 1
x$ldl[x$ldl>=5&x$ldl<10] = 2
x$ldl[x$ldl>=10] = 3

fit.group = bess(x, y, s.list = 1:ncol(x), factor = c("ldl"), family = "binomial")
## End(Not run)

Example output

     Df          MSE        AIC        BIC        GIC
[1,]  1 1.017382e+04 4615.78645 4620.00106 4619.25936
[2,] 20 9.811569e-01   30.48857  114.78073   99.94679
[3,] 13 9.829261e-01   17.38931   72.17922   62.53716
[4,]  8 1.133808e+01 1230.08350 1263.80037 1257.86679
[5,] 11 9.905317e-01   17.24328   63.60397   55.44531
[6,] 10 9.966734e-01   18.33391   60.47999   53.06302
[7,]  9 1.653435e+00  269.42739  307.35886  300.68359
      Df      Dev      AIC      BIC      GIC
 [1,]  1 537.2026 539.2026 543.4172 542.6755
 [2,]  2 459.3700 463.3700 471.7992 470.3158
 [3,]  3 355.9026 361.9026 374.5465 372.3214
 [4,]  4 222.1825 230.1825 247.0409 244.0741
 [5,]  5 125.0304 135.0304 156.1035 152.3950
 [6,]  6 121.9117 133.9117 159.1993 154.7491
 [7,]  7 116.6141 130.6141 160.1163 154.9245
 [8,]  8 114.6736 130.6736 164.3905 158.4569
 [9,]  9 113.9491 131.9491 169.8806 163.2053
[10,] 10 111.5194 131.5194 173.6655 166.2485
censoring rate: 0.146 
      Df      Dev      AIC      BIC      GIC
 [1,]  1 4311.034 4313.034 4317.248 4316.507
 [2,]  2 4127.956 4131.956 4140.385 4138.902
 [3,]  3 3977.746 3983.746 3996.390 3994.165
 [4,]  4 3444.053 3452.053 3468.911 3465.944
 [5,]  5 2618.090 2628.090 2649.163 2645.454
 [6,]  6 2611.283 2623.283 2648.571 2644.121
 [7,]  7 2606.873 2620.873 2650.376 2645.184
 [8,]  8 2604.756 2620.756 2654.472 2648.539
 [9,]  9 2602.870 2620.870 2658.801 2652.126
[10,] 10 2601.346 2621.346 2663.492 2656.075

BeSS documentation built on April 22, 2021, 9:08 a.m.

Related to bess in BeSS...