SODA algorithm for variable and interaction selection

Description

SODA is a forward-backward variable and interaction selection algorithm under a logistic regression model with second-order terms. In the forward stage, a stepwise procedure screens for important predictors with main and interaction effects; in the backward stage, SODA removes insignificant terms so as to optimize the extended BIC (EBIC) criterion. SODA is applicable to variable selection for logistic regression, linear/quadratic discriminant analysis, and other discriminant analyses whose generative model belongs to the exponential family.
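To illustrate what "second-order terms" means here, the following minimal sketch expands a set of candidate variables into main effects plus all squares and pairwise products. The helper second_order_terms() and the column naming are hypothetical and only for illustration; they are not part of the package and do not reproduce its internal code.

```r
# Hypothetical helper (not part of the package): build main effects plus all
# second-order terms (squares and pairwise products) for the chosen columns.
second_order_terms <- function(xx, vars) {
  out <- xx[, vars, drop = FALSE]
  colnames(out) <- paste0("X", vars)
  for (a in seq_along(vars)) {
    for (b in a:length(vars)) {
      term <- xx[, vars[a]] * xx[, vars[b]]
      out <- cbind(out, term)
      colnames(out)[ncol(out)] <- paste0("X", vars[a], "*X", vars[b])
    }
  }
  out
}

set.seed(1)
xx <- matrix(rnorm(200 * 5), 200, 5)

# Main effects X1, X3 and second-order terms X1*X1, X1*X3, X3*X3
design <- second_order_terms(xx, c(1, 3))
colnames(design)
```

A logistic regression on such a design matrix (for example with glm(yy ~ design, family = binomial())) is the kind of model whose terms the forward and backward stages add and remove.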

Usage

soda(xx, yy, norm = FALSE, debug = FALSE, gam = 0, minF = 3)

Arguments

xx

The design matrix, of dimensions n * p, without an intercept. Each row is an observation vector.

yy

The response vector of dimension n * 1.

norm

Logical flag: if TRUE, each variable in xx is quantile-normalized to a standard normal distribution before the SODA algorithm is run. Default is norm=FALSE. Quantile normalization is suggested if the data contain obvious outliers.
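As a rough illustration of quantile normalization to a standard normal, the sketch below maps the ranks of one variable to normal quantiles. The helper quantile_normalize() and the (rank - 0.5) / n offset are assumptions made for this example; the exact transformation applied when norm=TRUE may differ.

```r
# Hypothetical helper: rank-based mapping of one variable to standard normal
# scores. The (rank - 0.5) / n offset is an assumption of this sketch.
quantile_normalize <- function(x) {
  n <- length(x)
  qnorm((rank(x, ties.method = "average") - 0.5) / n)
}

x <- c(0.1, 0.3, 0.2, 50)   # the last value is an obvious outlier
z <- quantile_normalize(x)
z                           # ordering is preserved, the outlier is pulled in
```

Because only the ranks are used, a single extreme value cannot dominate the fitted coefficients the way it can on the raw scale.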

debug

Logical flag for printing debug information.

gam

Tuning parameter gamma in the extended BIC criterion.

EBIC for selected set S:

EBIC = -2 * log-likelihood + |S| * log(n) + 2 * |S| * gamma * log(p)
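The formula above can be evaluated directly for any fitted logistic model. The sketch below is illustrative only: ebic_score() is a hypothetical helper, and treating |S| as the number of selected terms excluding the intercept is an assumption of this example.

```r
# Hypothetical helper implementing the EBIC formula as written above.
# n_terms plays the role of |S|; whether the package counts the intercept
# in |S| is not stated, so it is excluded here by assumption.
ebic_score <- function(loglik, n_terms, n, p, gam = 0) {
  -2 * loglik + n_terms * log(n) + 2 * n_terms * gam * log(p)
}

set.seed(1)
n <- 200
p <- 50
xx <- matrix(rnorm(n * p), n, p)
yy <- rbinom(n, 1, plogis(1 + xx[, 1]))

# Logistic regression with a single selected main effect
fit <- glm(yy ~ xx[, 1], family = binomial())
ll  <- as.numeric(logLik(fit))

ebic_gam0  <- ebic_score(ll, n_terms = 1, n = n, p = p, gam = 0)
ebic_gam05 <- ebic_score(ll, n_terms = 1, n = n, p = p, gam = 0.5)
c(ebic_gam0, ebic_gam05)
```

With gam = 0 the last term vanishes and the score reduces to the ordinary BIC penalty; larger values of gam penalize each added term more heavily as p grows.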

minF

Minimum number of steps in forward interaction screening. Default is minF=3.

Value

EBIC

Trace of extended Bayesian information criterion (EBIC) score.

Type

Trace of step type ("Forward (Main)", "Forward (Int)", "Backward").

Var

Trace of selected variables.

Term

Trace of selected main and interaction terms.

final_EBIC

EBIC score of the final selected term set.

final_Var

Final selected variables.

final_Term

Final selected main and interaction terms.

Author(s)

Yang Li, Jun S. Liu

References

Li Y, Liu JS. (2015). Robust variable and interaction selection for high-dimensional classification via logistic regression. Technical Report.

Examples

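# Simulation study with 1 main effect and 2 interactions (uncomment the code to run)
# mvrnorm() is provided by the MASS package; soda() and soda_trace_CV() are
# assumed to be provided by the sodavis package.
#library(MASS)
#library(sodavis)
#N = 250;    # number of observations
#p = 1000;   # number of predictors
#r = 0.5;    # correlation decay rate between predictors
#s = 1;      # scale factor for the covariance
#H = abs(outer(1:p, 1:p, "-"))   # |i - j| distance between predictor indices
#S = s * r^H;                    # covariance matrix with entries s * r^|i - j|
#S[cbind(1:p, 1:p)] = S[cbind(1:p, 1:p)] * s   # rescale the diagonal (no effect when s = 1)

#xx = as.matrix(data.frame(mvrnorm(N, rep(0,p), S)));
# True model: main effect of X1, quadratic term X10^2, interaction X10*X20
#zz = 1 + xx[,1] - xx[,10]^2 + xx[,10]*xx[,20];
#yy = as.numeric(runif(N) < exp(zz) / (1+exp(zz)))

#res_SODA = soda(xx, yy, gam=0.5);
#cv_SODA  = soda_trace_CV(xx, yy, res_SODA)
#cv_SODA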

# Michigan lung cancer dataset (uncomment the code to run)
#data(mich_lung);
#res_SODA = soda(mich_lung_xx, mich_lung_yy, gam=0.5);
#cv_SODA  = soda_trace_CV(mich_lung_xx, mich_lung_yy, res_SODA)
#cv_SODA