smle_select: Elaborative post-screening selection with SMLE


smle_select    R Documentation

Elaborative post-screening selection with SMLE

Description

The features retained after screening are still likely to contain some that are not related to the response. The function smle_select() is designed to further identify the relevant features using SMLE(). Given a response and a set of K features, this function first runs SMLE(fast = TRUE) to generate a series of sub-models with sparsity k varying from k_min to k_max. It then selects the best model from the series based on a selection criterion.
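The selection step described above can be illustrated with a minimal base-R sketch. It is not the package's internal code: the candidate sub-models here come from a hypothetical ordering of features by marginal correlation (a stand-in for the series that SMLE() would generate), the data are simulated, and the helper name ebic() is made up for illustration. The criterion follows the extended BIC form, -2 log L + k log(n) + 2 γ log C(p, k).

```r
# Illustrative sketch only; assumes a Gaussian linear model and simulated data.
set.seed(1)
n <- 100
p <- 10
X <- matrix(rnorm(n * p), n, p)
Y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n)

# Hypothetical helper: EBIC of a fitted model with k features out of p.
ebic <- function(fit, k, p, gamma) {
  -2 * as.numeric(logLik(fit)) + k * log(nobs(fit)) + 2 * gamma * lchoose(p, k)
}

# Stand-in candidate path: features ordered by absolute marginal correlation.
ord <- order(abs(cor(X, Y)), decreasing = TRUE)

# Evaluate one sub-model per sparsity level k in k_min:k_max.
k_min <- 1
k_max <- p
k_seq <- k_min:k_max
crit <- sapply(k_seq, function(k) {
  fit <- lm(Y ~ X[, ord[1:k], drop = FALSE])
  ebic(fit, k, p, gamma = 0.5)
})

# The best model is the one minimizing the criterion.
k_best <- k_seq[which.min(crit)]
ID_selected <- sort(ord[1:k_best])
```

Setting gamma = 0 in the helper reduces the criterion to the ordinary BIC, which is why a single function can cover both cases in this sketch.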

When criterion = "ebic" is used, users can choose to repeat the selection with different values of the tuning parameter γ and conduct importance voting for each feature. When vote = TRUE, this function repeats the selection for each value of γ in gamma_seq; features whose selection frequency is higher than vote_threshold are returned in ID_voted.
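The voting idea can be sketched in base R as follows. This is a hedged illustration, not the package's implementation: it assumes that a feature's frequency is the fraction of γ values at which it is selected and that the comparison with vote_threshold is strict. The helper select_ids() and the correlation-based candidate path are hypothetical, and the data are simulated.

```r
# Illustrative sketch only; the voting rule below is an assumption.
set.seed(1)
n <- 100
p <- 10
X <- matrix(rnorm(n * p), n, p)
Y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n)

# Stand-in candidate path: features ordered by absolute marginal correlation.
ord <- order(abs(cor(X, Y)), decreasing = TRUE)

# Hypothetical helper: features of the EBIC-minimizing sub-model for one gamma.
select_ids <- function(gamma) {
  crit <- sapply(seq_len(p), function(k) {
    fit <- lm(Y ~ X[, ord[1:k], drop = FALSE])
    -2 * as.numeric(logLik(fit)) + k * log(n) + 2 * gamma * lchoose(p, k)
  })
  ord[seq_len(which.min(crit))]
}

gamma_seq <- seq(0, 1, 0.2)
vote_threshold <- 0.6

# One selected set per gamma value.
selected <- lapply(gamma_seq, select_ids)

# Pool of every feature selected at least once.
ID_pool <- sort(unique(unlist(selected)))

# Fraction of gamma values at which each pooled feature was selected.
freq <- sapply(ID_pool, function(j) {
  mean(sapply(selected, function(s) j %in% s))
})

# Keep features whose frequency exceeds the threshold.
ID_voted <- ID_pool[freq > vote_threshold]
```

Larger values of γ penalize model size more heavily, so features that survive across most of gamma_seq are the ones least sensitive to the choice of penalty.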

Usage

smle_select(object, ...)

## S3 method for class 'sdata'
smle_select(
  object,
  k_min = 1,
  k_max = NULL,
  subset = NULL,
  gamma_ebic = 0.5,
  vote = FALSE,
  keyset = NULL,
  criterion = "ebic",
  codingtype = c("DV", "standard", "all"),
  gamma_seq = c(seq(0, 1, 0.2)),
  vote_threshold = 0.6,
  parallel = FALSE,
  num_clusters = NULL,
  ...
)

## Default S3 method:
smle_select(
  object = NULL,
  Y = NULL,
  X = NULL,
  family = "gaussian",
  keyset = NULL,
  ...
)

## S3 method for class 'smle'
smle_select(object, ...)

Arguments

object

Object of class 'smle' or 'sdata'. Users can also input a response vector and a feature matrix.

...

Further arguments passed to or from other methods.

k_min

The lower bound of candidate model sparsity. Default is 1.

k_max

The upper bound of candidate model sparsity. Default is the number of columns in the feature matrix.

subset

An index vector indicating which features (columns of the feature matrix) are to be selected. Not applicable if a 'smle' object is the input.

gamma_ebic

The EBIC tuning parameter, in [0, 1]. Default is 0.5.

vote

A logical flag indicating whether to perform the voting procedure. Only available when criterion = "ebic". Default is FALSE.

keyset

A numeric vector with column indices for the key features that do not participate in feature screening and are forced to remain in the model. See SMLE for details.

criterion

Selection criterion. One of "ebic", "bic", or "aic". Default is "ebic".

codingtype

Coding types for categorical features; for more details see SMLE() documentation.

gamma_seq

The sequence of values for gamma_ebic when vote = TRUE.

vote_threshold

A relative voting threshold, expressed as a proportion (for example, 0.6 for 60%). A feature is considered important when the share of votes it receives passes the threshold. Default is 0.6.

parallel

A logical flag to use parallel computing to do voting selection. Default is FALSE. See Details.

num_clusters

The number of compute clusters to use when parallel = TRUE. Default is twice the number of cores detected.

Y

Input response vector (when object = NULL).

X

Input feature matrix (when object = NULL).

family

Model assumption; see SMLE() documentation. Default is Gaussian linear.

When input is a 'smle' or 'sdata' object, the same model will be used in the selection.

Details

This function accepts three types of input: 1) a 'smle' object, as output by SMLE(); 2) a 'sdata' object, as output by Gen_Data(); 3) a response vector and a feature matrix supplied directly by the user.
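As a hedged usage sketch, the calls below cover the first and third input types, using only arguments listed in the Usage section. It assumes the SMLE package is installed, that Gen_Data() with these settings returns at least 20 feature columns, and that criterion and k_max are forwarded through ... by the default method; the choice of the first 20 columns and of k_max = 10 is arbitrary and only keeps the example small.

```r
library(SMLE)

set.seed(1)
Data <- Gen_Data(correlation = "MA", family = "gaussian")

# 1) 'smle' object: screen first, then select among the retained features.
fit <- SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")
sel_from_smle <- smle_select(fit, criterion = "ebic")

# 3) Response vector and feature matrix supplied directly.
# A small, arbitrary block of columns is used here so that the function
# is not applied to the full unscreened feature matrix.
X_small <- Data$X[, 1:20]
sel_from_xy <- smle_select(
  Y = Data$Y,
  X = X_small,
  family = "gaussian",
  criterion = "bic",
  k_max = 10
)

# Selected feature indices, as described in the Value section.
sel_from_smle$ID_selected
sel_from_xy$ID_selected
```

When the input is a 'smle' object, the model family is taken from that object, so family does not need to be repeated in the call.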

Note that this function is mainly designed to conduct an elaborative selection after feature screening. We do not recommend using it directly for ultra-high-dimensional data without screening.

Value

call

The call that produced this object.

ID_selected

A list of selected features.

coef_selected

Fitted model coefficients.

intercept

Fitted model intercept.

criterion_value

Values of selection criterion for the candidate models with various sparsity.

categorical

A logical flag indicating whether the input feature matrix includes categorical features.

ID_pool

A vector containing all features selected during voting.

ID_voted

A vector containing the features selected when vote = TRUE.

CI

Indices of categorical features when categorical = TRUE.

X, Y, family, gamma_ebic, gamma_seq, criterion, vote, codingtype, and vote_threshold are returned as passed in the function call.

References

Chen, J. and Chen, Z. (2012). "Extended BIC for small-n-large-P sparse GLM." Statistica Sinica, 22(2), 555-574.

Examples


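set.seed(1)
# Simulate Gaussian data with "MA" feature correlation.
Data <- Gen_Data(correlation = "MA", family = "gaussian")
# Screening step: retain k = 20 features.
fit <- SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")

# Post-screening selection using BIC.
fit_bic <- smle_select(fit, criterion = "bic")
summary(fit_bic)

# Post-screening selection using EBIC, with voting over gamma_seq.
fit_ebic <- smle_select(fit, criterion = "ebic", vote = TRUE)
summary(fit_ebic)
plot(fit_ebic)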



SMLE documentation built on Jan. 22, 2023, 1:55 a.m.