RaScreen: Variable screening via RaSE.
In RaSEn: Random Subspace Ensemble Classification and Variable Screening

RaSE is a general framework for variable screening. In RaSE screening, each of the B1 subspaces is selected by generating B2 random subspaces and choosing the optimal one according to some criterion. The proportion (equivalently, percentage) of the B1 selected subspaces that contain each variable is then used as an importance measure to rank the variables.
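The procedure described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the implementation used by `RaScreen`: it assumes a linear model with BIC as the criterion, no iteration, and the hierarchical uniform distribution for subspaces. The function `rase_screen_sketch` and the simulated data are hypothetical.

```r
# Minimal base-R sketch of one RaSE screening round (hypothetical helper,
# not part of RaSEn). Assumes model = 'lm' and criterion = 'bic'.
rase_screen_sketch <- function(x, y, B1 = 50, B2 = 20, D = 5) {
  p <- ncol(x)
  counts <- numeric(p)  # how often each variable appears in a selected subspace
  for (b1 in seq_len(B1)) {
    best_bic <- Inf
    best_subspace <- integer(0)
    for (b2 in seq_len(B2)) {
      # Hierarchical uniform draw: size d from 1, ..., D, then a subset of size d
      d <- sample.int(D, 1)
      subspace <- sample.int(p, d)
      xs <- x[, subspace, drop = FALSE]
      colnames(xs) <- paste0("V", subspace)
      fit <- lm(y ~ ., data = data.frame(y = y, xs))
      # stats::BIC counts the intercept and error variance as parameters,
      # which shifts every candidate by the same constant
      bic <- BIC(fit)
      if (bic < best_bic) {
        best_bic <- bic
        best_subspace <- subspace
      }
    }
    counts[best_subspace] <- counts[best_subspace] + 1
  }
  counts / B1  # selected proportion of each variable
}

# Simulated data (an assumption for illustration): only variables 1 and 2 matter
set.seed(1)
n <- 100
p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- 2 * x[, 1] - 2 * x[, 2] + rnorm(n)

selected_prop <- rase_screen_sketch(x, y)
# Indices of the five variables with the largest selected proportions
head(order(selected_prop, decreasing = TRUE), 5)
```

Variables that appear in many of the B1 selected subspaces receive a large proportion and are ranked first.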

RaScreen(
  xtrain,
  ytrain,
  xval = NULL,
  yval = NULL,
  B1 = 200,
  B2 = NULL,
  D = NULL,
  dist = NULL,
  model = NULL,
  criterion = NULL,
  k = 5,
  cores = 1,
  seed = NULL,
  iteration = 0,
  cv = 5,
  scale = FALSE,
  C0 = 0.1,
  kl.k = NULL,
  classification = NULL,
  ...
)

`xtrain`	n * p observation matrix. n observations, p features.
`ytrain`	n 0/1 observations.
`xval`	observation matrix for validation. Default = `NULL`. Useful only when `criterion` = 'validation'.
`yval`	0/1 observation for validation. Default = `NULL`. Useful only when `criterion` = 'validation'.
`B1`	the number of weak learners. Default = 200.
`B2`	the number of subspace candidates generated for each weak learner. Default = `NULL`, which will set B2 = 20 * floor(p/D).
`D`	the maximal subspace size when generating random subspaces. Default = `NULL`, which sets `D` = min(√n0, √n1, p) when `model` = 'qda', and `D` = min(√n, p) otherwise. Here n0 and n1 are the numbers of observations in classes 0 and 1.
`dist`	the distribution for features when generating random subspaces. Default = `NULL`, which represents the hierarchical uniform distribution. First generate an integer d from 1,...,D uniformly, then uniformly generate a subset with cardinality d.
`model`	the model to use. Default = 'lda' when `classification` = TRUE and 'lm' when `classification` = FALSE.
	lm: linear regression. Only available for regression.
	lda: linear discriminant analysis. `lda` in `MASS` package. Only available for classification.
	qda: quadratic discriminant analysis. `qda` in `MASS` package. Only available for classification.
	knn: k-nearest neighbor. `knn` and `knn.cv` in `class` package, `knn3` and `knnreg` in `caret` package.
	logistic: logistic regression. `glmnet` in `glmnet` package. Only available for classification.
	tree: decision tree. `rpart` in `rpart` package. Only available for classification.
	svm: support vector machine. If the kernel is not specified by the user, the RBF kernel is used. `svm` in `e1071` package.
	randomforest: random forest. `randomForest` in `randomForest` package and `ranger` in `ranger` package.
	kernelknn: k-nearest neighbor with different kernels. Relies on function `KernelKnn` in `KernelKnn` package. Arguments `method` and `weights_function` are required. Multiple choices are available for these arguments. See the documentation of function `KernelKnn` for details.
`criterion`	the criterion to choose the best subspace. Default = 'ric' when `model` = 'lda', 'qda'; default = 'bic' when `model` = 'lm' or 'logistic'; default = 'loo' when `model` = 'knn'; default = 'cv' and set `cv` = 5 when `model` = 'tree', 'svm', 'randomforest'.
	ric: minimizing ratio information criterion (RIC) with parametric estimation (Tian, Y. and Feng, Y., 2020). Available for binary classification and `model` = 'lda', 'qda', or 'logistic'.
	nric: minimizing ratio information criterion (RIC) with non-parametric estimation (Tian, Y. and Feng, Y., 2020). Available for binary classification and `model` = 'lda', 'qda', or 'logistic'.
	training: minimizing training error/MSE. Not available when `model` = 'knn'.
	loo: minimizing leave-one-out error/MSE. Only available when `model` = 'knn'.
	validation: minimizing validation error/MSE based on the validation data.
	cv: minimizing k-fold cross-validation error/MSE. k equals the value of `cv`. Default = 5.
	aic: minimizing Akaike information criterion (Akaike, H., 1973). Available when `model` = 'lm' or 'logistic'. AIC = -2 * log-likelihood + |S| * 2.
	bic: minimizing Bayesian information criterion (Schwarz, G., 1978). Available when `model` = 'lm' or 'logistic'. BIC = -2 * log-likelihood + |S| * log(n).
	ebic: minimizing extended Bayesian information criterion (Chen, J. and Chen, Z., 2008; 2012). `gam` value is needed. When `gam` = 0, it reduces to BIC. Available when `model` = 'lm' or 'logistic'. eBIC = -2 * log-likelihood + |S| * log(n) + 2 * |S| * gam * log(p).
`k`	the number of nearest neighbors. Only used when `model` = 'knn' or 'kernelknn'. `k` is required to be a positive integer. Default = 5.
`cores`	the number of cores used for parallel computing. Default = 1.
`seed`	the random seed assigned at the start of the algorithm, which can be a real number or `NULL`. Default = `NULL`, in which case no random seed will be set.
`iteration`	the number of iterations. Default = 0.
`cv`	the number of cross-validation folds. Default = 5. Only useful when `criterion` = 'cv'.
`scale`	whether to normalize the data. Logical, default = FALSE.
`C0`	a positive constant used when `iteration` > 1. See Tian, Y. and Feng, Y., 2021 for details. Default = 0.1.
`kl.k`	the number of nearest neighbors used to estimate RIC in a non-parametric way. Default = `NULL`, which means that k0 = floor(√ n0) and k1 = floor(√ n1). See Tian, Y. and Feng, Y., 2020 for details. Only available when `criterion` = 'nric'.
`classification`	the indicator of the problem type, which can be TRUE, FALSE or `NULL`. Default = `NULL`, which will automatically set `classification` = TRUE if the number of unique response values is at most 10. Otherwise, it will be set to FALSE.
`...`	additional arguments.
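The default values of `D` and `B2` and the AIC, BIC and eBIC formulas given in the argument descriptions can be written out as follows. This is only a sketch: the helper names `default_D`, `default_B2` and `info_criterion` are hypothetical and are not exported by the package, and rounding `D` down to an integer with `floor` is an assumption made here because a subspace size must be a whole number.

```r
# Hypothetical helpers that restate the documented defaults and formulas.

# Default maximal subspace size. floor() is an assumption of this sketch.
default_D <- function(n, p, n0 = NA, n1 = NA, model = "lda") {
  if (model == "qda") {
    floor(min(sqrt(n0), sqrt(n1), p))
  } else {
    floor(min(sqrt(n), p))
  }
}

# Default number of subspace candidates per weak learner: 20 * floor(p/D)
default_B2 <- function(p, D) {
  20 * floor(p / D)
}

# AIC, BIC and eBIC for a subspace of size s with a given log-likelihood
info_criterion <- function(loglik, s, n, p,
                           type = c("aic", "bic", "ebic"), gam = 0) {
  type <- match.arg(type)
  switch(type,
    aic  = -2 * loglik + 2 * s,
    bic  = -2 * loglik + s * log(n),
    ebic = -2 * loglik + s * log(n) + 2 * s * gam * log(p)
  )
}

D  <- default_D(n = 100, p = 100)   # sqrt(100) = 10
B2 <- default_B2(p = 100, D = D)    # 20 * floor(100 / 10) = 200

# eBIC with gam = 0 coincides with BIC
info_criterion(-50, s = 3, n = 100, p = 100, type = "ebic", gam = 0)
info_criterion(-50, s = 3, n = 100, p = 100, type = "bic")
```

With `gam` > 0, eBIC adds a penalty that grows with log(p), so larger subspaces are penalized more heavily when the dimension is high.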

A list including the following items.

`model`	the model used in RaSE screening.
`criterion`	the criterion to choose the best subspace for each weak learner.
`B1`	the number of selected subspaces.
`B2`	the number of subspace candidates generated for each of B1 subspaces.
`n`	the sample size.
`p`	the dimension of data.
`D`	the maximal subspace size when generating random subspaces.
`iteration`	the number of iterations.
`selected.perc`	A list of length (`iteration`+1) recording the selected percentages of each feature in B1 subspaces. When it is of length 1, the result will be automatically transformed to a vector.
`scale`	a list of scaling parameters, including the scaling center and the scale parameter for each feature. Equal to `NULL` when the data is not scaled by `RaScreen`.
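As an illustration of how `selected.perc` can be used, the base-R sketch below ranks features by their selected percentages. The percentages are made-up values and the helpers `last_round` and `top_k` are hypothetical; this is not how `RaRank` is implemented, only a picture of the ranking idea.

```r
# Made-up selected percentages for p = 6 features (illustration only)
selected_perc <- c(0.10, 0.85, 0.05, 0.60, 0.00, 0.30)

# selected.perc is a vector when iteration = 0 and a list otherwise;
# this hypothetical helper takes the percentages from the last round
last_round <- function(perc) {
  if (is.list(perc)) perc[[length(perc)]] else perc
}

# Indices of the k features with the largest selected percentages
top_k <- function(perc, k) {
  head(order(last_round(perc), decreasing = TRUE), k)
}

top_k(selected_perc, 3)
# → 2 4 6
```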

Tian, Y. and Feng, Y., 2021(a). RaSE: A variable screening framework via random subspace ensembles. Journal of the American Statistical Association, (just-accepted), pp.1-30.

Tian, Y. and Feng, Y., 2021(b). RaSE: Random subspace ensemble classification. Journal of Machine Learning Research, 22(45), pp.1-93.

Chen, J. and Chen, Z., 2008. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3), pp.759-771.

Chen, J. and Chen, Z., 2012. Extended BIC for small-n-large-P sparse GLM. Statistica Sinica, pp.555-574.

Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics, 6(2), pp.461-464.

Rase, RaRank.

set.seed(0, kind = "L'Ecuyer-CMRG")
train.data <- RaModel("screening", 1, n = 100, p = 100)
xtrain <- train.data$x
ytrain <- train.data$y

# test RaSE screening with linear regression model and BIC
fit <- RaScreen(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, model = 'lm',
                cores = 2, criterion = 'bic')

# Select D variables
RaRank(fit, selected.num = "D")


## Not run: 
# test RaSE screening with knn model and 5-fold cross-validation MSE
fit <- RaScreen(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, model = 'knn',
                cores = 2, criterion = 'cv', cv = 5)

# Select n/logn variables
RaRank(fit, selected.num = "n/logn")


# test RaSE screening with SVM and 5-fold cross-validation MSE
fit <- RaScreen(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, model = 'svm',
                cores = 2, criterion = 'cv', cv = 5)

# Select n/logn variables
RaRank(fit, selected.num = "n/logn")


# test RaSE screening with logistic regression model and eBIC (gam = 0.5). Set iteration number = 1
train.data <- RaModel("screening", 6, n = 100, p = 100)
xtrain <- train.data$x
ytrain <- train.data$y

fit <- RaScreen(xtrain, ytrain, B1 = 100, B2 = 100, iteration = 1, model = 'logistic',
                cores = 2, criterion = 'ebic', gam = 0.5)

# Select n/logn variables from the selected percentage after one iteration round
RaRank(fit, selected.num = "n/logn", iteration = 1)

## End(Not run)