A demonstration of the RaSEn package


title: "A demonstration of the RaSEn package" author: "Ye Tian and Yang Feng" date: "r Sys.Date()" header-includes: - \usepackage{dsfont} - \usepackage{bm} output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{RaSEn demo} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}


We provide a detailed demo of the usage of the RaSEn package. This package implements the random subspace ensemble classification (RaSE) method (@tian2021rase), the variable screening approach via RaSE (@tian2021variable), and the super RaSE method (@zhu2021super).


Random Subspace Ensemble Classification{#rase}

Introduction{#intro}

Suppose we have training data $\{(\bm{x}_i, y_i)\}_{i=1}^n \subseteq \mathbb{R}^p \times \{0, 1\}$, where each $\bm{x}_i$ is a $1 \times p$ vector.

Based on the training data, the RaSE algorithm aims to generate $B_1$ weak learners $\{C_n^{S_j}\}_{j=1}^{B_1}$, each of which is constructed in a feature subspace $S_j \subseteq \{1, \ldots, p\}$ instead of using all $p$ features. To obtain each weak learner, $B_2$ candidates $\{C_n^{S_{jk}}\}_{k=1}^{B_2}$ are trained in subspaces $\{S_{jk}\}_{k=1}^{B_2}$, respectively. To choose the optimal one among these $B_2$ candidates, some criterion needs to be applied, such as minimizing the ratio information criterion (RIC, @tian2021rase), minimizing the extended Bayesian information criterion (eBIC, @chen2008extended, @chen2012extended), minimizing the training error, minimizing the validation error (if validation data is available), minimizing the cross-validation error, or minimizing the leave-one-out error. Moreover, the type of weak learner can be quite flexible.
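To make the scheme concrete, below is a minimal sketch of the RaSE idea using an LDA base learner and the training error as the selection criterion. The helper functions rase_sketch and rase_predict are illustrative only and are not part of the package; the Rase function used later implements the full set of criteria, base learners and parallelization.

library(MASS)

# Minimal RaSE sketch: for each of the B1 weak learners, draw B2 random
# subspaces of size at most D, fit an LDA classifier in each subspace and
# keep the candidate with the smallest training error.
rase_sketch <- function(x, y, B1 = 10, B2 = 10, D = 5) {
  p <- ncol(x)
  learners <- vector("list", B1)
  for (j in seq_len(B1)) {
    best.err <- Inf
    for (k in seq_len(B2)) {
      S <- sort(sample(p, sample(2:D, 1)))      # random feature subspace S_jk
      fit <- lda(x[, S, drop = FALSE], grouping = y)
      err <- mean(predict(fit, x[, S, drop = FALSE])$class != y)
      if (err < best.err) {
        best.err <- err
        learners[[j]] <- list(S = S, fit = fit)
      }
    }
  }
  learners
}

# Aggregate the B1 selected weak learners by a simple majority vote
rase_predict <- function(learners, xnew) {
  votes <- sapply(learners, function(l)
    as.numeric(as.character(predict(l$fit, xnew[, l$S, drop = FALSE])$class)))
  as.numeric(rowMeans(votes) > 0.5)
}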

To better adapt RaSE to sparse settings, we can update the distribution from which random feature subspaces are drawn according to the selected percentage of each feature in the $B_1$ subspaces of each round. This can be seen as an adaptive strategy to increase the probability of covering the signal features that contribute to the model, which can improve the performance of RaSE classifiers in sparse settings.
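As a toy illustration of this idea (with made-up numbers and a simple proportional re-weighting, not the package's exact update rule described in @tian2021rase), the selected percentages from one round can be turned into a new feature sampling distribution:

# Toy illustration only: re-weight the feature sampling distribution by the
# selected percentage of each feature among the B1 subspaces (values made up)
p <- 10
prob0 <- rep(1 / p, p)                                          # initial uniform distribution
selected.perc <- c(0.60, 0.55, 0.05, 0.05, 0.40, rep(0.05, 5))  # hypothetical selected percentages
prob1 <- selected.perc / sum(selected.perc)                     # updated sampling distribution
round(rbind(prob0, prob1), 3)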

The selected percentage of each of the $p$ features in the $B_1$ subspaces can be used for feature ranking as well, and we can plot the selected percentages to intuitively rank the importance of each feature in a RaSE model.

Installation{#install}

RaSEn can be installed from CRAN.

install.packages("RaSEn", repos = "http://cran.us.r-project.org")

Then we can load the package:

library(RaSEn)

How to Fit a RaSE Classifier for Prediction{#fit}

We will show in this section how to fit RaSE classifiers based on different types of base classifiers. First we generate the data from a binary Gaussian mixture model (model 1 in @tian2021rase) $$ \bm{x} \sim (1-y)N(\bm{\mu}^{(0)}, \Sigma) + yN(\bm{\mu}^{(1)}, \Sigma), $$ where $\bm{\mu}^{(0)}, \bm{\mu}^{(1)}$ are both $1 \times p$ vectors and $\Sigma$ is a $p \times p$ symmetric positive definite matrix. Here $y$ follows a Bernoulli distribution: $$ y \sim \text{Bernoulli}(\pi_1), $$ where $\pi_1 \in (0,1)$ and we denote $\pi_0 = 1-\pi_1$.

Here we follow the setting of @mai2012direct, letting $\Sigma = (0.5^{|i-j|})_{p \times p}$, $\bm{\mu}^{(0)} = \bm{0}_{p \times 1}$, $\bm{\mu}^{(1)} = \Sigma^{-1}\times 0.556(3, 1.5, 0, 0, 2, \bm{0}_{1 \times (p-5)})^T$. Let $n = 100$, $p = 50$. According to the definition of the minimal discriminative set in @tian2021rase, here the minimal discriminative set is $S^* = \{1, 2, 5\}$, i.e., these are the features that contribute to the classification.

Apply the function RaModel to generate training and test data of size 100 with dimension 50.

set.seed(0, kind = "L'Ecuyer-CMRG")
train.data <- RaModel("classification", 1, n = 100, p = 50)
test.data <- RaModel("classification", 1, n = 100, p = 50)
xtrain <- train.data$x
ytrain <- train.data$y
xtest <- test.data$x
ytest <- test.data$y

We can visualize the first two dimensions, i.e., features 1 and 2, as below:

library(ggplot2)
ggplot(data = data.frame(xtrain, y = factor(ytrain)), mapping = aes(x = X1, y = X2, color = y)) + geom_point()

Similarly, we can also visualize features 6 and 7:

ggplot(data = data.frame(xtrain, y = factor(ytrain)), mapping = aes(x = X6, y = X7, color = y)) + geom_point()

It is easy to see that in dimensions 1 and 2 the data from the two classes are more linearly separable than in dimensions 6 and 7. Then we call the Rase function to fit RaSE classifiers with LDA, QDA and logistic regression base classifiers using the criterion of minimizing RIC, and a RaSE classifier with the $k$NN base classifier using the criterion of minimizing the leave-one-out error. To use different types of base classifiers, we set base as "lda", "qda", "knn" and "logistic", respectively. B1 is set to 100 to generate 100 weak learners and B2 is set to 50 to generate 50 subspace candidates for each weak learner. Without using iterations, we set iteration as 0. criterion is set to "ric" for the RaSE classifiers with LDA, QDA and logistic regression base classifiers, while it is "loo" for the RaSE classifier with the $k$NN base classifier. To speed up the computation, we apply parallel computing with 2 cores by setting cores = 2.

fit.lda <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, base = "lda", cores = 2, criterion = "ric")
fit.qda <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, base = "qda", cores = 2, criterion = "ric")
fit.knn <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, base = "knn", cores = 2, criterion = "loo")
fit.logistic <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, base = "logistic", cores = 2, criterion = "ric")

We can print the summarized results of a RaSE model by calling the print function. For instance, we print the RaSE model with the LDA base classifier:

print(fit.lda)

To evaluate the performance of the four different models, we calculate the test error on the test data:

er.lda <- mean(predict(fit.lda, xtest) != ytest)
er.qda <- mean(predict(fit.qda, xtest) != ytest)
er.knn <- mean(predict(fit.knn, xtest) != ytest)
er.logistic <- mean(predict(fit.logistic, xtest) != ytest)
cat("LDA:", er.lda, "QDA:", er.qda, "knn:", er.knn, "logistic:", er.logistic)

The output of the Rase function is an object belonging to S3 class "RaSE", which stores the fitted weak learners, the selected percentage of each feature, and other model information; see the help page of Rase for the full list of components.
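For instance, the components of the fitted RaSE-LDA object can be inspected directly with base R:

# List the components stored in the fitted "RaSE" object
names(fit.lda)
str(fit.lda, max.level = 1)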

How to Use RaSE for Feature Ranking{#fk}

The selected percentages of the features in the $B_1$ subspaces for the four RaSE classifiers are contained in the output, and they can be used for feature ranking. We can plot them by using the RaPlot function:

library(gridExtra)
plot_lda <- RaPlot(fit.lda)
plot_qda <- RaPlot(fit.qda)
plot_knn <- RaPlot(fit.knn)
plot_logistic <- RaPlot(fit.logistic)

grid.arrange(plot_lda, plot_qda, plot_knn, plot_logistic, ncol=2)

From the four figures, it can be noticed that features 1, 2 and 5 obtain high selected percentages among all $p = 50$ features under the LDA, QDA and $k$NN models, implying their importance in the classification model. We can set a positive iteration number to increase the selected percentages of the three signals among the $B_1$ subspaces, which may improve the performance, as sketched below.
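For example, the RaSE-LDA classifier can be refitted with one iteration round (a sketch along the lines of the earlier call; not run above):

# Refit RaSE with the LDA base classifier and one round of the adaptive
# feature-sampling update
fit.lda.it <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 1,
                   base = "lda", cores = 2, criterion = "ric")
mean(predict(fit.lda.it, xtest) != ytest)
RaPlot(fit.lda.it)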

Super RaSE{#super}

In RaSE, we consider only a single type of base classifier (e.g. LDA, QDA, $k$NN, etc.). @zhu2021super extends the idea of RaSE by combining classifiers of different types. For each of the $B_1 B_2$ weak learners, the base classifier type is sampled randomly from a set of given types with corresponding probabilities. For iterative super RaSE, not only is the feature sampling probability updated based on the selected frequencies of the features, but the classifier type sampling probability is updated according to the selected frequencies of the types as well. Note that here the feature sampling probability is updated on the basis of the feature frequencies in the last iteration for the corresponding base classifier type. The user can also fix the classifier type sampling probability; this is controlled by the base.update component of the super parameter. Super RaSE is fitted when the parameter base is a string vector of base classifier types or a numeric probability vector with classifier type names. In the first case, the base classifier type is sampled uniformly, while in the second case, it is sampled according to the provided probabilities.

The following example shows how to fit a super RaSE classifier which mixes $k$NN, LDA and logistic regression.

fit.super <- Rase(xtrain = xtrain, ytrain = ytrain, B1 = 100, B2 = 50,
                  base = c("knn", "lda", "logistic"),
                  super = list(type = "separate", base.update = TRUE),
                  criterion = "cv", cv = 5, iteration = 0, cores = 2)
ypred <- predict(fit.super, xtest)
mean(ypred != ytest)

We can look at the selected percentage of each feature under each base classifier type, as well as the selected percentage of each base classifier type.

fit.super$ranking.feature
fit.super$ranking.base
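As noted above, the base classifier types can instead be sampled according to user-specified probabilities by passing a named probability vector to base; the probabilities below are illustrative only, and setting base.update = FALSE in super would keep this distribution fixed across iterations.

# Sketch: sample base classifier types with user-specified probabilities
# (the probability values below are illustrative, not recommendations)
fit.super2 <- Rase(xtrain = xtrain, ytrain = ytrain, B1 = 100, B2 = 50,
                   base = c(knn = 0.3, lda = 0.4, logistic = 0.3),
                   super = list(type = "separate", base.update = FALSE),
                   criterion = "cv", cv = 5, iteration = 0, cores = 2)
mean(predict(fit.super2, xtest) != ytest)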

Variable Screening via Random Subspace Ensemble{#screening}

In this section, we describe how to apply RaSE for variable screening.

Introduction{#intro2}

We follow the aforementioned notation. Although @tian2021rase only discusses the classification problem, the RaSE framework can be immediately applied to a continuous response with no extra effort. As before, we would like to select $B_1$ subspaces $\{S_j\}_{j=1}^{B_1}$, for each of which $B_2$ candidate subspaces $\{S_{jk}\}_{k=1}^{B_2}$ are generated. A specific criterion is required for subspace selection. The selected percentage of features in $\{S_j\}_{j=1}^{B_1}$ can be used for variable screening (@tian2021variable).

Variable screening via RaSE{#vs}

We will present how to do variable screening through the RaSE framework in this section. First we generate the data from the following model (model 1 in @tian2021rase, model II in @fan2008sure): $$ y = 5x_1 + 5x_2 + 5x_3 - \frac{15}{\sqrt{2}}x_4 + \epsilon, $$ where $\bm{x} = (x_1, \ldots, x_p)^T \sim N(\bm{0}, \Sigma)$, $\Sigma = (\sigma_{ij})_{p \times p}$, $\sigma_{ij} = 0.5^{\mathds{1}(i \neq j)}$, $\epsilon \sim N(0, 1)$, and $\epsilon \perp\!\!\!\perp \bm{x}$. The signal set is $S^* = \{1, 2, 3, 4\}$.

Let $n=100$ and $p=100$. Call function RaModel to generate the data.

train.data <- RaModel("screening", 1, n = 100, p = 100)
xtrain <- train.data$x
ytrain <- train.data$y

As @tian2021rase describes, here $x_4$ is marginally independent of $y$. We first apply RaSE equipped with the linear regression model and BIC by calling the function RaScreen. Set B1 = B2 = 100, model = "lm" and criterion = "bic". To demonstrate the power of iterations, we set iteration = 1.

fit.bic <- RaScreen(xtrain, ytrain, B1 = 100, B2 = 100, model = "lm", criterion = "bic", cores = 2, iteration = 1)

The output of RaScreen is a list including the selected percentage of each variable, the model we use and other information. Note that the selected percentages of variables are stored in a list (when iteration = 0, it is a vector) called "selected.perc". The results from all iteration rounds are available. The function RaRank provides a convenient and automatic way to select variables from the output of RaScreen. We set selected.num = "n/logn" to select $[n/\log n] = 21$ variables. Let's compare the results from vanilla RaSE (no iteration) and RaSE with 1 iteration round as follows.

RaRank(fit.bic, selected.num = "n/logn", iteration = 0)
RaRank(fit.bic, selected.num = "n/logn", iteration = 1)

Observe that RaSE with the linear regression model and BIC captures the signals very well. Iteration indeed improves vanilla RaSE by ranking all four signals at the top.

Note that there could be some variables with zero selected percentage, especially in iterative RaSE. In this case, such variables are indistinguishable from each other. When the user requests more variables than the number of variables with positive selected percentages, the RaRank function will randomly sample from the variables with zero selected percentage and issue a warning message, as below.

RaRank(fit.bic, selected.num = "n-1", iteration = 1)

Reference


