---
title: "A demonstration of the RaSEn package"
author: "Ye Tian and Yang Feng"
date: "2020-05-27"
output: rmarkdown::html_vignette
bibliography: reference.bib
vignette: >
  %\VignetteIndexEntry{RaSEn demo}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
We provide a detailed demo of the usage of the `RaSEn` package.
Suppose we have training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, where each $\mathbf{x}_i \in \mathbb{R}^p$ is a $1 \times p$ vector and $y_i \in \{0, 1\}$.
Based on the training data, the RaSE algorithm aims to generate $B_1$ weak learners $\{C_n^{S_j}\}_{j=1}^{B_1}$, each of which is constructed in a feature subspace $S_j \subseteq \{1, ..., p\}$ instead of using all $p$ features. To obtain each weak learner, $B_2$ candidates $\{C_n^{S_{jk}}\}_{k=1}^{B_2}$ are trained in subspaces $\{S_{jk}\}_{k=1}^{B_2}$, respectively. To choose the optimal one among these $B_2$ candidates, a criterion needs to be applied, such as minimizing the ratio information criterion (RIC, @ye2020rase), minimizing the extended Bayesian information criterion (eBIC, @chen2008extended, @chen2012extended), minimizing the training error, minimizing the validation error (if validation data is available), minimizing the cross-validation error, or minimizing the leave-one-out error. The type of weak learner can be quite flexible.
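To make the ensemble step concrete, here is a minimal sketch of how the $B_1$ weak learners' votes can be aggregated (an illustrative sketch only, not the package implementation; the vote matrix and the threshold name alpha are assumptions for this example):
# Illustrative sketch of the ensemble step (not the package implementation):
# each column of `votes` holds the 0/1 predictions of one weak learner C_n^{S_j};
# the ensemble predicts class 1 when the average vote exceeds a threshold alpha
# (in the package, this threshold can be chosen empirically via the cutoff argument).
rase_vote <- function(votes, alpha = 0.5) {
    as.numeric(rowMeans(votes) > alpha)
}
# Toy usage: 5 observations, B1 = 3 weak learners
rase_vote(cbind(c(1, 0, 1, 1, 0), c(1, 0, 0, 1, 0), c(1, 1, 1, 0, 0)))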
To better adapt RaSE to sparse settings, we can update the distribution from which the random feature subspaces are drawn according to the frequencies of features in the $B_1$ subspaces in each round. This can be seen as an adaptive strategy that increases the probability of covering the signal features that contribute to the model, which can improve the performance of RaSE classifiers in sparse settings.
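As a rough illustration of this iterative step (a sketch under assumptions; the mixing constant C0 and the exact update form are inventions for this example, not the package's rule), the feature-sampling weights could be updated in proportion to the observed selection frequencies:
# Sketch of updating the feature-sampling distribution after one round
# (illustrative only; C0 and the mixing form are assumptions).
update_feature_weights <- function(freq, C0 = 0.1) {
    p <- length(freq)
    w <- C0 / p + freq / sum(freq)  # keep every feature reachable, favor frequent ones
    w / sum(w)                      # normalize to a probability vector
}
# Toy usage with p = 10 features
update_feature_weights(freq = c(30, 25, 1, 0, 28, rep(1, 5)))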
The frequencies of the $p$ features in the $B_1$ subspaces can also be used for feature ranking, and we can plot them to intuitively rank the importance of each feature in a RaSE model.
`RaSEn` can be installed from CRAN.
install.packages("RaSEn", repos = "http://cran.us.r-project.org")
Then we can load the package:
library(RaSEn)
We will show in this section how to fit RaSE classifiers based on different types of base classifiers. First we generate the data from a binary Gaussian mixture model (Model 1 in @ye2020rase) $$ \mathbf{x} \sim (1-y)N(\boldsymbol{\mu}^{(0)}, \Sigma) + yN(\boldsymbol{\mu}^{(1)}, \Sigma), $$ where $\boldsymbol{\mu}^{(0)}, \boldsymbol{\mu}^{(1)}$ are both $1 \times p$ vectors and $\Sigma$ is a $p \times p$ symmetric positive definite matrix. Here $y$ follows a Bernoulli distribution: $$ y \sim \text{Bernoulli}(\pi_1), $$ where $\pi_1 \in (0,1)$ and we denote $\pi_0 = 1-\pi_1$.
Here we follow the setting of @mai2012direct, letting $\Sigma = (0.5^{|i-j|})_{p \times p}$, $\boldsymbol{\mu}^{(0)} = \mathbf{0}_{p \times 1}$, $\boldsymbol{\mu}^{(1)} = \Sigma^{-1} \times 0.556(3, 1.5, 0, 0, 2, \mathbf{0}_{1 \times (p-5)})^T$. Let $n = 100$, $p = 50$. According to the definition of the minimal discriminative set in @ye2020rase, here the minimal discriminative set is $S^* = \{1, 2, 5\}$, which contains the features that contribute to the classification.
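For illustration, one could simulate Model 1 directly as sketched below (a hedged sketch only: it assumes MASS::mvrnorm and sets $\pi_1 = 0.5$, which is not fixed above; the package function `RaModel` used next is the recommended way to generate this data).
library(MASS)  # for mvrnorm; this sketch assumes pi1 = 0.5
simulate_model1 <- function(n, p, pi1 = 0.5) {
    Sigma <- 0.5^abs(outer(1:p, 1:p, "-"))  # Sigma_ij = 0.5^|i-j|
    mu0 <- rep(0, p)                        # class-0 mean
    mu1 <- 0.556 * solve(Sigma, c(3, 1.5, 0, 0, 2, rep(0, p - 5)))  # class-1 mean
    y <- rbinom(n, 1, pi1)
    x <- t(vapply(y, function(yi) mvrnorm(1, if (yi == 1) mu1 else mu0, Sigma),
        numeric(p)))
    list(x = x, y = y)
}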
We apply the function `RaModel` to generate training and test data, each of size 100 and dimension 50.
set.seed(0, kind = "L'Ecuyer-CMRG")
train.data <- RaModel(1, n = 100, p = 50)
test.data <- RaModel(1, n = 100, p = 50)
xtrain <- train.data$x
ytrain <- train.data$y
xtest <- test.data$x
ytest <- test.data$y
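We can quickly check the dimensions and the class balance of the generated data:
dim(xtrain)    # 100 x 50
table(ytrain)  # counts of the two classes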
We can visualize the first two features (features 1 and 2) as follows:
library(ggplot2)
ggplot(data = data.frame(xtrain, y = factor(ytrain)), mapping = aes(x = X1,
y = X2, color = y)) + geom_point()
Similarly, we can also visualize features 6 and 7:
ggplot(data = data.frame(xtrain, y = factor(ytrain)), mapping = aes(x = X6, y = X7,
color = y)) + geom_point()
It is obvious that the data from the two classes are more linearly separable in features 1 and 2 than in features 6 and 7. Then we call the `Rase` function to fit RaSE classifiers with LDA, QDA and logistic regression base classifiers under the criterion of minimizing RIC, and a RaSE classifier with the kNN base classifier under the criterion of minimizing the leave-one-out error. To use different types of base classifiers, we set `base` as "lda", "qda", "knn" and "logistic", respectively. `B1` is set to 100 to generate 100 weak learners, and `B2` is set to 50 to generate 50 subspace candidates for each weak learner. Without using iterations, we set `iteration` as 0. `cutoff` is set to TRUE to apply the empirically optimal threshold for ensemble classification. `criterion` is set to "ric" for the RaSE classifiers with LDA, QDA and logistic regression base classifiers, and to "loo" for the RaSE classifier with the kNN base classifier. Since we want to do feature ranking, we set `ranking` as TRUE to obtain the frequencies of features in the $B_1$ subspaces. To speed up the computation, we apply parallel computing with 2 cores by setting `cores = 2`.
fit.lda <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, cutoff = TRUE,
subset.size.max = NULL, base = "lda", cores = 2, criterion = "ric",
ranking = TRUE)
fit.qda <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, cutoff = TRUE,
subset.size.max = NULL, base = "qda", cores = 2, criterion = "ric",
ranking = TRUE)
fit.knn <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, cutoff = TRUE,
subset.size.max = NULL, base = "knn", cores = 2, criterion = "loo",
ranking = TRUE)
fit.logistic <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0,
cutoff = TRUE, subset.size.max = NULL, base = "logistic", cores = 2,
criterion = "ric", ranking = TRUE)
To evaluate the performance of the four models, we calculate their test errors on the test data:
er.lda <- mean(predict(fit.lda, xtest) != ytest)
er.qda <- mean(predict(fit.qda, xtest) != ytest)
er.knn <- mean(predict(fit.knn, xtest) != ytest)
er.logistic <- mean(predict(fit.logistic, xtest) != ytest)
cat("LDA:", er.lda, "QDA:", er.qda, "knn:", er.knn, "logistic:", er.logistic)
## LDA: 0.03 QDA: 0.07 knn: 0.05 logistic: 0.05
The output of the `Rase` function is an object of S3 class "RaSE", which contains:
- `marginal`: the marginal probability for each class.
- `fit.list`: a list of `B1` fitted base classifiers.
- `B1`: the number of weak learners.
- `B2`: the number of subspace candidates generated for each weak learner.
- `base`: the type of base classifier.
- `cutoff`: the empirically optimal threshold.
- `subspace`: a list of subspaces corresponding to the `B1` weak learners.
- `ranking`: the frequency of each feature in the `B1` subspaces.
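For example, we can inspect a few of these components for the RaSE-LDA fit (component names as listed above, assuming `subspace` is a list as described):
fit.lda$cutoff          # empirically optimal threshold used by the ensemble
fit.lda$subspace[[1]]   # feature subspace of the first weak learner
head(fit.lda$ranking)   # selection frequencies of the first few features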
The frequencies of features in the $B_1$ subspaces for the four RaSE classifiers are contained in the output and can be used for feature ranking. We can plot them using the `ggplot` function:
library(gridExtra)
plot_lda <- ggplot(data = data.frame(frequency = fit.lda$ranking, feature = 1:50),
mapping = aes(y = frequency, x = feature)) + geom_point() + ggtitle(expression("RaSE-LDA")) +
theme(plot.title = element_text(hjust = 0.5))
plot_qda <- ggplot(data = data.frame(frequency = fit.qda$ranking, feature = 1:50),
mapping = aes(y = frequency, x = feature)) + geom_point() + ggtitle(expression("RaSE-QDA")) +
theme(plot.title = element_text(hjust = 0.5))
plot_knn <- ggplot(data = data.frame(frequency = fit.knn$ranking, feature = 1:50),
mapping = aes(y = frequency, x = feature)) + geom_point() + ggtitle(expression("RaSE-kNN")) +
theme(plot.title = element_text(hjust = 0.5))
plot_logistic <- ggplot(data = data.frame(frequency = fit.logistic$ranking,
feature = 1:50), mapping = aes(y = frequency, x = feature)) + geom_point() +
ggtitle(expression("RaSE-logistic")) + theme(plot.title = element_text(hjust = 0.5))
grid.arrange(plot_lda, plot_qda, plot_knn, plot_logistic, ncol = 2)
From the four figures, it can be noticed that features 1, 2 and 5 obtain the highest frequencies among all $p = 50$ features when the base classifier is LDA, QDA or $k$NN, implying their importance in the classification model. We can set a positive `iteration` number to increase the frequencies with which these features appear in the $B_1$ subspaces, which may further improve the performance.
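For instance, a RaSE-LDA classifier with one round of iteration could be fitted as follows (a sketch using the same arguments as above; not run here):
# Same settings as before, but with one round of the iterative subspace-distribution update
fit.lda.it1 <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 1, cutoff = TRUE,
    subset.size.max = NULL, base = "lda", cores = 2, criterion = "ric",
    ranking = TRUE)
er.lda.it1 <- mean(predict(fit.lda.it1, xtest) != ytest)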