The R package iRF implements iterative Random Forests, a method for iteratively growing ensembles of weighted decision trees and detecting high-order feature interactions by analyzing feature usage on decision paths.
It uses source code from the R package randomForest by Andy Liaw and Matthew Wiener, the original Fortran code by Leo Breiman and Adele Cutler, and source code from the R package FSInteract by Hyun Jik Kim and Rajen D. Shah.
iRF can be loaded using the `library` command:

```r
library(iRF)
```
iRF adds two new features to Breiman's original random forest workflow: weighted sampling of candidate splitting variables, and detection of high-order feature interactions on decision paths (described below).

Unlike Breiman's original random forest, which uses uniform random sampling to select \texttt{mtry} variables at each node split, the `randomForest` function in `iRF` allows non-uniform sampling using a given vector of nonnegative weights (e.g., feature importances from a previous model fit). In particular, given a vector of weights $\mathbf{w} = (w_1, \ldots, w_p)$, the \texttt{mtry} variables are selected so that $P(\text{variable } j \text{ is selected}) = w_j / \sum_{k=1}^p w_k$, for $j = 1, \ldots, p$.
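As a toy illustration of this selection scheme (not the package's internal implementation, which lives in the compiled forest code), candidate variables can be drawn with probabilities proportional to a hypothetical weight vector:

```r
# minimal sketch of weighted candidate-variable selection; the weights w are
# hypothetical, and the real sampling happens inside the compiled forest code
set.seed(1)
p.toy <- 10
w <- c(5, 3, 2, rep(0.1, p.toy - 3))   # e.g., importances from a previous fit
mtry <- 3
# each variable's chance of being considered is proportional to its weight
sample(seq_len(p.toy), size=mtry, replace=FALSE, prob=w / sum(w))
```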
Based on this weighting scheme, we can iteratively grow weighted random forests, where Gini importances from the previous random forest fit are used as weights in the current iteration.
```r
# set seed for random number generation
set.seed(47)
```
We simulate some data for a binary classification exercise. Out of $250$ features, only three, $\{1, 2, 3\}$, are important.
```r
# simulate data for classification
n <- 500
p <- 250
X <- matrix(rnorm(n * p), nrow=n)
Y <- as.numeric(X[,1] > 0 & X[,2] > 0 & X[,3] > 0)
Y <- as.factor(Y)
train.id <- 1:(n / 2)
test.id <- setdiff(1:n, train.id)
```
Next, we iteratively grow weighted random forests, using feature importances from the previous iteration as weights.
```r
sel.prob <- rep(1/p, p)

# iteratively grow RF, use Gini importance of features as weights
rf <- list()
for (iter in 1:4){
  rf[[iter]] <- randomForest(x=X[train.id,], y=Y[train.id],
                             xtest=X[test.id,], ytest=Y[test.id],
                             mtry.select.prob=sel.prob)
  # update selection probabilities for next iteration
  sel.prob <- rf[[iter]]$importance
}
```
We can measure the performance of each RF iteration on the test set:
```r
library(AUC)
plot(0:1, 0:1, type='l', lty=2, xlab='FPR', ylab='TPR', main='ROC Curve')
for (iter in 1:4){
  # performance on test set
  cat(paste('iter = ', iter, ':: '))
  roc.info <- roc(rf[[iter]]$test$votes[,2], Y[test.id])
  lines(roc.info$fpr, roc.info$tpr, type='l', col=iter, lwd=2)
  cat(paste('AUROC: ', round(100*auc(roc.info), 2), '%\n', sep=''))
}
legend('bottomright', legend=paste('iter:', 1:iter), col=1:iter, lwd=2, bty='n')
```
The outputs of `iRF::randomForest` are objects of class `randomForest` and can be used directly with other functions in the R package randomForest, e.g., to visualize variable importance measures:
```r
par(mfrow=c(1,4))
for (iter in 1:4)
  varImpPlot(rf[[iter]], n.var=10,
             main=paste('Variable Importance (iter:', iter, ')'))
```
For regression, the usage is similar:
```r
# change to a continuous response
y <- as.numeric(Y) + rnorm(n, sd=0.25)

# iteratively grow RF, use Gini importance of features as weights
rf <- list()
sel.prob <- rep(1/p, p)
for (iter in 1:4){
  cat(paste('iter = ', iter, ':: '))
  rf[[iter]] <- randomForest(x=X[train.id,], y=y[train.id],
                             xtest=X[test.id,], ytest=y[test.id],
                             mtry.select.prob=sel.prob)
  # update selection probabilities for next iteration
  sel.prob <- rf[[iter]]$importance / sum(rf[[iter]]$importance)

  # proportion of variance explained on the test set
  ytest <- y[test.id]
  var.explained <- 1 - mean((rf[[iter]]$test$predicted - ytest) ^ 2) / var(ytest)
  cat(paste('% var explained: ', round(100 * var.explained, 2), '%\n', sep=''))
}
```
The weighted random forests can be grown in parallel across multiple cores, using the doMC package:
```r
# set up cores for parallel implementation
library(doMC)
registerDoMC()
n.cores <- detectCores()
options(cores=n.cores)

n.tree.per.core <- 30
rfpar <- foreach(n.tree=rep(n.tree.per.core, n.cores),
                 .combine=combine, .multicombine=TRUE) %dopar% {
  randomForest(x=X[train.id,], y=Y[train.id], ntree=n.tree)
}
```
iRF detects high-order interactions among features by analyzing feature usage on the decision paths of a random forest. In particular, given a (weighted) random forest fit, iRF (i) passes the training data through the fitted forest and records the features used on the associated decision paths; (ii) applies a weighted version of the random intersection trees (RIT) algorithm proposed by Shah and Meinshausen (2014) to find high-order feature combinations prevalent on the decision paths, where weights are determined by user-specified features; and (iii) performs the above two steps on many bootstrap replicates of the training set to assess the stability of the features and their interactions.
Consider a classification example as before. We iteratively grow weighted random forests, but now save the forest components and in-bag samples for further processing.
```r
sel.prob <- rep(1/p, p)

# iteratively grow RF, use Gini importance of features as weights
rf <- list()
for (iter in 1:4){
  rf[[iter]] <- randomForest(x=X[train.id,], y=Y[train.id],
                             xtest=X[test.id,], ytest=Y[test.id],
                             mtry.select.prob=sel.prob,
                             keep.forest=TRUE, keep.inbag=TRUE)
  # update selection probabilities for next iteration
  sel.prob <- rf[[iter]]$importance / sum(rf[[iter]]$importance)
}
```
To read feature usage on nodes, use the function `readForest`. This function can be run in parallel with the argument `n.core`, provided doMC has been set up for parallel processing.
```r
rforest <- readForest(rand.forest=rf[[3]], x=X[train.id,])
```
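For larger forests, and assuming doMC has been registered as in the parallel example above (so that `n.cores` is defined), the same call can be distributed across cores via `n.core`:

```r
# same call distributed across cores; assumes doMC was registered earlier and
# n.cores holds the number of available cores
rforest <- readForest(rand.forest=rf[[3]], x=X[train.id,], n.core=n.cores)
```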
The `tree.info` data frame provides metadata for each leaf node.
```r
head(rforest$tree.info, n=10)
```
The `readForest` function optionally computes a sparse matrix `node.feature` that encodes the splitting features and the direction of inequality along the decision path of each leaf node. This encoding can be represented as a binary vector of length $2p$, where an entry $j \in \{1, \dots, p\}$ indicates that a decision path contains the left child of a node that splits on feature $j$, and an entry $j \in \{p+1, \dots, 2p\}$ indicates that a decision path contains the right child of a node that splits on feature $j - p$.
```r
rforest$node.feature[1:10, c(1:5, p + 1:5)]
```
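To make the encoding concrete, a row of `node.feature` can be translated back into the split directions it represents. The helper below is purely illustrative (`decodePath` is not part of the package) and relies only on the column convention described above:

```r
# illustrative helper (not part of iRF): decode one leaf node's decision path
# from the 2p encoding, where columns 1..p mark the low ('-') side of a split
# on feature j and columns p+1..2p mark the high ('+') side
decodePath <- function(node.feature, leaf, p) {
  active <- which(node.feature[leaf, ] != 0)
  low  <- active[active <= p]        # path went below the split threshold
  high <- active[active > p] - p     # path went above the split threshold
  c(paste0('X', low, '-'), paste0('X', high, '+'))
}
decodePath(rforest$node.feature, leaf=1, p=p)
```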
To find prevalent sets of features and their high-order combinations used to define these nodes, use the random intersection trees (RIT) function. The following command runs RIT on all class-1 leaf nodes, sampling each node with probability proportional to the number of observations it contains:
```r
class1.nodes <- rforest$tree.info$prediction == 1
wt <- rforest$tree.info$size.node[class1.nodes]
RIT(rforest$node.feature[class1.nodes,], weights=wt,
    depth=5, branch=2, n_trees=100)
```
The function `iRF` combines all of the above steps and uses bootstrap aggregation to assess the stability of the selected interactions. Setting `select.iter = TRUE` chooses the optimal iteration by maximizing prediction accuracy on out-of-bag samples and searches for interactions in the corresponding random forest.
```r
fit <- iRF(x=X[train.id,], y=Y[train.id],
           n.iter=5,
           n.core=n.cores,
           n.bootstrap=10,
           int.return=5)
```
The output of `iRF` is a list containing up to four entries. If `select.iter = FALSE`, each entry is instead a list with one element per iteration.

(1) `rf.list`: a `randomForest` object.

(2) `iter.select`: the selected iteration.

(3) `interaction`: a data table of interactions and associated importance metrics. For a more detailed description of these importance metrics, see our paper: https://arxiv.org/abs/1810.07287.
+ `int` indicates the interaction, with $+$ and $-$ describing whether the interaction is characterized by high or low levels of the feature. For instance, the interaction `X1+_X2+_X3+` corresponds to decision rules of the form $1(x_1 > \cdot \;\&\; x_2 > \cdot \;\&\; x_3 > \cdot)$.
+ `prevalence` indicates the proportion of class-1 leaf nodes that contain the interaction, weighted by the size of the leaf nodes.
+ `precision` indicates the proportion of class-1 observations that fall in leaf nodes containing the interaction.
+ `cpe` indicates the class prevalence enrichment, defined as the difference in prevalence between class-1 and class-0 leaf nodes. `sta.cpe` indicates the proportion of times this value is greater than 0 across bootstrap replicates.
+ `fsd` indicates the feature selection dependence, defined as the difference between the prevalence of an interaction and its expected prevalence if the interacting features were selected independently of one another. `sta.fsd` indicates the proportion of times this value is greater than 0 across bootstrap replicates.
+ `mip` indicates the mean increase in precision, defined as the difference between the precision of an interaction and the average precision of its lower-order subsets. `sta.mip` indicates the proportion of times this value is greater than 0 across bootstrap replicates.
+ `stability` indicates the proportion of times the interaction was recovered by RIT across bootstrap replicates. A short sketch that filters interactions by these metrics follows the list of output entries below.
```r
head(fit$interaction)
```
(4) `weights`: feature weights used to fit the selected random forest.
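As a small usage sketch, the interaction table can be filtered and ranked with base R using the metrics described above; the 0.5 cutoffs below are arbitrary values chosen only for illustration.

```r
# keep interactions recovered in at least half of the bootstrap replicates and
# whose prevalence enrichment was positive in at least half of them, then rank
# by cpe; the 0.5 thresholds are arbitrary illustration values
ints <- as.data.frame(fit$interaction)
stable.ints <- ints[ints$stability >= 0.5 & ints$sta.cpe >= 0.5, ]
stable.ints[order(-stable.ints$cpe), ]
```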
The selected interactions can be visualized using simple R functions like `dotchart`:
```r
toplot <- fit$interaction$cpe
names(toplot) <- fit$interaction$int
dotchart(rev(toplot[1:min(20, length(toplot))]),
         xlab='Prevalence enrichment', xlim=c(0, 1),
         main='Prevalent Features/Interactions \n on Decision paths')
```
Interactions can be visualized using response surfaces, which depict the expected response as a function of the interacting features. We use the output of `readForest` to generate response surfaces as follows:
```r
library(rgl)
rforest <- readForest(fit$rf.list, x=X[train.id,])
plotInt(x=X[test.id,], y=Y[test.id], int='X1+_X2+_X3+',
        read.forest=rforest, xlab='X1', ylab='X2')
rglwidget()
```
The response surface on the left is plotted for observations whose X3 value falls below the median random forest splitting threshold, while the surface on the right is plotted for observations whose X3 value falls above it. `plotInt` supports interactions up to order 4.