Fuzzy forests is an extension of random forests designed to yield less biased
variable importance rankings when there is high correlation among the variables.
For further information about fuzzy forests see the paper at the following link:
https://github.com/daniel-conn17/FuzzyForestPaper.
In this vignette we introduce the basic capabilities of the fuzzyforest
package.
We demonstrate two methods of fitting fuzzy forests. The first method allows
the user to pre-specify how features should be grouped prior to application
of fuzzy forests. This method uses the ff
function.
fuzzyforest
also allows for easy integration with Weighted Gene Co-expression
Network Analysis (WGCNA) via the function
wff
. Although WGCNA was motivated by problems in genetics, WGCNA can be
viewed as a more general framework for network analysis and clustering of
features. At its core, WGCNA uses information derived from the correlation
matrix to separate the features into distinct "modules".
As a result, we believe it is appropriate to apply WGCNA to datasets
in contexts aside from genetics.
When wff
is called WGCNA is used to partition the covariates into
distinct modules such that the clusters are roughly uncorrelated with
one another. Fuzzy forests is then applied using this partition.
In this vignette we analyze a data set concerning fetal heart rate and
uterine contraction from cardiotocograms. For the purpose of this vignette,
we utilize a randomly chosen subsample of the full data set.
The full data set can be obtained from the UCI machine learning repository.
The data set contains 100 observations and 21 features.
The outcome is a categorical outcome representing the state of the
fetus. It takes on 3 different values (N=normal; S=suspect; P=pathologic).
In order to use WGCNA with fuzzyforest
, packages from Bioconductor must be installed.
require(knitr) opts_chunk$set(eval = FALSE)
To install these packages, type the following command into the R console:
setRepositories(ind=1:2) install.packages("WGCNA") source("http://bioconductor.org/biocLite.R") biocLite("AnnotationDbi", type="source") biocLite("GO.db")
Further information about the installation requirements for WGCNA can be found at the following link: http://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/#manualInstall.
In general, the packages WGCNA
and randomForest
must be attached to take full advantage of fuzzyforest
's functionality.
library(WGCNA) library(randomForest) library(fuzzyforest)
To use the function ff
, we first need to obtain a partitioning of the
features into distinct clusters. In this vignette, we use WGCNA to partition
the features.
#set seed so that results are reproducible set.seed(1) #extract features and covariates from ctg data set X <- ctg[, 2:22] NSP <- ctg[, 1] net = blockwiseModules(X, power = 6, minModuleSize = 1)
We then extract the module membership of each feature.
module_membership <- net$colors
We then set up values for various tuning parameters.
Fuzzy forests first screens out unimportant features from each module
via recursive feature elimination. Then it selects the top $k$ features
where $k$ is prespecified by the user. screening_params
contains tuning
parameters pertaining to the elimination of features within modules.
select_params
contains tuning parameters pertaining to the elimination
of features surviving this initial screening step.
Note that because there are only 21 covariates minModuleSize
must be set to 1
(by default it is 30).
net = blockwiseModules(X, power = 6, minModuleSize = 1, nThreads = 1) module_membership <- net$colors mtry_factor <- 1; min_ntree <- 500; drop_fraction <- .5; ntree_factor <- 1 nodesize <- 1; final_ntree <- 500 screen_params <- screen_control(drop_fraction = drop_fraction, keep_fraction = .25, min_ntree = min_ntree, ntree_factor = ntree_factor, mtry_factor = mtry_factor) select_params <- select_control(drop_fraction = drop_fraction, number_selected = 5, min_ntree = min_ntree, ntree_factor = ntree_factor, mtry_factor = mtry_factor)
mtry_factor
: Fuzzy forests uses random forest recursive feature elimination
to eliminate unimportant features. For each of these random forests, a value of
mtry
must be selected. Letting $p'$, be the number of covariates in the
current random forest, mtry
is approximately mtry_factor
$\sqrt p'$ (more
precisely, mtry
=min(ceiling(mtry_factor
$\sqrt p'$, $p'$))). Similarly, for
classification, mtry
is approximately mtry_factor
$(p'/3)$. Higher values of
mtry
generally allow the algorithm to zone in on the most important features
at the risk of overfitting. Selecting a lower value of mtry
reduces the
chances of overfitting, but increases the chances that an important feature is
overlooked. ntree_factor
and min_ntree
: For each of these random forests, the number
of trees also depends on the current number of features $p'$. If the number of
covariates is large, more trees must be grown. The number of trees grown for
each random forest is approximately max(min_ntree
, ntree_factor
$*p'$). In
general, growing more trees is better (no risk of overfitting), however, if too
many trees are grown fuzzy forests will take a very long to run.drop_fraction
: After each random forest, the features with the lowest
variable importance ranking are dropped. The number of features dropped
at each such step, is equal to ceiling(drop_fraction
$p$). Lower values
of
drop_fraction` lead to more aggressive model selection and higher running
time.keep_fraction
: keep_fraction
is the percentage
of features from each module that are retained during the screening step
of fuzzy forests. number_selected
is the final number of features selected by fuzzy forests.final_ntree
: After the important features have been selected,
a final random forest is fit using these selected features. final_ntree
is the number of trees grown in this random forest.nodesize
: This parameter controls the depth of the regression trees.nodesize
controls the minimum size of terminal nodes.nodesize
is set to 1. However, it is often advantageous to set
nodesize larger. If the number of observations is large, nodesize
often must
be set larger than 1 for the sake computational speed or to prevent memory
issues.Fuzzy forests is then fit using the function ff
:
ff_fit <- ff(X, NSP, module_membership = module_membership, screen_params = screen_params, select_params=select_params, final_ntree = 500)
Likewise, fuzzy forests may also be fit via the wff
function.
Ideally, tuning parameters for WGCNA should be selected with care.
Ideally, the resulting modules should be scientifically meaningful.
For convenience and to make it easier to get started using fuzzy forests,
wff
automatically carries out WGCNA. Parameters for WGCNA are input through
the object WGCNA_params
WGCNA_params <- WGCNA_control(p = 6, minModuleSize = 1, nThreads = 1) wff_fit <- wff(X, NSP, WGCNA_params = WGCNA_params, screen_params = screen_params, select_params = select_params, final_ntree = final_ntree, num_processors = 1, nodesize = nodesize)
ff_fit <- example_ff
wff
and ff
both return objects of type fuzzy_forest
, a list containing
the results of fuzzy forests. A list of the top $k$ features is returned in
a data.frame
via the following call:
rankings <- ff_fit$feature_list rankings
The highest ranked feature is ASTV or "percentage of time with abnormal long term variability." Followed by "mean value of short term variability", "histogram mean", "mean value of long term variability", and "minimum of FHR histogram."
After features are selected, a random forest is fit using these selected features. This random forest is accessed by the following command and can be used to obtain predictions for new data. Note that the mse reported below is overly optimistic. For classification, the reported error rates will also be overly optimistic. The recursive feature elimination biases the usual out of bag error rate.
final_rf <- ff_fit$final_rf final_rf_mse <- tail(final_rf$mse, 1)
cat(" warning!", "\n", "biased estimate of the mse:", final_rf_mse)
The function modplot
may be applied to objects of type fuzzy_forest
to
obtain a graph depicting which modules are over-represented in the list of the
most important features. In this case, the turquoise module appears to be slightly
over-represented.
modplot(ff_fit)
or
modplot(wff_fit)
The variable importances of the selected features can be graphically
displayed by using the function varImpPlot
from the package randomForest
.
varImpPlot(final_rf)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.