In OHDSI/FuzzyForest: Fuzzy Forests

Fuzzy forests is an extension of random forests designed to yield less biased variable importance rankings when there is high correlation among the variables. For further information about fuzzy forests see the paper at the following link: https://github.com/daniel-conn17/FuzzyForestPaper. In this vignette we introduce the basic capabilities of the fuzzyforest package. We demonstrate two methods of fitting fuzzy forests. The first method allows the user to pre-specify how features should be grouped prior to application of fuzzy forests. This method uses the ff function.

fuzzyforest also allows for easy integration with Weighted Gene Co-expression Network Analysis (WGCNA) via the function wff. Although WGCNA was motivated by problems in genetics, WGCNA can be viewed as a more general framework for network analysis and clustering of features. At its core, WGCNA uses information derived from the correlation matrix to separate the features into distinct "modules". As a result, we believe it is appropriate to apply WGCNA to datasets in contexts aside from genetics. When wff is called WGCNA is used to partition the covariates into distinct modules such that the clusters are roughly uncorrelated with one another. Fuzzy forests is then applied using this partition.

In this vignette we analyze a data set concerning fetal heart rate and uterine contraction from cardiotocograms. For the purpose of this vignette, we utilize a randomly chosen subsample of the full data set.
The full data set can be obtained from the UCI machine learning repository.
The data set contains 100 observations and 21 features.
The outcome is a categorical outcome representing the state of the fetus. It takes on 3 different values (N=normal; S=suspect; P=pathologic).

Installing WGCNA

In order to use WGCNA with fuzzyforest, packages from Bioconductor must be installed.

require(knitr)
opts_chunk$set(eval = FALSE)

To install these packages, type the following command into the R console:

setRepositories(ind=1:2)  
install.packages("WGCNA")
source("http://bioconductor.org/biocLite.R")
biocLite("AnnotationDbi", type="source")
biocLite("GO.db")

Further information about the installation requirements for WGCNA can be found at the following link: http://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/#manualInstall.

Attaching Required Packages

In general, the packages WGCNA and randomForest must be attached to take full advantage of fuzzyforest's functionality.

library(WGCNA)
library(randomForest)
library(fuzzyforest)

Fuzzy Forests

To use the function ff, we first need to obtain a partitioning of the features into distinct clusters. In this vignette, we use WGCNA to partition the features.

#set seed so that results are reproducible
set.seed(1)

#extract features and covariates from ctg data set
X <- ctg[, 2:22]
NSP <- ctg[, 1]
net = blockwiseModules(X, power = 6, minModuleSize = 1)

We then extract the module membership of each feature.

module_membership <- net$colors

We then set up values for various tuning parameters. Fuzzy forests first screens out unimportant features from each module via recursive feature elimination. Then it selects the top $k$ features where $k$ is prespecified by the user. screening_params contains tuning parameters pertaining to the elimination of features within modules. select_params contains tuning parameters pertaining to the elimination of features surviving this initial screening step.

Note that because there are only 21 covariates minModuleSize must be set to 1 (by default it is 30).

net = blockwiseModules(X, power = 6, minModuleSize = 1, nThreads = 1)
module_membership <- net$colors
mtry_factor <- 1; min_ntree <- 500;  drop_fraction <- .5; ntree_factor <- 1
nodesize <- 1; final_ntree <- 500
screen_params <- screen_control(drop_fraction = drop_fraction,
                                 keep_fraction = .25, min_ntree = min_ntree,
                                ntree_factor = ntree_factor,
                                mtry_factor = mtry_factor)
select_params <- select_control(drop_fraction = drop_fraction,
                                number_selected = 5,
                                min_ntree = min_ntree,
                                ntree_factor = ntree_factor,
                                mtry_factor = mtry_factor)

Tips for Setting Tuning Parameters

mtry_factor: Fuzzy forests uses random forest recursive feature elimination to eliminate unimportant features. For each of these random forests, a value of mtry must be selected. Letting $p'$, be the number of covariates in the current random forest, mtry is approximately mtry_factor$\sqrt p'$ (more precisely, mtry=min(ceiling(mtry_factor$\sqrt p'$, $p'$))). Similarly, for classification, mtry is approximately mtry_factor$(p'/3)$. Higher values of mtry generally allow the algorithm to zone in on the most important features at the risk of overfitting. Selecting a lower value of mtry reduces the chances of overfitting, but increases the chances that an important feature is overlooked.
ntree_factor and min_ntree: For each of these random forests, the number of trees also depends on the current number of features $p'$. If the number of covariates is large, more trees must be grown. The number of trees grown for each random forest is approximately max(min_ntree, ntree_factor$*p'$). In general, growing more trees is better (no risk of overfitting), however, if too many trees are grown fuzzy forests will take a very long to run.
drop_fraction: After each random forest, the features with the lowest variable importance ranking are dropped. The number of features dropped at each such step, is equal to ceiling(drop_fraction$p$). Lower values ofdrop_fraction` lead to more aggressive model selection and higher running time.
keep_fraction: keep_fraction is the percentage of features from each module that are retained during the screening step of fuzzy forests.
number_selected is the final number of features selected by fuzzy forests.
final_ntree: After the important features have been selected, a final random forest is fit using these selected features. final_ntree is the number of trees grown in this random forest.
nodesize: This parameter controls the depth of the regression trees.
nodesize controls the minimum size of terminal nodes.
By default, nodesize is set to 1. However, it is often advantageous to set nodesize larger. If the number of observations is large, nodesize often must be set larger than 1 for the sake computational speed or to prevent memory issues.

Fuzzy forests is then fit using the function ff:

ff_fit <- ff(X, NSP, module_membership = module_membership,
            screen_params = screen_params, select_params=select_params,
            final_ntree = 500)

Likewise, fuzzy forests may also be fit via the wff function. Ideally, tuning parameters for WGCNA should be selected with care. Ideally, the resulting modules should be scientifically meaningful. For convenience and to make it easier to get started using fuzzy forests, wff automatically carries out WGCNA. Parameters for WGCNA are input through the object WGCNA_params

WGCNA_params <- WGCNA_control(p = 6, minModuleSize = 1, nThreads = 1)
wff_fit <- wff(X, NSP, WGCNA_params = WGCNA_params,
              screen_params = screen_params,
              select_params = select_params,
              final_ntree = final_ntree,
              num_processors = 1, nodesize = nodesize)

ff_fit <- example_ff

wff and ff both return objects of type fuzzy_forest, a list containing the results of fuzzy forests. A list of the top $k$ features is returned in a data.frame via the following call:

rankings <- ff_fit$feature_list
rankings

The highest ranked feature is ASTV or "percentage of time with abnormal long term variability." Followed by "mean value of short term variability", "histogram mean", "mean value of long term variability", and "minimum of FHR histogram."

After features are selected, a random forest is fit using these selected features. This random forest is accessed by the following command and can be used to obtain predictions for new data. Note that the mse reported below is overly optimistic. For classification, the reported error rates will also be overly optimistic. The recursive feature elimination biases the usual out of bag error rate.

final_rf <- ff_fit$final_rf
final_rf_mse <- tail(final_rf$mse, 1)

cat(" warning!", "\n", "biased estimate of the mse:", final_rf_mse)

The function modplot may be applied to objects of type fuzzy_forest to obtain a graph depicting which modules are over-represented in the list of the most important features. In this case, the turquoise module appears to be slightly over-represented.