Fuzzy forests is an extension of random forests designed to yield less biased variable importance rankings when there is high correlation among the variables. For further information about fuzzy forests see the paper at the following link: https://github.com/daniel-conn17/FuzzyForestPaper. In this vignette we introduce the basic capabilities of the fuzzyforest package. We demonstrate two methods of fitting fuzzy forests. The first method allows the user to pre-specify how features should be grouped prior to application of fuzzy forests. This method uses the ff function.

fuzzyforest also allows for easy integration with Weighted Gene Co-expression Network Analysis (WGCNA) via the function wff. Although WGCNA was motivated by problems in genetics, it can be viewed as a more general framework for network analysis and clustering of features. At its core, WGCNA uses information derived from the correlation matrix to separate the features into distinct "modules". As a result, we believe it is appropriate to apply WGCNA to datasets in contexts aside from genetics. When wff is called, WGCNA is used to partition the covariates into distinct modules such that the modules are roughly uncorrelated with one another. Fuzzy forests is then applied using this partition.
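
To make the idea of correlation-based modules concrete, here is a minimal base R sketch on simulated data. It is not WGCNA itself: it raises absolute correlations to a power and applies ordinary hierarchical clustering, whereas WGCNA adds further steps such as topological overlap and dynamic tree cutting. The simulated data, the names X_sim and modules_sim, and the choice of two clusters are assumptions made for illustration.

```r
# Simplified stand-in for WGCNA: cluster features on a correlation-based
# dissimilarity. All names and settings here are illustrative assumptions.
set.seed(1)
n <- 200
z1 <- rnorm(n)  # latent factor driving the first group of features
z2 <- rnorm(n)  # latent factor driving the second group of features
X_sim <- data.frame(
  a1 = z1 + rnorm(n, sd = 0.3),
  a2 = z1 + rnorm(n, sd = 0.3),
  a3 = z1 + rnorm(n, sd = 0.3),
  b1 = z2 + rnorm(n, sd = 0.3),
  b2 = z2 + rnorm(n, sd = 0.3),
  b3 = z2 + rnorm(n, sd = 0.3)
)

power <- 6
adjacency <- abs(cor(X_sim))^power        # soft-thresholded correlations
dissimilarity <- as.dist(1 - adjacency)   # highly correlated features are close
tree <- hclust(dissimilarity, method = "average")
modules_sim <- cutree(tree, k = 2)        # one module label per feature
print(modules_sim)
```

Features driven by the same latent factor should land in the same module, while the two groups, which are nearly uncorrelated, should be separated.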

In this vignette we analyze a data set concerning fetal heart rate and uterine contraction from cardiotocograms. For the purpose of this vignette, we utilize a randomly chosen subsample of the full data set.
The full data set can be obtained from the UCI machine learning repository.
The data set contains 100 observations and 21 features.
The outcome is categorical and represents the state of the fetus. It takes one of three values (N = normal; S = suspect; P = pathologic).

Installing WGCNA

In order to use WGCNA with fuzzyforest, packages from Bioconductor must be installed.

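To install these packages, type the following commands into the R console. The function biocLite is defined by the Bioconductor installer script, which must be sourced first. On R 3.5 or later, biocLite has been superseded by BiocManager::install.

source("http://bioconductor.org/biocLite.R")
biocLite("AnnotationDbi", type="source")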

Further information about the installation requirements for WGCNA can be found at the following link: http://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/#manualInstall.

Attaching Required Packages

In general, the packages WGCNA and randomForest must be attached to take full advantage of fuzzyforest's functionality.
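
A minimal sketch of attaching the packages, assuming all three are already installed. Attaching fuzzyforest itself is included here as an assumption, since the functions and data used below come from it.

```r
# Attach the packages used throughout this vignette.
library(WGCNA)         # module detection (blockwiseModules)
library(randomForest)  # random forests and varImpPlot
library(fuzzyforest)   # ff, wff, screen_control, select_control, modplot
```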


Fuzzy Forests

To use the function ff, we first need to obtain a partitioning of the features into distinct clusters. In this vignette, we use WGCNA to partition the features.

#set seed so that results are reproducible
set.seed(1)  #the seed value is arbitrary

#extract features and outcome from ctg data set
X <- ctg[, 2:22]
NSP <- ctg[, 1]
net <- blockwiseModules(X, power = 6, minModuleSize = 1)

We then extract the module membership of each feature.

module_membership <- net$colors
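
The partition does not have to come from WGCNA. When the grouping of features is known in advance, a membership vector can be built by hand. The sketch below uses base R only; the feature names, the group names, and the helper objects groups and lookup are hypothetical and chosen purely for illustration.

```r
# Hypothetical feature names and groupings, for illustration only.
feature_names <- c("f1", "f2", "f3", "f4", "f5", "f6")
groups <- list(groupA = c("f1", "f2", "f3"),
               groupB = c("f4", "f5"),
               groupC = "f6")

# Map each feature name to the name of the group that contains it.
lookup <- setNames(rep(names(groups), lengths(groups)),
                   unlist(groups, use.names = FALSE))

# One label per feature, in the same order as the feature names.
manual_membership <- unname(lookup[feature_names])
table(manual_membership)
```

A vector like this could be supplied as the module_membership argument of ff, assuming it holds one label per column of X in column order.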

We then set up values for various tuning parameters. Fuzzy forests first screens out unimportant features from each module via recursive feature elimination. Then it selects the top $k$ features, where $k$ is prespecified by the user. screen_params contains tuning parameters pertaining to the elimination of features within modules. select_params contains tuning parameters pertaining to the elimination of features surviving this initial screening step.
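
To give a feel for how drop_fraction and keep_fraction interact during screening, the following base R sketch computes how many features of a single module would remain after each elimination round. The helper screening_schedule is hypothetical and not part of fuzzyforest, and its ceiling-based rounding is an assumption; the package's exact rounding may differ.

```r
# Hypothetical helper: number of features left in a module after each
# screening round. The rounding rule is an assumption for illustration.
screening_schedule <- function(p, drop_fraction, keep_fraction) {
  target <- ceiling(p * keep_fraction)  # features that should survive screening
  sizes <- p
  current <- p
  while (current > target) {
    dropped <- ceiling(current * drop_fraction)  # least important features removed
    current <- max(current - dropped, target)    # never drop below the target
    sizes <- c(sizes, current)
  }
  sizes
}

# A module of 40 features with the settings used in this vignette.
screening_schedule(p = 40, drop_fraction = 0.5, keep_fraction = 0.25)
# 40 20 10
```

A larger drop_fraction reaches the target in fewer rounds, at the cost of discarding more features on the basis of each individual random forest fit.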

Note that, because there are only 21 covariates, minModuleSize must be set to 1 (by default it is 30).

net <- blockwiseModules(X, power = 6, minModuleSize = 1, nThreads = 1)
module_membership <- net$colors
mtry_factor <- 1
min_ntree <- 500
drop_fraction <- 0.5
ntree_factor <- 1
nodesize <- 1
final_ntree <- 500
screen_params <- screen_control(drop_fraction = drop_fraction,
                                keep_fraction = 0.25, min_ntree = min_ntree,
                                ntree_factor = ntree_factor,
                                mtry_factor = mtry_factor)
select_params <- select_control(drop_fraction = drop_fraction,
                                number_selected = 5,
                                min_ntree = min_ntree,
                                ntree_factor = ntree_factor,
                                mtry_factor = mtry_factor)

Tips for Setting Tuning Parameters

Fuzzy forests is then fit using the function ff:

ff_fit <- ff(X, NSP, module_membership = module_membership,
            screen_params = screen_params, select_params=select_params,
            final_ntree = 500)

Likewise, fuzzy forests may also be fit via the wff function. Ideally, tuning parameters for WGCNA should be selected with care so that the resulting modules are scientifically meaningful. For convenience, and to make it easier to get started with fuzzy forests, wff carries out WGCNA automatically. Parameters for WGCNA are passed through the object WGCNA_params:

WGCNA_params <- WGCNA_control(p = 6, minModuleSize = 1, nThreads = 1)
wff_fit <- wff(X, NSP, WGCNA_params = WGCNA_params,
              screen_params = screen_params,
              select_params = select_params,
              final_ntree = final_ntree,
              num_processors = 1, nodesize = nodesize)
#use the example fit example_ff for the remainder of the vignette
ff_fit <- example_ff

wff and ff both return objects of type fuzzy_forest, a list containing the results of fuzzy forests. A list of the top $k$ features is returned in a data.frame via the following call:

rankings <- ff_fit$feature_list

The highest ranked feature is ASTV, the "percentage of time with abnormal short term variability", followed by "mean value of short term variability", "histogram mean", "mean value of long term variability", and "minimum of FHR histogram".

After features are selected, a random forest is fit using these selected features. This random forest is accessed by the following command and can be used to obtain predictions for new data. Note that the MSE reported below is overly optimistic; for classification, the reported error rates are likewise overly optimistic. The reason is that the recursive feature elimination has already used the outcome to choose the features, which biases the usual out-of-bag error estimate.

final_rf <- ff_fit$final_rf
final_rf_mse <- tail(final_rf$mse, 1)
cat(" warning!", "\n", "biased estimate of the mse:", final_rf_mse)
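
The optimism can be illustrated with a small base R simulation. This sketch swaps in a simpler setting than fuzzy forests: correlation-based feature selection stands in for recursive feature elimination, a linear model stands in for the random forest, and leave-one-out cross-validation stands in for the out-of-bag estimate. All object names and settings are assumptions made for illustration. The features are pure noise, so any apparent predictive power is spurious.

```r
# Pure-noise features: the outcome is independent of every column.
set.seed(1)
n_train <- 100
n_test <- 2000
p <- 500
x_all <- matrix(rnorm((n_train + n_test) * p), ncol = p)
colnames(x_all) <- paste0("x", seq_len(p))
y_all <- rnorm(n_train + n_test)
train <- seq_len(n_train)

# Select the 5 features most correlated with the outcome, using training rows.
scores <- abs(cor(x_all[train, ], y_all[train]))[, 1]
selected <- names(sort(scores, decreasing = TRUE))[1:5]

train_df <- data.frame(y = y_all[train], x_all[train, selected])
test_df <- data.frame(y = y_all[-train], x_all[-train, selected])
fit <- lm(y ~ ., data = train_df)

# Leave-one-out error computed after selection: each row is left out of the
# fit, but it has already influenced which features were selected.
mse_after_selection <- mean((residuals(fit) / (1 - hatvalues(fit)))^2)

# Error on rows that played no part in selection or fitting.
mse_held_out <- mean((test_df$y - predict(fit, test_df))^2)

c(after_selection = mse_after_selection, held_out = mse_held_out)
```

The error computed on the data used for selection should come out noticeably smaller than the held-out error, which is the same effect that makes the out-of-bag error of the final random forest optimistic.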

The function modplot may be applied to objects of type fuzzy_forest to obtain a graph depicting which modules are over-represented in the list of the most important features. In this case, the turquoise module appears to be slightly over-represented.
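
A minimal sketch of the call, assuming the fuzzyforest package is installed and that example_ff, the fit used earlier in this vignette, is available from the package:

```r
library(fuzzyforest)

# Fit used in this vignette (assumed to be available as package data).
ff_fit <- example_ff

# Plot how the modules are represented among the selected features.
modplot(ff_fit)
```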




The variable importances of the selected features can be graphically displayed by using the function varImpPlot from the package randomForest.
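
A minimal sketch of the call, assuming the randomForest and fuzzyforest packages are installed and that example_ff, the fit used earlier in this vignette, is available from the package:

```r
library(randomForest)
library(fuzzyforest)

# Fit used in this vignette (assumed to be available as package data).
ff_fit <- example_ff

# Plot the variable importances of the final random forest.
varImpPlot(ff_fit$final_rf)
```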


