TADrfe: A wrapper function passed to 'caret::rfe' to apply recursive...

View source: R/TADrfe.R

TADrfeR Documentation

A wrapper function passed to caret::rfe to apply recursive feature elimination (RFE) on binned domain data as a feature reduction technique for random forests. Backward elimination is performed from p down to 2, by powers of 2, where p is the number of features in the data.

Description

A wrapper function passed to caret::rfe to apply recursive feature elimination (RFE) on binned domain data as a feature reduction technique for random forests. Backward elimination is performed from p down to 2, by powers of 2, where p is the number of features in the data.

Usage

TADrfe(
  trainData,
  tuneParams = list(ntree = 500, nodesize = 1),
  cvFolds = 5,
  cvMetric = "Accuracy",
  verbose = FALSE
)

Arguments

trainData

Data frame, the binned data matrix to built a random forest classifiers (can be obtained using createTADdata). Required.

tuneParams

List, providing ntree and nodesize parameters to feed into randomForest. Required.

cvFolds

Numeric, number of k-fold cross-validation to perform. Required.

cvMetric

Character, performance metric to use to choose optimal tuning parameters (one of either "Kappa", "Accuracy", "MCC","ROC","Sens", "Spec", "Pos Pred Value", "Neg Pred Value"). Default is "Accuracy".

verbose

Logical, controls whether or not details regarding modeling should be printed out. Default is TRUE.

Value

A list containing: 1) the performances extracted at each of the k folds and, 2) Variable importances among the top features at each step of RFE. For 1) 'Variables' - the best subset of features to consider at each iteration, 'MCC' (Matthews Correlation Coefficient), 'ROC' (Area under the receiver operating characteristic curve), 'Sens' (Sensitivity), 'Spec' (Specificity), 'Pos Pred Value' (Positive predictive value), 'Neg Pred Value' (Negative predictive value), 'Accuracy', and the corresponding standard deviations across the cross-folds. For 2) 'Overall' - the variable importance, 'var' - the feature name, 'Variables' - the number of features that were considered at each cross-fold, and 'Resample' - the cross-fold

Examples

# Read in ARROWHEAD-called TADs at 5kb
data(arrowhead_gm12878_5kb)

#Extract unique boundaries
bounds.GR <- extractBoundaries(domains.mat = arrowhead_gm12878_5kb,
                               filter = FALSE,
                               CHR = "CHR22",
                               resolution = 5000)

# Read in GRangesList of 26 TFBS
data(tfbsList)

# Create the binned data matrix for CHR22 using:
# 5 kb binning,
# oc-type predictors from 26 different TFBS from the GM12878 cell line, and
# random under-sampling
tadData <- createTADdata(bounds.GR = bounds.GR,
                         resolution = 5000,
                         genomicElements.GR = tfbsList,
                         featureType = "oc",
                         resampling = "rus",
                         trainCHR = "CHR22",
                         predictCHR = NULL)

# Perform RFE for fully grown random forests with 100 trees using 5-fold CV
# Evaluate performances using accuracy
rfe_res <- TADrfe(trainData = tadData[[1]],
                  tuneParams = list(ntree = 100, nodesize = 1),
                  cvFolds = 5,
                  cvMetric = "Accuracy",
                  verbose = TRUE)

dozmorovlab/preciseTAD documentation built on April 26, 2022, 9:42 a.m.