get_random_rf_results: get_random_rf_results


View source: R/RF_Utilities.R

Description

Runs a pipeline similar to Run_RF_Pipeline, but takes random scramblings of the class assignments for each sample (row in the feature table). The results from this function can act as a null distribution to compare models against.

Usage

get_random_rf_results(
  feature_table,
  list_of_scrambles,
  metric = "ROC",
  sampling = NULL,
  repeats = 10,
  path,
  nmtry = 6,
  ntree = 1001,
  nfolds = 3,
  ncrossrepeats = 10,
  pro = 0.8,
  list_of_seeds
)

Arguments

feature_table

The feature table that contains the information to be input into the random forest classifier. Note that this table should not include information about the classes that are being predicted.

list_of_scrambles

A list of vectors whose length equals the number of repeats for which cross validation should be run. Each item within this list should contain a random scrambling of the classes assigned to the samples.
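One way such a list could be built is sketched below using base R only. The names `classes` and `n_repeats` are hypothetical, and it is an assumption that each element may be a factor with one label per sample.

```r
# Hypothetical class labels, one per sample (row) of the feature table.
classes <- factor(c(rep("Case", 12), rep("Control", 8)))

# Assumed to match the value passed as `repeats`.
n_repeats <- 10

# Seed set only so that this sketch is reproducible.
set.seed(1995)

# Each element is a random permutation of the true labels, so the class
# proportions are preserved while the link to the samples is broken.
list_of_scrambles <- lapply(seq_len(n_repeats), function(i) sample(classes))
```

Permuting the labels, rather than drawing new ones, keeps the class balance of every scramble identical to that of the real data.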

metric

A string that indicates whether the pipeline should use AUROC or AUPRC. For AUROC set metric="ROC". For AUPRC set metric="PR". Defaults to "ROC".

sampling

A string indicating the type of sampling that should be done in case of imbalanced class designs. Options include: "up", "down", "SMOTE" and NULL.

repeats

The number of times data should be split into testing and cross-validation datasets.

path

A string representing the path where output files should be saved.

nmtry

An integer representing the number of different mtry values that you want to test during cross validation. The values of mtry to test are calculated as follows: mtry <- round(seq(1, number_of_features/3, length=nmtry)). Defaults to 6.
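As a worked example of the formula above, the sketch below evaluates the grid for a hypothetical table with 90 features. The value of `number_of_features` is an assumption chosen for illustration; `length.out` is the full name of the `length` argument used in the formula.

```r
# Hypothetical number of features (columns) in the feature table.
number_of_features <- 90
nmtry <- 6

# Evenly spaced candidates between 1 and one third of the feature count,
# rounded to whole numbers.
mtry <- round(seq(1, number_of_features / 3, length.out = nmtry))
mtry
# [1]  1  7 13 18 24 30
```

With very small feature tables the rounding can yield repeated values, so the number of distinct mtry candidates may be smaller than nmtry.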

ntree

An integer that represents the number of trees that you want to use during random forest construction. Defaults to 1001.

nfolds

An integer that represents the number of folds to use during cross validation. Defaults to 3.

ncrossrepeats

An integer that represents the number of times to run cross validation on k folds. Defaults to 10.

pro

The proportion of samples that should be used for training versus testing during cross validation. Defaults to 0.8.

list_of_seeds

A vector containing a number of seeds that should be equal to the number of repeats.

SEED

The random seed used to split the samples during cross validation. Defaults to 1995.

Value

This function returns a list with the following characteristics:

Object[[1]] contains all the median cross validation AUCs from each data split using the best mtry value.

Object[[2]] contains all the test AUC values from each data split.

Object[[3]] contains all the tested mtry values and the median ROC for each from each data split.

Object[[4]] contains the list of important features from the best model selected from each data split.

Object[[5]] contains each caret random forest model from each data split.

This function will also write a csv with cross validation AUCs and test AUCs to the given path, as well as an RDS file that contains the resulting object from this function.
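Because the scrambled results are meant to serve as a null distribution, one possible comparison is an empirical p-value for an observed test AUC. The sketch below is only an illustration: `null_test_auc` and `observed_auc` are hypothetical names holding simulated placeholder values, not output of this function, and it assumes the test AUCs in Object[[2]] can be treated as a numeric vector.

```r
# Seed set only so that this sketch is reproducible.
set.seed(1)

# Simulated placeholder values standing in for the test AUCs obtained
# from scrambled labels. These are not real results.
null_test_auc <- runif(100, min = 0.35, max = 0.65)

# Placeholder test AUC from a model trained on the true labels.
observed_auc <- 0.82

# One-sided empirical p-value: the share of null AUCs at least as large
# as the observed one, with a +1 correction so it is never exactly zero.
p_value <- (sum(null_test_auc >= observed_auc) + 1) /
  (length(null_test_auc) + 1)
p_value
```

The smallest p-value this approach can report is 1 / (number of scrambles + 1), so more scrambles give finer resolution.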


nearinj/RandomForestUtils documentation built on July 30, 2020, 9:51 a.m.