Description Usage Arguments Details Value Author(s) Examples
optimize_RF takes different sets of parameters to be used as training parameters. optimize_RF passes each set of parameters to the train_RF function, then returns the accuracies and related measurements (i.e. the number of used genes and rules) for each RF model trained with each set of parameters. Accuracies can be calculated based on the training data or by applying the trained RF model to a separate testing dataset.
optimize_RF(data_object,
            sorted_rules_RF,
            parameters,
            overall = c("Accuracy", "Kappa", "AccuracyLower",
                        "AccuracyUpper", "AccuracyNull", "AccuracyPValue",
                        "McnemarPValue")[1:2],
            byclass = c("Sensitivity", "Specificity",
                        "Pos Pred Value", "Neg Pred Value",
                        "Precision", "Recall", "F1", "Prevalence",
                        "Detection Rate", "Detection Prevalence",
                        "Balanced Accuracy")[c(11)],
            seed = 123456,
            test_object = NULL,
            impute = TRUE,
            impute_reject = 0.67,
            verbose = FALSE)
data_object
    Data object with labels generated by the ReadData function.

sorted_rules_RF
    Sorted rules object generated by the sort_rules_RF function.

parameters
    A dataframe with the variables on which the RF models will be trained. Column names should match the arguments used in the train_RF function. Each row represents one trial (model); e.g. a dataframe with 10 rows means you want to check the performance of 10 different RF models based on 10 different sets of parameters.

overall
    A vector with the names of the overall performance measurements to be reported in the summary table of the results. It can be one or more of: "Accuracy", "Kappa", "AccuracyLower", "AccuracyUpper", "AccuracyNull", "AccuracyPValue", "McnemarPValue". Default is c("Accuracy", "Kappa"). These measurements are based on the output of the confusionMatrix function in the caret package.

byclass
    A vector with the names of the performance measurements for individual classes to be reported in the summary table of the results. It can be one or more of: "Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "Precision", "Recall", "F1", "Prevalence", "Detection Rate", "Detection Prevalence", "Balanced Accuracy". Default is "Balanced Accuracy". These measurements are based on the output of the confusionMatrix function in the caret package.

seed
    Seed to be used in the training process for reproducibility.

test_object
    Data object with labels generated by ReadData to be used as testing data. If this object is provided, then the accuracies and performance results will be based on this object rather than the training data.

impute
    Logical passed to predict_RF when test_object is used, to impute missed genes and NA values in test_object. Default is TRUE.

impute_reject
    A number between 0 and 1 passed to predict_RF when test_object is used. It indicates the threshold for the proportion of missed rules in a sample: samples exceeding this threshold are rejected (i.e. skipped) and their missed rules are not imputed. Default is 0.67.

verbose
    A logical value indicating whether processing messages should be printed. Default is FALSE.
optimize_RF helps the user to optimize the parameters to be used with the train_RF function for a given training dataset.
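As a rough sketch of this workflow (the objects object and rules_RF are assumed to have been created beforehand with ReadData and sort_rules_RF, as in the Examples below), one could define a grid of parameter sets and compare the resulting models:

# sketch: define several parameter sets and compare the resulting models
# (assumes object and rules_RF come from ReadData and sort_rules_RF)
# parameters <- expand.grid(gene_repetition   = c(1, 2),
#                           rules_altogether  = c(10, 50),
#                           rules_one_vs_rest = c(10, 50),
#                           num.trees         = c(300, 500),
#                           stringsAsFactors  = FALSE)
# opt <- optimize_RF(data_object     = object,
#                    sorted_rules_RF = rules_RF,
#                    parameters      = parameters,
#                    overall         = "Accuracy",
#                    byclass         = NULL,
#                    test_object     = NULL,   # or a separate testing data object
#                    verbose         = FALSE)
# opt$summary   # one row per parameter set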
Returns an optimize_RF_output object, which is a list containing:
summary
    A dataframe containing the input parameters, the number of genes and rules in each model, and the selected overall and by-class performance measurements. Each trial (i.e. set of parameters) is one row.

confusionMatrix
    A list of confusionMatrix objects generated by the caret package, containing the full overall and by-class performance for each trial.

errors
    A list of errors generated by the trials.

calls
    The call used to generate this object.
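For example, assuming opt holds the output of optimize_RF (as in the sketch above), these components can be inspected along the following lines:

# opt$summary                                     # one trial per row
# opt$summary[which.max(opt$summary$Accuracy), ]  # best-performing parameter set
# opt$confusionMatrix[[1]]                        # full caret output for the first trial
# opt$errors                                      # errors raised by failed trials
# opt$calls                                       # the call that produced the object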
Nour-al-dain Marzouka <nour-al-dain.marzouka at med.lu.se>
# load the package
library(multiclassPairs)

# generate random data
Data <- matrix(runif(8000), nrow=100, ncol=80,
dimnames = list(paste0("G",1:100), paste0("S",1:80)))
# generate random labels
L <- sample(x = c("A","B","C","D"), size = 80, replace = TRUE)
# generate random platform labels
P <- sample(c("P1","P2","P3"), size = 80, replace = TRUE)
# create data object
object <- ReadData(Data = Data,
Labels = L,
Platform = P,
verbose = FALSE)
# sort genes
genes_RF <- sort_genes_RF(data_object = object,
seed=123456, verbose = FALSE)
# to get an idea of how many genes we will use
# and how many rules will be generated
# summary_genes_RF(sorted_genes_RF = genes_RF,
# genes_altogether = c(10,20,50,100,150,200),
# genes_one_vs_rest = c(10,20,50,100,150,200))
# create and sort rules
# rules_RF <- sort_rules_RF(data_object = object,
# sorted_genes_RF = genes_RF,
# genes_altogether = 100,
# genes_one_vs_rest = 100,
# seed=123456,
# verbose = FALSE)
# parameters <- data.frame(
# gene_repetition=c(3,2,1),
# rules_one_vs_rest=0,
# rules_altogether=c(2,3,10),
# run_boruta=c(FALSE,"produce_error",FALSE),
# plot_boruta = FALSE,
# num.trees=c(100,200,300),
# stringsAsFactors = FALSE)
# parameters
# Or you can use expand.grid to generate a dataframe with all parameter combinations
# parameters <- expand.grid(
# gene_repetition=c(3,2,1),
# rules_one_vs_rest=0,
# rules_altogether=c(2,3,10),
# num.trees=c(100,500,1000),
# stringsAsFactors = FALSE)
# parameters
# test <- optimize_RF(data_object = object,
# sorted_rules_RF = rules_RF,
# test_object = NULL,
# overall = c("Accuracy"),
# byclass = NULL, verbose = FALSE,
# parameters = parameters)
# test
# test$summary[which.max(test$summary$Accuracy),]
#
# # train the final model
# # it is preferable to increase the number of trees and rules in case you have
# # a large number of samples and features
# # for a quick example, we use a small number of trees and rules here
# # based on the optimize_RF results we will select the parameters
# RF_classifier <- train_RF(data_object = object,
# gene_repetition = 1,
# rules_altogether = 0,
# rules_one_vs_rest = 10,
# run_boruta = FALSE,
# plot_boruta = FALSE,
# probability = TRUE,
# num.trees = 300,
# sorted_rules_RF = rules_RF,
# boruta_args = list(),
# verbose = TRUE)
#
# # training accuracy
# # get the prediction labels
# # if the classifier was trained using probability = FALSE
# training_pred <- RF_classifier$RF_scheme$RF_classifier$predictions
# if (is.factor(training_pred)) {
# x <- as.character(training_pred)
# }
#
# # if the classifier was trained using probability = TRUE
# if (is.matrix(training_pred)) {
# x <- colnames(training_pred)[max.col(training_pred)]
# }
#
# # training accuracy
# caret::confusionMatrix(data =factor(x),
# reference = factor(object$data$Labels),
# mode = "everything")
# not to run
# visualize the binary rules in the training dataset
# plot_binary_RF(Data = object,
# classifier = RF_classifier,
# prediction = NULL, as_training = TRUE,
# show_scores = TRUE,
# top_anno = "ref",
# show_predictions = TRUE,
# title = "Training data")
# not to run
# Extract and plot the proximity matrix from the classifier for the training data
# it takes a long time for large data
# proximity_mat <- proximity_matrix_RF(object = object,
# classifier = RF_classifier,
# plot=TRUE,
# return_matrix=TRUE,
# title = "Test",
# cluster_cols = TRUE)
# not to run
# predict
# test_object # any test data
# results <- predict_RF(classifier = RF_classifier, impute = TRUE,
# Data = test_object)
#
# # visualize the binary rules in the test dataset
# plot_binary_RF(Data = test_object,
# classifier = RF_classifier,
# prediction = results, as_training = FALSE,
# show_scores = TRUE,
# top_anno = "ref",
# show_predictions = TRUE,
# title = "Test data")