predict_RF: Predict sample class based on gene pair-based random forest...

View source: R/functions.R

predict_RF    R Documentation

Predict sample class based on gene pair-based random forest classifier

Description

predict_RF predicts the sample class using a gene pair-based random forest classifier.

Usage

predict_RF(classifier,
           Data,
           impute = FALSE,
           impute_reject = 0.67,
           impute_kNN = 5,
           verbose = TRUE)

Arguments

classifier

the classifier, as a rule_based_RandomForest object generated by the train_RF function.

Data

a matrix, dataframe, ExpressionSet, or data_object generated by the ReadData function. Samples are in columns and features/genes are in rows.

impute

logical. Determines whether missing genes and NA values should be imputed. The non-missing rules are used to determine the closest samples in the training binary matrix (which is stored in the classifier object). For each sample, the mode value of the nearest samples in the training data is assigned to the missing rules. Default is FALSE.

impute_reject

a number between 0 and 1 giving the maximum allowed fraction of missing rules in a sample. If the fraction of missing rules in a sample is higher than impute_reject, the sample is rejected (i.e. skipped) and its missing rules are not imputed. Default is 0.67. NOTE: the results object will not contain any results for a rejected sample.

impute_kNN

integer. The number of nearest samples in the training data to be used in the imputation. Default is 5. Using a large number (i.e. >10) is not recommended.

verbose

a logical value indicating whether processing messages will be printed or not. Default is TRUE.
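The interplay of impute, impute_reject and impute_kNN described above can be illustrated with a small base R sketch. This is not the package's internal code: the function name impute_rules_sketch, the layout of train_rules (training samples as rows, binary rules as columns) and the use of Hamming distance over the non-missing rules are assumptions made only for illustration.

```r
# Hypothetical helper, NOT part of multiclassPairs: impute the missing
# binary rules of one sample from its nearest training samples.
# Assumption: train_rules has training samples as rows and rules as columns.
impute_rules_sketch <- function(sample_rules, train_rules,
                                impute_reject = 0.67, impute_kNN = 5) {
  missing <- is.na(sample_rules)

  # Reject the sample when the fraction of missing rules exceeds the threshold
  if (mean(missing) > impute_reject) return(NULL)
  if (!any(missing)) return(sample_rules)

  # Distance to each training sample, counted over the non-missing rules only
  # (Hamming distance is an assumption of this sketch)
  observed <- !missing
  sample_mat <- matrix(sample_rules[observed],
                       nrow = nrow(train_rules), ncol = sum(observed),
                       byrow = TRUE)
  dists <- rowSums(train_rules[, observed, drop = FALSE] != sample_mat)
  nearest <- order(dists)[seq_len(min(impute_kNN, nrow(train_rules)))]

  # Assign the mode (most frequent value) among the nearest training samples
  for (j in which(missing)) {
    votes <- table(train_rules[nearest, j])
    sample_rules[j] <- as.numeric(names(votes)[which.max(votes)])
  }
  sample_rules
}

# Toy training binary matrix: 5 samples (rows) x 4 rules (columns)
train <- rbind(c(1, 0, 1, 1),
               c(1, 0, 1, 0),
               c(0, 1, 0, 0),
               c(1, 0, 1, 1),
               c(0, 1, 0, 1))

# One missing rule out of four (25%): imputed from the 3 nearest samples
imputed <- impute_rules_sketch(c(1, 0, NA, 1), train, impute_kNN = 3)

# Three missing rules out of four (75% > 0.67): the sample is rejected
rejected <- impute_rules_sketch(c(NA, NA, NA, 1), train, impute_kNN = 3)
```

In this toy case the function returns NULL for the rejected sample, which mirrors the documented behaviour that rejected samples are absent from the results.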

Value

returns a predictions object of class "ranger.prediction" from the ranger package. If the RF classifier was trained with probability=TRUE, the results contain the scores for the classes in results$predictions. In this case predict_RF also adds a new slot, results$predictions_classes, containing a vector with the predicted class for each sample based on the highest score in results$predictions. If the RF classifier was trained with probability=FALSE, results$predictions contains the final class only and no scores are provided.

If a sample was rejected in the imputation process (i.e. it exceeded the impute_reject cutoff), it is not included in the prediction results. Keep this in mind when matching the input samples with the results, for example to build a confusion matrix. To make this matching easier, predict_RF adds the sample names as names/row names to the factor/matrix in results$predictions.
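As a minimal sketch of the points in the Value description, the following base R code uses a mock score matrix in place of results$predictions from a classifier trained with probability=TRUE. All object names and values here (input_samples, ref_labels, scores) are invented for illustration and do not come from the package.

```r
# Mock objects, for illustration only.
# Four samples were submitted for prediction; S3 is assumed to have been
# rejected during imputation, so it has no row in the score matrix.
input_samples <- c("S1", "S2", "S3", "S4")
ref_labels <- c(S1 = "A", S2 = "B", S3 = "A", S4 = "B")

# Stand-in for results$predictions: rows are predicted samples,
# columns are classes, row names are the sample names
scores <- matrix(c(0.8, 0.2,
                   0.3, 0.7,
                   0.6, 0.4),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("S1", "S2", "S4"), c("A", "B")))

# Class with the highest score for each sample
pred_classes <- colnames(scores)[max.col(scores, ties.method = "first")]
names(pred_classes) <- rownames(scores)

# Samples that are absent from the results were rejected
rejected <- setdiff(input_samples, rownames(scores))

# Keep only the reference labels of the samples that were predicted,
# in the same order as the predictions
ref_kept <- ref_labels[rownames(scores)]

# Cross-tabulate predictions against the matched reference labels
table(predicted = pred_classes, reference = ref_kept)
```

Indexing the reference labels by the row names of the predictions keeps both vectors aligned even when some samples were skipped.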

Author(s)

Nour-al-dain Marzouka <nour-al-dain.marzouka at med.lu.se>

Examples

# generate random data
Data <- matrix(runif(8000), nrow=100, ncol=80,
               dimnames = list(paste0("G",1:100), paste0("S",1:80)))

# generate random labels
L <- sample(x = c("A","B","C","D"), size = 80, replace = TRUE)

# generate random platform labels
P <- sample(c("P1","P2","P3"), size = 80, replace = TRUE)

# create data object
object <- ReadData(Data = Data,
                   Labels = L,
                   Platform = P,
                   verbose = FALSE)

# sort genes
genes_RF <- sort_genes_RF(data_object = object,
                          seed=123456, verbose = FALSE)

# to get an idea of how many genes we will use
# and how many rules will be generated
# summary_genes_RF(sorted_genes_RF = genes_RF,
#                  genes_altogether = c(10,20,50,100,150,200),
#                  genes_one_vs_rest = c(10,20,50,100,150,200))

# create and sort rules
# rules_RF <- sort_rules_RF(data_object = object,
#                           sorted_genes_RF = genes_RF,
#                           genes_altogether = 100,
#                           genes_one_vs_rest = 100,
#                           seed=123456,
#                           verbose = FALSE)

# parameters <- data.frame(
#   gene_repetition=c(3,2,1),
#   rules_one_vs_rest=0,
#   rules_altogether=c(2,3,10),
#   run_boruta=c(FALSE,"produce_error",FALSE),
#   plot_boruta = FALSE,
#   num.trees=c(100,200,300),
#   stringsAsFactors = FALSE)
# parameters

# Or you can use expand.grid to generate dataframe with all parameter combinations
# parameters <- expand.grid(
#   gene_repetition=c(3,2,1),
#   rules_one_vs_rest=0,
#   rules_altogether=c(2,3,10),
#   num.trees=c(100,500,1000),
#   stringsAsFactors = FALSE)
# parameters


# test <- optimize_RF(data_object = object,
#                     sorted_rules_RF = rules_RF,
#                     test_object = NULL,
#                     overall = c("Accuracy"),
#                     byclass = NULL, verbose = FALSE,
#                     parameters = parameters)
# test
# test$summary[which.max(test$summary$Accuracy),]
#
# # train the final model
# # it is preferable to increase the number of trees and rules if you have
# # a large number of samples and features
# # for a quick example, we use a small number of trees and rules here
# # the parameters are selected based on the optimize_RF results
# RF_classifier <- train_RF(data_object = object,
#                           gene_repetition = 1,
#                           rules_altogether = 0,
#                           rules_one_vs_rest = 10,
#                           run_boruta = FALSE,
#                           plot_boruta = FALSE,
#                           probability = TRUE,
#                           num.trees = 300,
#                           sorted_rules_RF = rules_RF,
#                           boruta_args = list(),
#                           verbose = TRUE)
#
# # training accuracy
# # get the prediction labels
# # if the classifier was trained using probability = FALSE
# training_pred <- RF_classifier$RF_scheme$RF_classifier$predictions
# if (is.factor(training_pred)) {
#   x <- as.character(training_pred)
# }
#
# # if the classifier was trained using probability = TRUE
# if (is.matrix(training_pred)) {
#   x <- colnames(training_pred)[max.col(training_pred)]
# }
#
# # training accuracy
# caret::confusionMatrix(data = factor(x),
#                 reference = factor(object$data$Labels),
#                 mode = "everything")

# not to run
# visualize the binary rules in training dataset
# plot_binary_RF(Data = object,
#                classifier = RF_classifier,
#                prediction = NULL, as_training = TRUE,
#                show_scores = TRUE,
#                top_anno = "ref",
#                show_predictions = TRUE,
#                title = "Training data")

# not to run
# extract and plot the proximity matrix from the classifier for the training data
# this can take a long time for large data
# proximity_mat <- proximity_matrix_RF(object = object,
#                       classifier = RF_classifier,
#                       plot=TRUE,
#                       return_matrix=TRUE,
#                       title = "Test",
#                       cluster_cols = TRUE)

# not to run
# predict
# test_object # any test data
# results <- predict_RF(classifier = RF_classifier, impute = TRUE,
#                       Data = test_object)
#
# # visualize the binary rules in the test dataset
# plot_binary_RF(Data = test_object,
#                classifier = RF_classifier,
#                prediction = results, as_training = FALSE,
#                show_scores = TRUE,
#                top_anno = "ref",
#                show_predictions = TRUE,
#                title = "Test data")

NourMarzouka/multiclassPairs documentation built on May 3, 2023, 7:20 p.m.