View source: R/designSampleSizeClassification.R
Estimate the mean predictive accuracy and mean protein importance over all the simulated datasets
designSampleSizeClassification(
  simulations,
  classifier = "rf",
  top_K = 10,
  parallel = FALSE,
  ...
)
simulations | A list of simulated datasets. It should be the output of simulateDataset.
classifier | A string specifying which classifier to use. This function uses the function 'train' from package caret. The options are 1) rf (random forest classifier, the default), 2) nnet (neural network), 3) svmLinear (support vector machine with linear kernel), 4) logreg (logistic regression), and 5) naive_bayes (naive Bayes).
top_K | The number of proteins selected as important features (biomarker candidates). All the proteins are ranked in descending order of their importance in separating the different groups, and the first 'top_K' proteins are selected as important features.
parallel | Default is FALSE. If TRUE, parallel computation is performed.
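The ranking that 'top_K' describes can be sketched in base R. This is only an illustration of the idea, not the package's internal code: the protein names and importance values are made up, and select_top_k is a hypothetical helper.

```r
# Hypothetical mean importance values for five proteins (illustrative only).
mean_importance <- c(P1 = 12.5, P2 = 80.1, P3 = 45.0, P4 = 3.2, P5 = 60.7)

# Hypothetical helper: rank proteins by importance in descending order
# and keep the names of the first top_K.
select_top_k <- function(importance, top_K = 10) {
  ranked <- sort(importance, decreasing = TRUE)
  names(head(ranked, top_K))
}

top_proteins <- select_top_k(mean_importance, top_K = 3)
print(top_proteins)
```

When 'top_K' exceeds the number of proteins, this sketch simply returns all of them in ranked order; whether the package behaves the same way is not stated in this help page.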
This function fits the classification model in order to classify the subjects in each simulated training dataset (the output of simulateDataset). The fitted model is then validated on the (simulated) validation set (the output of simulateDataset).
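The fit-then-validate loop described above can be sketched in base R. This is a minimal stand-in under stated assumptions: it uses glm (logistic regression) rather than caret's 'train', the data are generated with rnorm instead of simulateDataset, and make_data and fit_and_validate are hypothetical helpers.

```r
set.seed(1)

# Hypothetical simulator: two groups and three proteins, of which only
# prot1 differs in mean abundance between the groups.
make_data <- function(n_per_group) {
  group <- rep(c(0, 1), each = n_per_group)
  data.frame(
    group = group,
    prot1 = rnorm(2 * n_per_group, mean = 2 * group),
    prot2 = rnorm(2 * n_per_group),
    prot3 = rnorm(2 * n_per_group)
  )
}

# One shared validation set, reused for every simulated training set.
validation <- make_data(50)

# Fit a logistic regression on one training set and return the
# fraction of validation samples that are classified correctly.
fit_and_validate <- function(train, valid) {
  model <- glm(group ~ ., data = train, family = binomial())
  prob <- predict(model, newdata = valid, type = "response")
  predicted <- as.integer(prob > 0.5)
  mean(predicted == valid$group)
}

# Ten simulated training sets, one predictive accuracy per set.
training_sets <- lapply(1:10, function(i) make_data(50))
predictive_accuracy <- vapply(training_sets, fit_and_validate,
                              numeric(1), valid = validation)
print(predictive_accuracy)
```

Keeping the validation set fixed while the training sets vary means the spread of the accuracies reflects training variability only.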
Two performance measures are reported:
(1) The mean predictive accuracy: the function trains the classifier on each simulated training dataset and reports the predictive accuracy of the trained classifier on the validation data (the output of the simulateDataset function). These predictive accuracies are then averaged over all the simulations.
(2) The mean protein importance: this represents the importance of a protein in separating the different groups. It is estimated on each simulated training dataset using the function 'varImp' from package caret. Please refer to the help file of 'varImp' for how each classifier calculates protein importance. The importance values for each protein are then averaged over all the simulations.
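The two averages above can be sketched in base R. The numbers below are made up for illustration, and the object names are borrowed from the Value section only for readability; this is not the package's own code.

```r
# Hypothetical per-simulation results for 3 proteins and 4 simulated datasets.
predictive_accuracy <- c(0.80, 0.85, 0.75, 0.90)
feature_importance <- matrix(
  c(90, 80, 100, 70,
    40, 50,  30, 60,
    10, 20,   0, 10),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("P1", "P2", "P3"), paste0("sim", 1:4))
)

# (1) Average the accuracies over all simulations.
mean_predictive_accuracy <- mean(predictive_accuracy)

# (2) Average each protein's importance over all simulations
#     (rows are proteins, columns are simulated datasets).
mean_feature_importance <- rowMeans(feature_importance)

print(mean_predictive_accuracy)
print(mean_feature_importance)
```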
In addition, the function reports the list of classification models trained on each simulated dataset, the predictive accuracy of each model on the validation set, and the importance values of all the proteins as estimated by each model.
num_proteins is the number of simulated proteins. It should match the num_proteins element of the simulateDataset output.
num_samples is a vector with the number of simulated samples in each condition. It should match the num_samples element of the simulateDataset output.
mean_predictive_accuracy is the mean predictive accuracy over all the simulated datasets, which share the same 'num_proteins' and 'num_samples'.
mean_feature_importance is the mean protein importance vector over all the simulated datasets. Its length is 'num_proteins'.
predictive_accuracy is a vector of the predictive accuracy on each simulated dataset.
feature_importance is a matrix of feature importance, where rows are proteins and columns are simulated datasets.
Ting Huang, Meena Choi, Sumedh Sankhe, Olga Vitek
data(OV_SRM_train)
data(OV_SRM_train_annotation)

# num_simulations = 10: simulate 10 times
# protein_rank = "mean", protein_select = "high", and protein_quantile_cutoff = 0.0:
# select the proteins with high mean abundance based on the protein_quantile_cutoff
# expected_FC = "data": fold change estimated from OV_SRM_train
# simulate_validation = FALSE: use input OV_SRM_train as validation set
# valid_samples_per_group = 50: 50 samples per condition
simulated_datasets <- simulateDataset(data = OV_SRM_train,
                                      annotation = OV_SRM_train_annotation,
                                      log2Trans = FALSE,
                                      num_simulations = 10,
                                      samples_per_group = 50,
                                      protein_rank = "mean",
                                      protein_select = "high",
                                      protein_quantile_cutoff = 0.0,
                                      expected_FC = "data",
                                      list_diff_proteins = NULL,
                                      simulate_validation = FALSE,
                                      valid_samples_per_group = 50)

# run classification on simulated datasets without parallel computation
classification_results <- designSampleSizeClassification(simulations = simulated_datasets,
                                                         parallel = FALSE)

classification_results$num_proteins

# a vector with the number of simulated samples in each condition
classification_results$num_samples

# the mean predictive accuracy over all the simulated datasets,
# which have same 'num_proteins' and 'num_samples'
classification_results$mean_predictive_accuracy

# the mean protein importance vector over all the simulated datasets,
# the length of which is 'num_proteins'.
head(classification_results$mean_feature_importance)