designSampleSizeClassification: Estimate the mean predictive accuracy and mean protein...

View source: R/designSampleSizeClassification.R

Description

Estimate the mean predictive accuracy and mean protein importance over all the simulated datasets

Usage

designSampleSizeClassification(
  simulations,
  classifier = "rf",
  top_K = 10,
  parallel = FALSE,
  ...
)

Arguments

simulations

A list of simulated datasets. It should be the output of the simulateDataset function.

classifier

A string specifying which classifier to use. This function uses the function 'train' from the package caret. The options are 1) rf (random forest classifier, the default), 2) nnet (neural network), 3) svmLinear (support vector machine with a linear kernel), 4) logreg (logistic regression), and 5) naive_bayes (naive Bayes).

top_K

The number of proteins selected as important features (biomarker candidates). All proteins are ranked in descending order of their importance in separating the groups, and the 'top_K' highest-ranked proteins are selected as important features.

parallel

Default is FALSE. If TRUE, parallel computation is performed.
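
The ranking described for 'top_K' can be illustrated with a minimal base R sketch. It is not the package's internal code; the importance values and the protein names P1 to P5 are made up for illustration.

```r
# Hypothetical mean importance values for five proteins (illustrative only)
mean_importance <- c(P1 = 12.5, P2 = 40.1, P3 = 3.2, P4 = 27.8, P5 = 19.0)

# Number of proteins to keep as important features
top_K <- 3

# Rank proteins in descending order of importance
ranked <- sort(mean_importance, decreasing = TRUE)

# Keep the names of the top_K highest-ranked proteins
# (guarding against top_K exceeding the number of proteins)
top_proteins <- names(ranked)[seq_len(min(top_K, length(ranked)))]
top_proteins
```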

Details

This function fits a classification model to each simulated training dataset (the output of simulateDataset) in order to classify the subjects. Each fitted model is then validated on the (simulated) validation set, which is also part of the output of simulateDataset. Two performance measures are reported:

(1) The mean predictive accuracy: the function trains the classifier on each simulated training dataset and reports the predictive accuracy of the trained classifier on the validation data (output of the simulateDataset function). These predictive accuracies are then averaged over all the simulations.

(2) The mean protein importance: this represents the importance of a protein in separating the groups. It is estimated on each simulated training dataset using the function 'varImp' from the package caret. Please refer to the help file of 'varImp' for how each classifier calculates the protein importance. The importance values for each protein are then averaged over all the simulations.

Also reported are the list of classification models trained on each simulated dataset, the predictive accuracy of each model on the validation set, and the importance values of all the proteins as estimated by each model.
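
The train, validate and average workflow described above can be sketched in base R. This is only an illustration under stated assumptions: it uses 'glm' (logistic regression) as a stand-in for caret's 'train', and the helper functions simulate_group_data and accuracy_one, the protein names and the effect sizes are all hypothetical, not part of MSstatsSampleSize.

```r
set.seed(42)

# Hypothetical helper: draw n_per_group samples for two conditions.
# protein_A differs between conditions, protein_B does not.
simulate_group_data <- function(n_per_group) {
  data.frame(
    condition = factor(rep(c("control", "disease"), each = n_per_group)),
    protein_A = c(rnorm(n_per_group, mean = 0), rnorm(n_per_group, mean = 2)),
    protein_B = c(rnorm(n_per_group, mean = 0), rnorm(n_per_group, mean = 0))
  )
}

# One shared validation set with 50 samples per condition
validation <- simulate_group_data(50)

# Hypothetical helper: fit on one training set, score on the validation set
accuracy_one <- function(train, valid) {
  # Logistic regression stand-in for caret::train (assumption)
  fit <- glm(condition ~ ., data = train, family = binomial)
  # Predicted probability of the second factor level ("disease")
  prob <- predict(fit, newdata = valid, type = "response")
  pred <- ifelse(prob > 0.5, "disease", "control")
  # Fraction of validation samples classified correctly
  mean(pred == valid$condition)
}

# Repeat over five simulated training sets with 20 samples per condition
predictive_accuracy <- vapply(
  1:5,
  function(i) accuracy_one(simulate_group_data(20), validation),
  numeric(1)
)

# Average the per-simulation accuracies
mean_predictive_accuracy <- mean(predictive_accuracy)
mean_predictive_accuracy
```

Increasing the number of training samples per condition in such a sketch shows how the mean accuracy changes with sample size, which is the question this function is designed to help answer.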

Value

num_proteins is the number of simulated proteins. It should be the same as the num_proteins element of the output from simulateDataset.

num_samples is a vector with the number of simulated samples in each condition. It should be the same as the num_samples element of the output from simulateDataset.

mean_predictive_accuracy is the mean predictive accuracy over all the simulated datasets, which share the same 'num_proteins' and 'num_samples'.

mean_feature_importance is the mean protein importance vector over all the simulated datasets. Its length is 'num_proteins'.

predictive_accuracy is a vector of predictive accuracy on each simulated dataset.

feature_importance is a matrix of feature importance, where rows are proteins and columns are simulated datasets.
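
As a rough illustration of how the per-dataset values relate to the means described above, the following base R sketch averages toy stand-ins for predictive_accuracy and feature_importance. The numbers, the protein names and the column names are invented; the sketch assumes the means are plain arithmetic averages, as the Details section describes.

```r
# Toy stand-in for predictive_accuracy: one value per simulated dataset
predictive_accuracy <- c(0.82, 0.78, 0.85)

# Toy stand-in for feature_importance: rows are proteins,
# columns are simulated datasets (matrix() fills column by column)
feature_importance <- matrix(
  c(10, 20, 30,
    12, 18, 33,
    14, 22, 27),
  nrow = 3,
  dimnames = list(c("P1", "P2", "P3"),
                  c("Simulation1", "Simulation2", "Simulation3"))
)

# Average accuracy over the simulated datasets
mean_predictive_accuracy <- mean(predictive_accuracy)

# Average importance per protein over the simulated datasets
mean_feature_importance <- rowMeans(feature_importance)
mean_feature_importance
```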

Author(s)

Ting Huang, Meena Choi, Sumedh Sankhe, Olga Vitek

Examples

data(OV_SRM_train)
data(OV_SRM_train_annotation)

# num_simulations = 10: simulate 10 times
# protein_rank = "mean", protein_select = "high", and protein_quantile_cutoff = 0.0:
# select the proteins with high mean abundance based on the protein_quantile_cutoff
# expected_FC = "data": fold change estimated from OV_SRM_train
# simulate_validation = FALSE: use input OV_SRM_train as validation set
# valid_samples_per_group = 50: 50 samples per condition
simulated_datasets <- simulateDataset(data = OV_SRM_train,
                                      annotation = OV_SRM_train_annotation,
                                      log2Trans = FALSE,
                                      num_simulations = 10,
                                      samples_per_group = 50,
                                      protein_rank = "mean",
                                      protein_select = "high",
                                      protein_quantile_cutoff = 0.0,
                                      expected_FC = "data",
                                      list_diff_proteins =  NULL,
                                      simulate_validation = FALSE,
                                      valid_samples_per_group = 50)

# run classification on simulated datasets without parallel computation
classification_results <- designSampleSizeClassification(simulations = simulated_datasets,
                                                         parallel = FALSE)

# the number of simulated proteins
classification_results$num_proteins

# a vector with the number of simulated samples in each condition
classification_results$num_samples

# the mean predictive accuracy over all the simulated datasets,
# which have same 'num_proteins' and 'num_samples'
classification_results$mean_predictive_accuracy

# the mean protein importance vector over all the simulated datasets,
# the length of which is 'num_proteins'.
head(classification_results$mean_feature_importance)

Vitek-Lab/MSstatsSampleSize documentation built on Aug. 28, 2020, 10:39 a.m.