designSampleSizeClassificationPlots: Visualization for sample size calculation in classification


View source: R/designSampleSizeClassificationPlots.R

Description

Illustrates the mean classification accuracy and protein importance under different sample sizes through a predictive accuracy plot and a protein importance plot.

Usage

designSampleSizeClassificationPlots(
  data,
  optimal_threshold = 0.001,
  num_important_proteins_show = 10,
  protein_importance_plot = TRUE,
  predictive_accuracy_plot = TRUE,
  save.pdf = FALSE,
  ...
)

Arguments

data

A list of outputs from function designSampleSizeClassification. Each element represents the results under a specific sample size. The input should include at least two simulation results with different sample sizes.

optimal_threshold

The maximal cutoff for deciding the optimal sample size. Default is 0.001. A large cutoff can lead to a smaller optimal sample size, whereas a small cutoff produces a larger optimal sample size.

num_important_proteins_show

The number of proteins to show in the protein importance plot.

protein_importance_plot

TRUE (default) draws the protein importance plot.

predictive_accuracy_plot

TRUE (default) draws the predictive accuracy plot.

save.pdf

A logical value that determines whether the plots are saved as a PDF. The PDF is saved in the current working directory; the name of the created file is displayed on the console and logged for easier access.

...

Arguments that can be passed to ggplot2::theme to alter the visuals.

Details

This function visualizes the results of sample size calculation in classification. The mean predictive accuracy and mean protein importance under each sample size are taken from the input ‘data’, which is the output of the function designSampleSizeClassification.

To illustrate the mean predictive accuracy and protein importance under different sample sizes, it generates two types of plots, which are saved as PDF files when save.pdf = TRUE: (1) The predictive accuracy plot. The X-axis represents the sample sizes and the Y-axis represents the mean predictive accuracy. The reported sample size per condition can be used to design future experiments.

(2) The protein importance plot, which includes multiple subplots. The number of subplots is equal to the number of sample sizes in ‘data’. Each subplot shows the top 'num_important_proteins_show' most important proteins under one sample size. The Y-axis of each subplot is the protein name and the X-axis is the mean protein importance under that sample size.
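
As a rough illustration of what each subplot displays, the sketch below ranks proteins by mean importance and keeps the top 'num_important_proteins_show' for each sample size. The protein names and importance values are made-up toy numbers, and top_proteins is a hypothetical helper that is not part of MSstatsSampleSize; the package's internal computation may differ.

```r
# Toy mean importance values (hypothetical), one named vector per sample size
mean_importance <- list(
  "10" = c(P1 = 0.40, P2 = 0.10, P3 = 0.25, P4 = 0.05),
  "25" = c(P1 = 0.45, P2 = 0.08, P3 = 0.30, P4 = 0.12)
)

# Hypothetical helper: keep the n proteins with the largest mean importance
top_proteins <- function(importance, n) {
  ranked <- sort(importance, decreasing = TRUE)
  head(ranked, n)
}

# One ranked vector per sample size, analogous to one subplot per sample size
top_by_size <- lapply(mean_importance, top_proteins, n = 2)
print(top_by_size)
```

Each element of top_by_size corresponds to one subplot: the names are the proteins on the Y-axis and the values are the mean importances on the X-axis.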

Value

The predictive accuracy plot shows the mean predictive accuracy under different sample sizes. The X-axis represents the sample sizes and the Y-axis represents the mean predictive accuracy.

The protein importance plot includes multiple subplots. The number of subplots is equal to the number of sample sizes in 'data'. Each subplot shows the top 'num_important_proteins_show' most important proteins under one sample size. The Y-axis of each subplot is the protein name and the X-axis is the mean protein importance under that sample size.

A numeric value, which is the estimated optimal sample size per group for classification on the input dataset.
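
The effect of 'optimal_threshold' on the reported sample size can be illustrated with a small sketch. It assumes, purely for illustration, that the optimal size is the smallest simulated size after which the gain in mean accuracy per additional sample is at or below the threshold; the rule actually used by the package may differ. The accuracies are toy numbers and pick_sample_size is a hypothetical helper.

```r
# Toy inputs (hypothetical): simulated sample sizes and their mean accuracies
sample_sizes  <- c(10, 25, 50, 100)
mean_accuracy <- c(0.70, 0.78, 0.81, 0.82)

# Hypothetical rule: return the first size after which the accuracy gain
# per additional sample is at or below the threshold
pick_sample_size <- function(sizes, accuracy, threshold) {
  gain <- diff(accuracy) / diff(sizes)   # gain per extra sample between sizes
  flat <- which(gain <= threshold)
  if (length(flat) == 0) {
    return(sizes[length(sizes)])         # never flattens: use the largest size
  }
  sizes[flat[1]]
}

pick_sample_size(sample_sizes, mean_accuracy, threshold = 0.01)    # 10
pick_sample_size(sample_sizes, mean_accuracy, threshold = 0.001)   # 50
pick_sample_size(sample_sizes, mean_accuracy, threshold = 0.0001)  # 100
```

Under this assumed rule, a larger cutoff stops earlier and yields a smaller sample size, while a smaller cutoff demands a flatter accuracy curve and yields a larger one.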

Author(s)

Ting Huang, Meena Choi, Sumedh Sankhe, Olga Vitek.

Examples

library(MSstatsSampleSize)

data(OV_SRM_train)
data(OV_SRM_train_annotation)

# simulate different sample sizes
# 1) 10 biological replicates per group
# 2) 25 biological replicates per group
# 3) 50 biological replicates per group
# 4) 100 biological replicates per group
list_samples_per_group <- c(10, 25, 50, 100)

# save the simulation results under each sample size
multiple_sample_sizes <- list()
for(i in seq_along(list_samples_per_group)){
    # run simulation for each sample size
    simulated_datasets <- simulateDataset(data = OV_SRM_train,
                                          annotation = OV_SRM_train_annotation,
                                          log2Trans = FALSE,
                                          num_simulations = 10, # simulate 10 times
                                          samples_per_group = list_samples_per_group[i],
                                          protein_rank = "mean",
                                          protein_select = "high",
                                          protein_quantile_cutoff = 0.0,
                                          expected_FC = "data",
                                          list_diff_proteins =  NULL,
                                          simulate_valid = FALSE,
                                          valid_samples_per_group = 50)

    # run classification performance estimation for each sample size
    res <- designSampleSizeClassification(simulations = simulated_datasets,
                                          parallel = TRUE)

    # save results
    multiple_sample_sizes[[i]] <- res
}

## make the plots and save them to disk
designSampleSizeClassificationPlots(data = multiple_sample_sizes, save.pdf = TRUE)

## print only the predictive accuracy plot in the Plots pane
designSampleSizeClassificationPlots(data = multiple_sample_sizes,
                                    predictive_accuracy_plot = TRUE,
                                    protein_importance_plot = FALSE)

## print only the protein importance plot in the Plots pane
designSampleSizeClassificationPlots(data = multiple_sample_sizes,
                                    predictive_accuracy_plot = FALSE,
                                    protein_importance_plot = TRUE)

Vitek-Lab/MSstatsSampleSize documentation built on Aug. 28, 2020, 10:39 a.m.