simulateDataset: Simulate datasets with the given number of biological...


View source: R/simulateDataset.R

Description

Simulate datasets with the given number of biological replicates and proteins based on the input data

Usage

simulateDataset(
  data,
  annotation,
  num_simulations = 10,
  expected_FC = "data",
  list_diff_proteins = NULL,
  select_simulated_proteins = "proportion",
  protein_proportion = 1,
  protein_number = 1000,
  samples_per_group = 50,
  simulate_validation = FALSE,
  valid_samples_per_group = 50
)

Arguments

data

Protein abundance data matrix. Rows are proteins and columns are biological replicates (samples).

annotation

Group information for samples in data. ‘BioReplicate’ for sample ID and ‘Condition’ for group information are required. ‘BioReplicate’ information should match with column names of ‘data’.

num_simulations

Number of times to repeat simulation experiments (Number of simulated datasets). Default is 10.

expected_FC

Expected fold change of proteins. The first option (default) is "data", indicating that the fold changes are estimated directly from the input ‘data’. The second option is a named vector of predefined fold changes for the listed proteins. The vector names must match the unique values of Condition in ‘annotation’. One group must be chosen as the baseline and given fold change 1 in the vector. With this option, the user should also provide ‘list_diff_proteins’, the proteins expected to have fold changes greater than 1. Proteins that are not in ‘list_diff_proteins’ are expected to have fold change 1.

list_diff_proteins

Vector of protein names which are set to have fold changes greater than 1 between conditions. If the user selects ‘expected_FC = "data"’, this should be NULL.

select_simulated_proteins

The criterion for selecting the simulated proteins from ‘data’. It can be 1) "proportion", a proportion of the total number of proteins in the input data, or 2) "number", a specific number of proteins. With "proportion", the user should provide a value for the ‘protein_proportion’ option. With "number", the user should provide a value for the ‘protein_number’ option.

protein_proportion

Proportion of the total number of proteins in the input data to simulate. For example, suppose the input data has 1,000 proteins and the user selects ‘protein_proportion=0.1’. Proteins are ranked in decreasing order of their mean abundance across all the samples, and the top 1,000 * 0.1 = 100 proteins are selected to simulate. Default is 1.0, which means that all the proteins will be used.

protein_number

Number of proteins to simulate. For example, ‘protein_number=1000’. Proteins are ranked in decreasing order of their mean abundance across all the samples, and the top ‘protein_number’ proteins are selected to simulate. Default is 1000.

samples_per_group

Number of samples per group to simulate. Default is 50.

simulate_validation

Default is FALSE. If TRUE, simulate the validation set; otherwise, the input ‘data’ will be used as the validation set.

valid_samples_per_group

Number of validation samples per group to simulate. This option applies only when the user selects ‘simulate_validation=TRUE’. Default is 50.
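The input requirements above (matching sample IDs, a named fold-change vector with a baseline of 1, and a list of differential proteins) can be checked before calling the function. The following is a minimal sketch in base R using toy stand-in objects; the protein names, sample names and condition labels are hypothetical and are not taken from the package data.

```r
# Toy stand-ins for the real inputs (hypothetical values, not package data).
set.seed(1)
data <- matrix(rnorm(6 * 4, mean = 20), nrow = 6,
               dimnames = list(paste0("P", 1:6), paste0("S", 1:4)))
annotation <- data.frame(
  BioReplicate = paste0("S", 1:4),
  Condition = c("control", "control", "disease", "disease"),
  stringsAsFactors = FALSE
)

# 'BioReplicate' and 'Condition' are required, and 'BioReplicate'
# should match the column names of 'data'.
stopifnot(all(c("BioReplicate", "Condition") %in% colnames(annotation)),
          setequal(annotation$BioReplicate, colnames(data)))

# Predefined fold changes: names match the unique conditions,
# and the baseline group has fold change 1.
expected_FC <- c(control = 1, disease = 1.5)
stopifnot(setequal(names(expected_FC), unique(annotation$Condition)),
          any(expected_FC == 1))

# Proteins expected to change; all others are treated as fold change 1.
list_diff_proteins <- c("P1", "P3")
stopifnot(all(list_diff_proteins %in% rownames(data)))

# These objects would then be passed on, for example:
# simulateDataset(data = data, annotation = annotation,
#                 expected_FC = expected_FC,
#                 list_diff_proteins = list_diff_proteins)
```

Running the checks first makes a mismatch between ‘annotation’ and ‘data’ fail early with a clear message rather than inside the simulation.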

Details

This function simulates datasets with the given numbers of biological replicates and proteins, based on the input dataset. It fits an intensity-based linear model to the input data, using the estimateVar function, to obtain the variance and mean abundance. It then uses the variance components and mean abundance to simulate new training data with the given sample size and number of proteins. It outputs the number of simulated proteins, a vector with the number of simulated samples in each condition, the list of simulated training datasets, the input preliminary dataset and the (simulated) validation dataset.
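The overall flow (rank proteins by mean abundance, estimate per-group means and a variance, then draw new samples) can be illustrated in base R. This is only a rough sketch under stated assumptions, not the package's implementation: plain per-group summaries stand in for the linear model fitted by estimateVar, and normal draws are assumed for the sampling step, which the text above does not specify. All object names and values are hypothetical.

```r
set.seed(42)
# Hypothetical abundance matrix: 10 proteins (rows) x 6 samples (columns).
data <- matrix(rnorm(60, mean = rep(seq(15, 24), times = 6), sd = 0.1),
               nrow = 10,
               dimnames = list(paste0("P", 1:10), paste0("S", 1:6)))
group <- rep(c("control", "disease"), each = 3)
groups <- unique(group)

# Rank proteins by mean abundance across all samples (decreasing)
# and keep the top proportion: 0.3 of 10 proteins gives 3 proteins.
protein_proportion <- 0.3
n_keep <- round(nrow(data) * protein_proportion)
ranked <- names(sort(rowMeans(data), decreasing = TRUE))
selected <- ranked[seq_len(n_keep)]

# ASSUMPTION: per-group means and a pooled residual SD stand in for
# the estimates that the fitted linear model would provide.
group_means <- sapply(groups, function(g)
  rowMeans(data[selected, group == g, drop = FALSE]))
resid <- data[selected, ] - group_means[, group]
sigma <- sqrt(rowSums(resid^2) / (ncol(data) - length(groups)))

# ASSUMPTION: new samples are drawn from a normal distribution.
samples_per_group <- 5
simulate_group <- function(g) {
  sapply(seq_len(samples_per_group), function(i)
    rnorm(length(selected), mean = group_means[, g], sd = sigma))
}
sim_X <- do.call(cbind, lapply(groups, simulate_group))
rownames(sim_X) <- selected
sim_Y <- rep(groups, each = samples_per_group)

dim(sim_X)    # selected proteins x simulated samples
table(sim_Y)  # simulated samples per group
```

Repeating the final sampling step num_simulations times would give a list of simulated matrices and condition vectors analogous to the training sets described here.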

Value

num_proteins is the number of simulated proteins. It is determined by the parameter protein_proportion or protein_number.

num_samples is a vector with the number of simulated samples in each condition. It should be the same as the parameter samples_per_group.

input_X is the input protein abundance matrix ‘data’.

input_Y is the condition vector for the input ‘data’.

simulation_train_Xs is the list of simulated protein abundance matrices. Each element of the list represents one simulation.

simulation_train_Ys is the list of simulated condition vectors. Each element of the list represents one simulation.

valid_X is the validation protein abundance matrix, which is used for classification.

valid_Y is the condition vector of validation samples.
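Because simulation_train_Xs and simulation_train_Ys are parallel lists with one element per simulation, they can be walked together by index. The sketch below uses a small hypothetical stand-in list that only borrows the element names listed above; a real object would come from simulateDataset(), and the matrix orientation shown here is illustrative only.

```r
# Hypothetical stand-in with the same element names as the returned list.
set.seed(7)
make_X <- function() matrix(rnorm(4 * 3), nrow = 4)  # orientation illustrative
make_Y <- function() factor(c("control", "control", "disease", "disease"))
sim <- list(
  num_proteins = 3,
  num_samples = c(control = 2, disease = 2),
  simulation_train_Xs = list(make_X(), make_X()),
  simulation_train_Ys = list(make_Y(), make_Y())
)

# One row per simulation: matrix dimensions and size of the condition vector.
summary_tab <- do.call(rbind, lapply(seq_along(sim$simulation_train_Xs),
  function(i) {
    X <- sim$simulation_train_Xs[[i]]
    Y <- sim$simulation_train_Ys[[i]]
    data.frame(simulation = i,
               n_row = nrow(X),
               n_col = ncol(X),
               n_samples = length(Y),
               n_conditions = length(table(Y)))
  }))
summary_tab
```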

Author(s)

Ting Huang, Meena Choi, Olga Vitek.

Examples

data(OV_SRM_train)
data(OV_SRM_train_annotation)

# num_simulations = 10: simulate 10 times
# expected_FC = "data": fold change estimated from OV_SRM_train
# select_simulated_proteins = "proportion":
# select the simulated proteins based on the proportion of total proteins
# simulate_validation = FALSE: use input OV_SRM_train as validation set
# valid_samples_per_group = 50: 50 samples per condition
simulated_datasets <- simulateDataset(data = OV_SRM_train,
                                      annotation = OV_SRM_train_annotation,
                                      num_simulations = 10,
                                      expected_FC = "data",
                                      list_diff_proteins =  NULL,
                                      select_simulated_proteins = "proportion",
                                      protein_proportion = 1.0,
                                      protein_number = 1000,
                                      samples_per_group = 50,
                                      simulate_validation = FALSE,
                                      valid_samples_per_group = 50)

# the number of simulated proteins
simulated_datasets$num_proteins

# a vector with the number of simulated samples in each condition
simulated_datasets$num_samples

# the list of simulated protein abundance matrices
# Each element of the list represents one simulation
head(simulated_datasets$simulation_train_Xs[[1]]) # first simulation

# the list of simulated condition vectors
# Each element of the list represents one simulation
head(simulated_datasets$simulation_train_Ys[[1]]) # first simulation

MSstatsSampleSize documentation built on Nov. 8, 2020, 4:53 p.m.