simulateDataset: Simulate datasets with the given number of biological...
In MSstatsSampleSize: Simulation tool for optimal design of high-dimensional MS-based proteomics experiment


Simulate datasets with the given number of biological replicates and proteins based on the input data

simulateDataset(
  data,
  annotation,
  num_simulations = 10,
  expected_FC = "data",
  list_diff_proteins = NULL,
  select_simulated_proteins = "proportion",
  protein_proportion = 1,
  protein_number = 1000,
  samples_per_group = 50,
  simulate_validation = FALSE,
  valid_samples_per_group = 50
)

`data`	Protein abundance data matrix. Rows are proteins and columns are biological replicates (samples).
`annotation`	Group information for samples in data. ‘BioReplicate’ for sample ID and ‘Condition’ for group information are required. ‘BioReplicate’ information should match with column names of ‘data’.
`num_simulations`	Number of times to repeat simulation experiments (Number of simulated datasets). Default is 10.
`expected_FC`	Expected fold change of proteins. The first option (default) is "data", which estimates the fold changes directly from the input ‘data’. The second option is a named vector of predefined fold changes. The vector names must match the unique values of Condition in ‘annotation’, and one group must be chosen as the baseline with fold change 1. With this option the user must also provide ‘list_diff_proteins’, the proteins expected to have fold changes greater than 1. Proteins not listed in ‘list_diff_proteins’ are expected to have fold change 1.
`list_diff_proteins`	Vector of protein names that are set to have fold changes greater than 1 between conditions. If the user selects ‘expected_FC = "data"’, this should be NULL.
`select_simulated_proteins`	The criterion for selecting which proteins in ‘data’ to simulate. It can be 1) "proportion" of the total number of proteins in the input data or 2) "number" to specify the number of proteins. "proportion" requires a value for the ‘protein_proportion’ option. "number" requires a value for the ‘protein_number’ option.
`protein_proportion`	Proportion of the total number of proteins in the input data to simulate. For example, suppose the input data has 1,000 proteins and the user selects ‘protein_proportion=0.1’. Proteins are ranked in decreasing order of their mean abundance across all samples, and the top 1,000 * 0.1 = 100 proteins are selected for simulation. Default is 1.0, which means that all proteins are used.
`protein_number`	Number of proteins to simulate. For example, ‘protein_number=1000’. Proteins are ranked in decreasing order based on their mean abundance across all the samples and top ‘protein_number’ proteins will be selected to simulate. Default is 1000.
`samples_per_group`	Number of samples per group to simulate. Default is 50.
`simulate_validation`	Default is FALSE. If TRUE, simulate the validation set; otherwise, the input ‘data’ will be used as the validation set.
`valid_samples_per_group`	Number of validation samples per group to simulate. This option works only when user selects ‘simulate_validation=TRUE’. Default is 50.
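
The selection rule described for ‘protein_proportion’ and ‘protein_number’ can be sketched in base R as follows. This is only an illustration on a made-up matrix: the helper select_top_proteins, the toy values, the use of round() to turn a proportion into a count, and the cap at the number of available proteins are all assumptions, not the package's internal code.

```r
# Toy abundance matrix: 5 proteins (rows) x 4 samples (columns); values are made up.
toy_data <- matrix(
  c(10.1, 14.2, 12.0, 18.3, 16.1,
    10.3, 13.9, 12.2, 18.0, 15.8,
     9.8, 14.0, 11.9, 18.1, 16.2,
    10.0, 14.1, 12.1, 17.9, 16.0),
  nrow = 5,
  dimnames = list(paste0("P", 1:5), paste0("S", 1:4))
)

# Annotation with the two required columns; BioReplicate must match colnames(toy_data).
annotation <- data.frame(
  BioReplicate = colnames(toy_data),
  Condition = c("control", "control", "disease", "disease")
)
stopifnot(
  all(c("BioReplicate", "Condition") %in% colnames(annotation)),
  setequal(annotation$BioReplicate, colnames(toy_data))
)

# Hypothetical helper: rank proteins by mean abundance and keep the top ones.
select_top_proteins <- function(data,
                                select_simulated_proteins = c("proportion", "number"),
                                protein_proportion = 1,
                                protein_number = 1000) {
  select_simulated_proteins <- match.arg(select_simulated_proteins)
  n_total <- nrow(data)
  n_keep <- if (select_simulated_proteins == "proportion") {
    round(n_total * protein_proportion)  # assumption: rounding rule
  } else {
    min(protein_number, n_total)         # assumption: cap at available proteins
  }
  mean_abundance <- rowMeans(data, na.rm = TRUE)
  ranked <- names(sort(mean_abundance, decreasing = TRUE))
  head(ranked, n_keep)
}

top_by_proportion <- select_top_proteins(toy_data, "proportion", protein_proportion = 0.4)
top_by_number <- select_top_proteins(toy_data, "number", protein_number = 3)
```

With the default ‘protein_proportion = 1’, the helper keeps every protein, which mirrors the documented default behaviour.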

This function simulates datasets with the given numbers of biological replicates and proteins based on the input dataset. It fits an intensity-based linear model to the input data, using the estimateVar function, to obtain the variance and mean abundance. It then uses the variance components and mean abundance to simulate new training data with the given sample size and protein number. It outputs the number of simulated proteins, a vector with the number of simulated samples in each condition, the list of simulated training datasets, the input preliminary dataset and the (simulated) validation dataset.
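
The estimate-then-simulate idea can be illustrated with a small base R sketch. It is a simplified stand-in, not the package's estimateVar or its simulation code: the helpers estimate_mean_sd and simulate_one, the toy values, the pooled residual standard deviation, and the normal draws are all assumptions made for illustration.

```r
set.seed(42)

# Toy input: 2 proteins (rows) x 6 samples (columns), 3 samples per condition.
groups <- c("control", "control", "control", "disease", "disease", "disease")
toy_data <- rbind(
  P1 = c(10.0, 10.2,  9.8, 12.1, 11.9, 12.0),
  P2 = c(15.1, 14.9, 15.0, 15.0, 15.2, 14.8)
)
colnames(toy_data) <- paste0("S", 1:6)

# Hypothetical helper: per-protein group means and a pooled residual SD.
estimate_mean_sd <- function(data, groups) {
  conditions <- unique(groups)
  # mu is a proteins x conditions matrix of mean abundances
  mu <- sapply(conditions, function(g) {
    rowMeans(data[, groups == g, drop = FALSE], na.rm = TRUE)
  })
  resid <- data - mu[, groups, drop = FALSE]
  df <- ncol(data) - length(conditions)
  sigma <- sqrt(rowSums(resid^2, na.rm = TRUE) / df)
  list(mu = mu, sigma = sigma)
}

# Hypothetical helper: draw one simulated dataset from the estimates.
simulate_one <- function(params, samples_per_group) {
  conditions <- colnames(params$mu)
  n_prot <- nrow(params$mu)
  X <- do.call(cbind, lapply(conditions, function(g) {
    # Means and SDs recycle down each column, so row i uses protein i's values.
    matrix(rnorm(n_prot * samples_per_group,
                 mean = params$mu[, g], sd = params$sigma),
           nrow = n_prot)
  }))
  rownames(X) <- rownames(params$mu)
  list(X = X, Y = rep(conditions, each = samples_per_group))
}

params <- estimate_mean_sd(toy_data, groups)
sim <- simulate_one(params, samples_per_group = 5)
dim(sim$X)    # proteins x simulated samples
table(sim$Y)  # simulated samples per condition
```

Repeating simulate_one several times would give a list of simulated datasets, analogous in spirit to ‘num_simulations’. The sketch keeps proteins in rows for readability; the orientation of the matrices returned by the package is not stated here.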

num_proteins is the number of simulated proteins. It is determined by the parameter protein_proportion or protein_number.

num_samples is a vector with the number of simulated samples in each condition. It should match the parameter samples_per_group.

input_X is the input protein abundance matrix ‘data’.

input_Y is the condition vector for the input ‘data’.

simulation_train_Xs is the list of simulated protein abundance matrices. Each element of the list represents one simulation.

simulation_train_Ys is the list of simulated condition vectors. Each element of the list represents one simulation.

valid_X is the validation protein abundance matrix, which is used for classification.

valid_Y is the condition vector of validation samples.

Ting Huang, Meena Choi, Olga Vitek.

data(OV_SRM_train)
data(OV_SRM_train_annotation)

# num_simulations = 10: simulate 10 times
# expected_FC = "data": fold change estimated from OV_SRM_train
# select_simulated_proteins = "proportion":
# select the simulated proteins based on the proportion of total proteins
# simulate_validation = FALSE: use input OV_SRM_train as validation set
# valid_samples_per_group = 50: 50 validation samples per condition
# (only used when simulate_validation = TRUE)
simulated_datasets <- simulateDataset(data = OV_SRM_train,
                                      annotation = OV_SRM_train_annotation,
                                      num_simulations = 10,
                                      expected_FC = "data",
                                      list_diff_proteins =  NULL,
                                      select_simulated_proteins = "proportion",
                                      protein_proportion = 1.0,
                                      protein_number = 1000,
                                      samples_per_group = 50,
                                      simulate_validation = FALSE,
                                      valid_samples_per_group = 50)

# the number of simulated proteins
simulated_datasets$num_proteins

# a vector with the number of simulated samples in each condition
simulated_datasets$num_samples

# the list of simulated protein abundance matrices
# Each element of the list represents one simulation
head(simulated_datasets$simulation_train_Xs[[1]]) # first simulation

# the list of simulated condition vectors
# Each element of the list represents one simulation
head(simulated_datasets$simulation_train_Ys[[1]]) # first simulation
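
The call above uses expected_FC = "data". For the other documented option, a named vector of predefined fold changes, the inputs could be prepared and sanity-checked roughly as sketched below. The condition labels, protein IDs, fold-change values and the helper check_fc_inputs are hypothetical and only encode the requirements stated in the Arguments section. The requirement that listed proteins appear among the protein IDs is an additional assumption.

```r
# Hypothetical annotation and protein IDs, used only for illustration.
annotation <- data.frame(
  BioReplicate = paste0("S", 1:4),
  Condition = c("control", "control", "disease", "disease")
)
protein_ids <- paste0("P", 1:5)

# Named fold-change vector: names are the conditions, the baseline has value 1.
expected_FC <- c(control = 1, disease = 1.5)

# Proteins expected to change between conditions (all others keep fold change 1).
list_diff_proteins <- c("P2", "P4")

# Hypothetical helper: stop with an error if the documented requirements are violated.
check_fc_inputs <- function(expected_FC, list_diff_proteins, annotation, protein_ids) {
  conditions <- unique(as.character(annotation$Condition))
  stopifnot(
    is.numeric(expected_FC),
    setequal(names(expected_FC), conditions),  # names must match Condition values
    any(expected_FC == 1),                     # one group is the baseline
    !is.null(list_diff_proteins),              # required with a predefined vector
    all(list_diff_proteins %in% protein_ids)   # assumption: names exist in the data
  )
  invisible(TRUE)
}

fc_ok <- check_fc_inputs(expected_FC, list_diff_proteins, annotation, protein_ids)

# The checked objects would then be passed to simulateDataset() through its
# expected_FC and list_diff_proteins arguments.
```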

MSstatsSampleSize documentation built on Nov. 8, 2020, 4:53 p.m.
