View source: R/simulateDataset.R
Simulate datasets with the given number of biological replicates and proteins based on the input data
simulateDataset(
  data,
  annotation,
  num_simulations = 10,
  expected_FC = "data",
  list_diff_proteins = NULL,
  select_simulated_proteins = "proportion",
  protein_proportion = 1,
  protein_number = 1000,
  samples_per_group = 50,
  simulate_validation = FALSE,
  valid_samples_per_group = 50
)
data |
Protein abundance data matrix. Rows are proteins and columns are biological replicates (samples). |
annotation |
Group information for the samples in ‘data’. ‘BioReplicate’ for sample ID and ‘Condition’ for group information are required. The ‘BioReplicate’ values should match the column names of ‘data’. |
num_simulations |
Number of times to repeat simulation experiments (Number of simulated datasets). Default is 10. |
expected_FC |
Expected fold change of proteins. The first option (default) is "data", which estimates the fold changes directly from the input ‘data’. The second option is a named vector of predefined fold changes for the listed proteins. The vector names must match the unique values of Condition in ‘annotation’. One group must be chosen as the baseline and given fold change 1 in the vector. With this option, the user should also provide ‘list_diff_proteins’, the proteins expected to have fold changes greater than 1. Proteins not listed in ‘list_diff_proteins’ are expected to have fold change 1. |
list_diff_proteins |
Vector of protein names that are set to have fold changes greater than 1 between conditions. If the user selects ‘expected_FC = "data"’, this should be NULL. |
select_simulated_proteins |
The criterion for selecting the proteins to simulate from ‘data’. It can be 1) "proportion", a proportion of the total number of proteins in the input data, or 2) "number", a specific number of proteins. With "proportion", the user should provide a value for the ‘protein_proportion’ option. With "number", the user should provide a value for the ‘protein_number’ option. |
protein_proportion |
Proportion of the total number of proteins in the input data to simulate. For example, suppose the input data has 1,000 proteins and the user selects ‘protein_proportion=0.1’. Proteins are ranked in decreasing order of their mean abundance across all the samples. Then the top 1,000 * 0.1 = 100 proteins are selected for simulation. Default is 1.0, which means that all the proteins will be used. |
protein_number |
Number of proteins to simulate. For example, ‘protein_number=1000’. Proteins are ranked in decreasing order of their mean abundance across all the samples, and the top ‘protein_number’ proteins are selected for simulation. Default is 1000. |
samples_per_group |
Number of samples per group to simulate. Default is 50. |
simulate_validation |
Default is FALSE. If TRUE, simulate the validation set; otherwise, the input ‘data’ will be used as the validation set. |
valid_samples_per_group |
Number of validation samples per group to simulate. This option takes effect only when the user selects ‘simulate_validation=TRUE’. Default is 50. |
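The ranking rule described for ‘protein_proportion’ and ‘protein_number’ can be sketched in base R. This is only an illustration on a made-up matrix, not the function's internal code: `toy_data` and `select_top` are hypothetical names, and the use of `round()` to turn a proportion into a count is an assumption.

```r
# Hypothetical toy matrix: 10 proteins (rows) x 6 samples (columns).
set.seed(1)
toy_data <- matrix(rnorm(60, mean = 20, sd = 2), nrow = 10,
                   dimnames = list(paste0("P", 1:10), paste0("S", 1:6)))

# Hypothetical helper mirroring the documented rule: rank proteins by mean
# abundance across all samples (decreasing) and keep the top ones.
select_top <- function(x, protein_proportion = 1, protein_number = NULL) {
  n_keep <- if (is.null(protein_number)) {
    round(nrow(x) * protein_proportion)  # assumption: rounding to a count
  } else {
    min(protein_number, nrow(x))         # assumption: capped at available proteins
  }
  ranked <- order(rowMeans(x, na.rm = TRUE), decreasing = TRUE)
  x[ranked[seq_len(n_keep)], , drop = FALSE]
}

# 10 proteins * 0.3 = 3 proteins kept, ordered by decreasing mean abundance.
top3 <- select_top(toy_data, protein_proportion = 0.3)
dim(top3)
```

With the default ‘protein_proportion = 1’ the helper returns every protein, which matches the documented default behaviour.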
This function simulates datasets with
the given numbers of biological replicates and
proteins based on the input dataset.
The function fits an intensity-based linear model to the input data
to estimate the variance and mean abundance, using the estimateVar
function.
It then uses the variance components and mean abundance to simulate new training data
with the given sample size and protein number.
It outputs the number of simulated proteins,
a vector with the number of simulated samples in each condition,
the list of simulated training datasets,
the input preliminary dataset and
the (simulated) validation dataset.
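The idea of simulating new samples from estimated means and variances can be sketched in base R. This is a simplified stand-in, not the function's actual procedure: it assumes per-group means and a pooled residual standard deviation per protein in place of the estimateVar fit, and all object names (`toy_data`, `toy_group`, `sim_X`, `sim_Y`) are hypothetical.

```r
set.seed(42)
# Hypothetical input: 5 proteins x 8 samples, two conditions of 4 samples each.
toy_data <- matrix(rnorm(40, mean = 20, sd = 1), nrow = 5,
                   dimnames = list(paste0("P", 1:5), paste0("S", 1:8)))
toy_group <- factor(rep(c("control", "disease"), each = 4))

# Per-protein group means (proteins x groups).
group_means <- t(apply(toy_data, 1, function(v) tapply(v, toy_group, mean)))

# Per-protein pooled residual standard deviation around the group means.
resid_sd <- apply(toy_data, 1, function(v) {
  sqrt(sum((v - ave(v, toy_group))^2) / (length(v) - nlevels(toy_group)))
})

# Draw samples_per_group new samples per condition from N(mean, sd^2).
samples_per_group <- 50
sim_Y <- rep(levels(toy_group), each = samples_per_group)
sim_X <- sapply(sim_Y, function(g) {
  rnorm(nrow(toy_data), mean = group_means[, g], sd = resid_sd)
})
dimnames(sim_X) <- list(rownames(toy_data), paste0("sim", seq_along(sim_Y)))

dim(sim_X)    # proteins x simulated samples
table(sim_Y)  # simulated samples per condition
```

Repeating the last step ‘num_simulations’ times would give a list of simulated matrices and condition vectors, one pair per simulation.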
num_proteins is the number of simulated proteins. It is determined by the parameters protein_proportion or protein_number.
num_samples is a vector with the number of simulated samples in each condition. It should match the parameter samples_per_group.
input_X is the input protein abundance matrix ‘data’.
input_Y is the condition vector for the input ‘data’.
simulation_train_Xs is the list of simulated protein abundance matrices. Each element of the list represents one simulation.
simulation_train_Ys is the list of simulated condition vectors. Each element of the list represents one simulation.
valid_X is the validation protein abundance matrix, which is used for classification.
valid_Y is the condition vector of validation samples.
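A quick consistency check on the paired lists can be sketched as follows. The `mock_result` list is a hypothetical stand-in built by hand, not real output, and the matrix orientation in it is arbitrary; the hypothetical helper `check_simulations` therefore only checks that each condition vector's length equals one of the dimensions of its matrix.

```r
# Hypothetical stand-in for the returned list; values and matrix orientation
# are made up for illustration.
set.seed(7)
mock_result <- list(
  num_proteins = 4,
  num_samples = c(control = 3, disease = 3),
  simulation_train_Xs = replicate(2, matrix(rnorm(24), nrow = 6),
                                  simplify = FALSE),
  simulation_train_Ys = replicate(2, rep(c("control", "disease"), each = 3),
                                  simplify = FALSE)
)

# Hypothetical helper: every simulation should pair one abundance matrix with
# one condition vector whose length equals one of the matrix dimensions.
check_simulations <- function(res) {
  Xs <- res$simulation_train_Xs
  Ys <- res$simulation_train_Ys
  stopifnot(length(Xs) == length(Ys))
  all(mapply(function(x, y) length(y) %in% dim(x), Xs, Ys))
}

check_simulations(mock_result)
```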
Ting Huang, Meena Choi, Olga Vitek.
data(OV_SRM_train)
data(OV_SRM_train_annotation)
# num_simulations = 10: simulate 10 times
# expected_FC = "data": fold change estimated from OV_SRM_train
# select_simulated_proteins = "proportion":
#   select the simulated proteins based on the proportion of total proteins
# simulate_validation = FALSE: use input OV_SRM_train as validation set
# valid_samples_per_group = 50: 50 samples per condition
simulated_datasets <- simulateDataset(data = OV_SRM_train,
                                      annotation = OV_SRM_train_annotation,
                                      num_simulations = 10,
                                      expected_FC = "data",
                                      list_diff_proteins = NULL,
                                      select_simulated_proteins = "proportion",
                                      protein_proportion = 1.0,
                                      protein_number = 1000,
                                      samples_per_group = 50,
                                      simulate_validation = FALSE,
                                      valid_samples_per_group = 50)
# the number of simulated proteins
simulated_datasets$num_proteins
# a vector with the number of simulated samples in each condition
simulated_datasets$num_samples
# the list of simulated protein abundance matrices
# Each element of the list represents one simulation
head(simulated_datasets$simulation_train_Xs[[1]]) # first simulation
# the list of simulated condition vectors
# Each element of the list represents one simulation
head(simulated_datasets$simulation_train_Ys[[1]]) # first simulation
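The second form of ‘expected_FC’, a named vector of predefined fold changes, is described only in words in the arguments. A minimal sketch of building such a vector and checking it against the documented requirements follows. The annotation, condition names, fold change values and protein names are all hypothetical, and the final call is left as a comment because it depends on a matching `toy_data` matrix that is not defined here.

```r
# Hypothetical annotation with the two required columns.
toy_annotation <- data.frame(
  BioReplicate = paste0("S", 1:6),
  Condition = rep(c("control", "disease"), each = 3),
  stringsAsFactors = FALSE
)

# One fold change per condition; the baseline group gets fold change 1.
conditions <- unique(toy_annotation$Condition)
expected_FC <- setNames(c(1, 1.5), conditions)

# Hypothetical names of proteins expected to change between conditions.
list_diff_proteins <- c("P1", "P2")

# Checks mirroring the documented requirements: names match the unique
# Condition values, and at least one group is the baseline.
stopifnot(setequal(names(expected_FC), conditions),
          sum(expected_FC == 1) >= 1)
expected_FC

# These objects would then replace the defaults in the call, for example
# (toy_data is a hypothetical matrix with columns S1..S6):
# simulateDataset(data = toy_data,
#                 annotation = toy_annotation,
#                 expected_FC = expected_FC,
#                 list_diff_proteins = list_diff_proteins)
```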