optimize.sd_selection: Optimization of sd selection
In BioTIP: BioTIP: An R package for characterization of Biological Tipping-Point


View source: R/BioTIP_update_4_09282020_v3.R

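`optimize.sd_selection` filters a multi-state dataset by a per-state standard deviation cutoff and optimizes the selection by resampling: the filter is run `B` times, each time on a random subset of the samples, and only transcripts selected in a sufficient fraction of the runs are kept. By default, a cutoff value of 0.01 is used. Suggested if each state contains more than 10 samples.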
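As a rough illustration of the per-state filtering idea (not BioTIP's implementation), the base R sketch below computes each transcript's standard deviation within one state and keeps the top fraction. The helper name `top_sd_rows`, the rule of always keeping at least one row, and the toy data are hypothetical.

```r
# Hypothetical helper (not part of BioTIP): return the IDs of the rows of `mat`
# whose standard deviation across `samples` is in the top `frac` fraction.
top_sd_rows <- function(mat, samples, frac = 0.01) {
  sds <- apply(mat[, samples, drop = FALSE], 1, sd)  # per-transcript SD within this state
  n_keep <- max(1, ceiling(frac * nrow(mat)))        # keep at least one row (assumption)
  names(sort(sds, decreasing = TRUE))[seq_len(n_keep)]
}

# Toy data: 200 transcripts x 30 samples, 10 samples per state
set.seed(1)
toy <- matrix(rnorm(200 * 30), nrow = 200,
              dimnames = list(paste0("gene", 1:200), paste0("s", 1:30)))
states <- split(colnames(toy), rep(c("state1", "state2", "state3"), each = 10))

# One character vector of selected transcript IDs per state
selected <- lapply(states, function(s) top_sd_rows(toy, s, frac = 0.05))
```

Each element of `selected` can then be used to subset the rows of `toy`, which mirrors the idea of returning one filtered table per state.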

optimize.sd_selection(
  df,
  samplesL,
  B = 100,
  percent = 0.8,
  times = 0.8,
  cutoff = 0.01,
  method = c("other", "reference", "previous", "itself", "longitudinal reference"),
  control_df = NULL,
  control_samplesL = NULL
)

`df`	A dataframe of numerics. The rows and columns represent unique transcript IDs (geneID) and sample names, respectively.
`samplesL`	A list of n vectors, where n equals the number of states. Each vector gives the sample names in a state. Note that the sample names in the vectors have to be among the column names of the R object `df`.
`B`	An integer indicating the number of times to run this optimization, default 100.
`percent`	A numeric value indicating the percentage of samples that will be selected in each round of simulation.
`times`	A numeric value indicating the percentage of the `B` runs in which a transcript needs to be selected in order to be considered a stable signature.
`cutoff`	A positive numeric value. Default is 0.01. If < 1, the top fraction of transcripts is selected using the chosen selection method (`reference`, `other`, or `previous`), e.g. the default selects the top 1 percent of the transcripts.
`method`	One of `reference`, `other`, `previous`, `itself`, or `longitudinal reference`; the default is `other`. Partial matching is enabled. Some specific requirements for each option: `reference`, the reference has to be the first. `previous`, make sure `samplesL` is in the right order from benign to malign. `itself`, make sure the cutoff is smaller than 1. `longitudinal reference`, make sure `control_df` and `control_samplesL` are not NULL; `control_df` must have the same number of rows as `df`, and all transcripts in `df` must also be in `control_df`.
`control_df`	A count matrix with unique loci as row names and sample names of control samples as column names, only used for method `longitudinal reference`.
`control_samplesL`	A list of characters with stages as names of control samples, required for method 'longitudinal reference'.
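To make the interplay of `B`, `percent`, `times`, and `cutoff` concrete, here is a minimal base R sketch of resampling-based selection for a single state. It is an illustration under stated assumptions, not the package's code: the function name `stable_sd_rows`, the rounding of sample and transcript counts, and the use of a plain within-state SD (rather than any of the `method` options) are all hypothetical.

```r
# Hypothetical sketch (not BioTIP's implementation) of stability selection by SD
stable_sd_rows <- function(mat, samples, B = 100, percent = 0.8,
                           times = 0.8, cutoff = 0.01) {
  n_draw <- max(2, floor(percent * length(samples)))   # samples per round (rounding is an assumption)
  n_keep <- max(1, ceiling(cutoff * nrow(mat)))        # top fraction of transcripts per round
  hits <- setNames(integer(nrow(mat)), rownames(mat))  # selection count per transcript
  for (b in seq_len(B)) {
    sub <- sample(samples, n_draw)                     # subsample without replacement
    sds <- apply(mat[, sub, drop = FALSE], 1, sd)
    top <- names(sort(sds, decreasing = TRUE))[seq_len(n_keep)]
    hits[top] <- hits[top] + 1L
  }
  names(hits)[hits >= times * B]                       # selected in at least times * B rounds
}

# Toy data: 100 transcripts x 12 samples, with one clearly high-variance transcript
set.seed(42)
toy <- matrix(rnorm(100 * 12), nrow = 100,
              dimnames = list(paste0("gene", 1:100), paste0("s", 1:12)))
toy["gene7", ] <- toy["gene7", ] * 20
stable <- stable_sd_rows(toy, colnames(toy), B = 50, cutoff = 0.05)
```

In this sketch a larger `times` makes the filter stricter, while a larger `B` only makes the selection frequencies more precise at the cost of run time.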

A list of data frames of filtered transcripts: the transcripts with the highest standard deviation are selected from `df` based on the assigned cutoff value. Each resulting data frame is a subset of the raw input `df`.

Zhezhen Wang zhezhen@uchicago.edu

sd_selection

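# Toy count matrix: 2 loci x 30 samples (the 30 sampled values are recycled to fill it)
counts = matrix(sample(1:100, 30), 2, 30)
colnames(counts) = 1:30
row.names(counts) = paste0('loci', 1:2)
# Sample annotation: 10 samples in each of three states
cli = cbind(1:30, rep(c('state1', 'state2', 'state3'), each = 10))
colnames(cli) = c('samples', 'group')
# List of sample names per state, as required by `samplesL`
samplesL <- split(cli[, 1], f = cli[, 'group'])
# B = 3 keeps the example fast
test_sd_selection <- optimize.sd_selection(counts, samplesL, B = 3, cutoff = 0.01)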