optimize.sd_selection: Optimization of sd selection


View source: R/BioTIP_update_4_09282020_v3.R

Description

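optimize.sd_selection filters a multi-state dataset based on a cutoff value for the standard deviation per state, and optimizes the selection by repeating it B times on random subsets of the samples and keeping the transcripts that are selected consistently. By default, a cutoff value of 0.01 is used. Suggested if each state contains more than 10 samples.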

Usage

optimize.sd_selection(
  df,
  samplesL,
  B = 1000,
  percent = 0.8,
  times = 0.8,
  cutoff = 0.01,
  method = c("other", "reference", "previous", "itself", "longitudinal reference"),
  control_df = NULL,
  control_samplesL = NULL
)

Arguments

df

A dataframe of numerics. The rows and columns represent unique transcript IDs (geneID) and sample names, respectively.

samplesL

A list of n vectors, where n equals to the number of states. Each vector gives the sample names in a state. Note that the vectors (sample names) has to be among the column names of the R object 'df'.
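
A quick consistency check before calling the function can catch sample names that are missing from df. The following is a minimal sketch in base R; the toy objects and their values are made up for illustration and are not part of BioTIP.

```r
# Toy expression table: 3 transcripts x 6 samples (values are arbitrary).
df <- data.frame(matrix(1:18, nrow = 3,
                        dimnames = list(paste0("gene", 1:3), paste0("s", 1:6))))

# One character vector of sample names per state (two states here).
samplesL <- list(state1 = c("s1", "s2", "s3"),
                 state2 = c("s4", "s5", "s6"))

# Every sample name listed in samplesL must be a column name of df.
missing_samples <- setdiff(unlist(samplesL), colnames(df))
stopifnot(length(missing_samples) == 0)
```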

B

An integer indicating number of times to run this optimization, default 1000.

percent

A numeric value indicating the fraction of samples selected in each round of simulation. Default is 0.8.

times

A numeric value indicating the fraction of the B runs in which a transcript needs to be selected in order to be considered a stable signature. Default is 0.8.
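
How B, percent and times interact can be illustrated with a small resampling loop. This is only a sketch of the general idea in base R, not BioTIP's implementation: it handles a single state, and the per-round selection size `top_n` and the toy data are assumptions made for the example.

```r
set.seed(1)
# Toy data: 50 transcripts x 20 samples from one state (arbitrary values).
expr <- matrix(rnorm(50 * 20), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("s", 1:20)))
# Give the first five transcripts a much larger spread so they stand out.
expr[1:5, ] <- expr[1:5, ] * 10

B <- 100        # number of resampling rounds
percent <- 0.8  # fraction of samples drawn in each round
times <- 0.8    # fraction of rounds a transcript must be selected in
top_n <- 5      # hypothetical number of transcripts picked per round

# Count how often each transcript ranks among the top_n by standard deviation.
hits <- setNames(integer(nrow(expr)), rownames(expr))
for (b in seq_len(B)) {
  cols <- sample(colnames(expr), size = floor(percent * ncol(expr)))
  sds <- apply(expr[, cols, drop = FALSE], 1, sd)
  picked <- names(sort(sds, decreasing = TRUE))[seq_len(top_n)]
  hits[picked] <- hits[picked] + 1L
}

# Keep only transcripts selected in at least times * B rounds.
stable <- names(hits)[hits >= times * B]
```

With percent = 0.8 each round sees 16 of the 20 samples, and with times = 0.8 a transcript has to be picked in at least 80 of the 100 rounds to count as stable.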

cutoff

A positive numeric value. Default is 0.01. If < 1, the function automatically selects the top x percent of transcripts using the chosen selection method (e.g. reference, other or previous); for example, by default it selects the top 1 percent of the transcripts.
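
As a rough illustration of a fractional cutoff, the sketch below turns cutoff = 0.01 into a number of transcripts and picks those with the largest standard deviation. The rounding rule and the toy data are assumptions for this example; the package may resolve the fraction differently.

```r
set.seed(2)
# Toy data: 200 transcripts x 12 samples (arbitrary values).
expr <- matrix(rnorm(200 * 12), nrow = 200,
               dimnames = list(paste0("gene", 1:200), NULL))

cutoff <- 0.01
# Assumed rule: a cutoff below 1 is read as a fraction of all transcripts.
n_keep <- round(cutoff * nrow(expr))

# Rank transcripts by standard deviation and keep the top n_keep.
sds <- apply(expr, 1, sd)
top <- names(sort(sds, decreasing = TRUE))[seq_len(n_keep)]
```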

method

Selection of methods from reference, other, previous, itself, or longitudinal reference; default uses other. Partial match enabled. Some specific requirements for each option:

  • reference, the reference has to be the first.

  • previous, make sure samplesL is in the right order from benign to malignant.

  • itself, make sure the cutoff is smaller than 1.

  • longitudinal reference, make sure control_df and control_samplesL are not NULL. The number of rows of control_df must be the same as that of df, and all transcripts in df must also be in control_df.
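
Partial matching of this kind is commonly done with base R's match.arg(). The sketch below shows how abbreviated method names would resolve under that assumption; `pick_method` is a hypothetical helper written for this example, not a BioTIP function.

```r
methods <- c("other", "reference", "previous", "itself", "longitudinal reference")

# Hypothetical helper: resolve a possibly abbreviated method name.
pick_method <- function(method = methods) match.arg(method, methods)

default_method <- pick_method()        # no argument: first choice, "other"
ref_method <- pick_method("ref")       # resolves to "reference"
long_method <- pick_method("long")     # resolves to "longitudinal reference"
```

An abbreviation has to identify exactly one option; an ambiguous or unknown string makes match.arg() stop with an error.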

control_df

A count matrix with unique loci as row names and the sample names of control samples as column names; only used for method 'longitudinal reference'.

control_samplesL

A list of character vectors, named by stage, giving the names of the control samples; required for method 'longitudinal reference'.
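
The requirements for method 'longitudinal reference' can be checked up front. The sketch below uses toy objects with arbitrary values and hypothetical stage names; the last condition (control sample names being column names of control_df) is an assumption made by analogy with samplesL and df.

```r
# Toy case data and control data (values are arbitrary).
df <- matrix(1:6, nrow = 3,
             dimnames = list(paste0("gene", 1:3), c("s1", "s2")))
control_df <- matrix(1:12, nrow = 3,
                     dimnames = list(paste0("gene", c(3, 1, 2)), paste0("c", 1:4)))
# Control sample names grouped by stage (stage names are hypothetical).
control_samplesL <- list(stage1 = c("c1", "c2"), stage2 = c("c3", "c4"))

ok <- !is.null(control_df) && !is.null(control_samplesL) &&
  nrow(control_df) == nrow(df) &&                      # same number of rows
  all(rownames(df) %in% rownames(control_df)) &&       # all transcripts present
  all(unlist(control_samplesL) %in% colnames(control_df))  # assumed requirement
```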

Value

A list of dataframes of filtered transcripts: those with the highest standard deviation, selected from df based on the assigned cutoff value. Each resulting dataframe is a subset of the raw input df.

Author(s)

Zhezhen Wang zhezhen@uchicago.edu

See Also

sd_selection

Examples

library(BioTIP)
# Toy count matrix: 2 loci x 30 samples
counts = matrix(sample(1:100, 60), 2, 30)
colnames(counts) = 1:30
row.names(counts) = paste0('loci', 1:2)
# Assign the 30 samples to three states of 10 samples each
cli = cbind(1:30, rep(c('state1', 'state2', 'state3'), each = 10))
colnames(cli) = c('samples', 'group')
samplesL <- split(cli[, 1], f = cli[, 'group'])
# A small B keeps the example fast
test_sd_selection <- optimize.sd_selection(counts, samplesL, B = 3, cutoff = 0.01)

BioTIP documentation built on Nov. 8, 2020, 6:27 p.m.