sd_selection: Selecting Highly Oscillating Transcripts


View source: R/BioTIP_update_4_09282020_v3.R

Description

sd_selection pre-selects highly oscillating transcripts from the input dataset df. The dataset must contain multiple sample groups (or 'states'). For each state, the function filters the dataset using a cutoff on the standard deviation. The default cutoff is 0.01, which keeps the transcripts whose standard deviation ranks in the top 1 percent.

Usage

sd_selection(
  df,
  samplesL,
  cutoff = 0.01,
  method = c("other", "reference", "previous", "itself", "longitudinal reference"),
  control_df = NULL,
  control_samplesL = NULL
)

Arguments

df

A numeric matrix or data frame. The rows and columns represent unique transcript IDs (geneID) and sample names, respectively.

samplesL

A list of vectors, whose length is the number of states. Each vector gives the sample names in a state. Note that all sample names in these vectors must be among the column names of the R object 'df'.
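
As a minimal sketch of the expected shape, samplesL can be built by splitting sample names by state and then checked against the column names of df. The toy matrix, the state labels, and the variable names below are illustrative assumptions, not part of the package.

```r
# Toy expression matrix: 4 transcripts x 6 samples (values are arbitrary).
set.seed(1)
df <- matrix(rnorm(24), nrow = 4,
             dimnames = list(paste0("gene", 1:4), paste0("s", 1:6)))

# Hypothetical state label for each sample, given in column order.
state <- rep(c("state1", "state2"), each = 3)

# One character vector of sample names per state.
samplesL <- split(colnames(df), f = state)

# Every sample name must appear among the column names of df.
all_known <- all(unlist(samplesL) %in% colnames(df))
stopifnot(all_known)
```

Checking the names up front gives a clearer failure than an out-of-bounds subscript error raised later during column subsetting.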

cutoff

A positive numeric value. Default is 0.01. If < 1, the function automatically selects the top fraction x of transcripts using the chosen selection method (for example the reference, other stages, or the previous stage), e.g. by default it selects the top 1 percent of the transcripts.
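
To illustrate how a fractional cutoff translates into a number of transcripts, the sketch below ranks transcripts by their standard deviation within one state and keeps the top fraction. It is only an approximation of the idea under stated assumptions: it uses the plain within-state standard deviation, rounds the count up, and the helper name top_sd_fraction is hypothetical. The package's own scoring depends on the chosen method and may differ.

```r
# Hypothetical helper (not a BioTIP function): keep the top `cutoff`
# fraction of rows, ranked by standard deviation across `samples`.
top_sd_fraction <- function(df, samples, cutoff = 0.01) {
  stopifnot(cutoff > 0, cutoff < 1, all(samples %in% colnames(df)))
  sub <- df[, samples, drop = FALSE]
  sds <- apply(sub, 1, sd)                  # per-transcript SD in this state
  n_keep <- ceiling(cutoff * nrow(sub))     # assumption: round up
  keep <- order(sds, decreasing = TRUE)[seq_len(n_keep)]
  sub[keep, , drop = FALSE]
}

# Toy data: 200 transcripts x 6 samples (values are arbitrary).
set.seed(42)
df <- matrix(rnorm(200 * 6), nrow = 200,
             dimnames = list(paste0("gene", 1:200), paste0("s", 1:6)))

selected <- top_sd_fraction(df, samples = c("s1", "s2", "s3"), cutoff = 0.01)
dim(selected)  # 2 rows (1% of 200 transcripts), 3 columns
```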

method

Selection method, one of other, reference, previous, itself, or longitudinal reference. The default is other. Partial matching is enabled. Some specific requirements for each option:

  • reference, the reference state has to be the first in samplesL.

  • previous, make sure samplesL is ordered from benign to malignant.

  • itself, make sure the cutoff is smaller than 1.

  • longitudinal reference, make sure control_df and control_samplesL are not NULL. control_df must have the same number of rows as df, and all transcripts in df must also be in control_df.
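
Partial matching of this kind is what base R's match.arg() provides. The sketch below shows that behaviour with the same set of choices. Whether sd_selection resolves method in exactly this way is an assumption, and pick_method is a hypothetical stand-in rather than a package function.

```r
# Hypothetical stand-in that resolves `method` the way match.arg() does.
pick_method <- function(method = c("other", "reference", "previous",
                                   "itself", "longitudinal reference")) {
  match.arg(method)
}

pick_method()        # "other" (first choice is the default)
pick_method("prev")  # "previous"
pick_method("long")  # "longitudinal reference"
```

An abbreviation must identify a single choice. For example, "ref" resolves to "reference" because match.arg() matches only from the start of each choice.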

control_df

A count matrix with unique loci as row names and the names of the control samples as column names. Only used for method 'longitudinal reference'.

control_samplesL

A list of character vectors giving the names of the control samples, with stages as the list names. Required for method 'longitudinal reference'.
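
The requirements for method 'longitudinal reference' can be checked before calling the function. The sketch below is a tentative pre-flight check in base R. The helper name check_control_inputs and the toy objects are hypothetical, the third check (control sample names present in control_df) is an assumption drawn from the argument descriptions, and the package may perform its own validation.

```r
# Hypothetical helper (not a BioTIP function): report whether the control
# inputs satisfy the documented requirements.
check_control_inputs <- function(df, control_df, control_samplesL) {
  stopifnot(!is.null(control_df), !is.null(control_samplesL))
  c(same_row_count        = nrow(control_df) == nrow(df),
    transcripts_covered   = all(rownames(df) %in% rownames(control_df)),
    control_samples_known = all(unlist(control_samplesL) %in% colnames(control_df)))
}

# Toy inputs: 5 transcripts, 4 case samples, 4 control samples.
genes <- paste0("gene", 1:5)
df <- matrix(1:20, nrow = 5, dimnames = list(genes, paste0("case", 1:4)))
control_df <- matrix(21:40, nrow = 5, dimnames = list(genes, paste0("ctrl", 1:4)))
control_samplesL <- list(stage1 = c("ctrl1", "ctrl2"),
                         stage2 = c("ctrl3", "ctrl4"))

checks <- check_control_inputs(df, control_df, control_samplesL)
checks  # named logical vector, one entry per requirement
```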

Value

sd_selection() returns a list of data frames, whose length is the number of states. The rows in each data frame are the filtered transcripts with the highest standard deviation, selected from df based on the assigned cutoff value. Each resulting data frame is a subset of the raw input df, with the sample IDs of the corresponding state as columns.

Author(s)

Zhezhen Wang zhezhen@uchicago.edu

See Also

optimize.sd_selection

Examples

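library(BioTIP)
# Toy count matrix: 2 loci (rows) x 9 samples (columns)
counts <- matrix(sample(1:100, 18), 2, 9)
colnames(counts) <- 1:9
row.names(counts) <- c('loci1', 'loci2')
# Sample annotation: three states with three samples each
cli <- cbind(1:9, rep(c('state1', 'state2', 'state3'), each = 3))
colnames(cli) <- c('samples', 'group')
# List of sample names, one vector per state
samplesL <- split(cli[, 1], f = cli[, 'group'])
# Filter each state with cutoff 0.01 (method defaults to "other")
test_sd_selection <- sd_selection(counts, samplesL, 0.01)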

BioTIP documentation built on Nov. 8, 2020, 6:27 p.m.