Filter: Filter genes
In metaOmic/preproc: Preprocessing of Normalized Gene Expression Data


Filter genes with low means and low variances.

Filter(datasets, data.type, del.perc = c(0.3, 0.3), threshold = 1)

`datasets`	a list of gene expression matrices, one matrix per study. Each row of a matrix is one gene and each column is one sample. The row names are gene symbols.
`data.type`	a character string to specify the type of data in `datasets`. It should be `"microarray"`, `"RNAseq-FPKM"`, or `"RNAseq-count"`.
`del.perc`	a numeric vector with two elements, which specify the proportion of genes to be filtered out in each of the two sequential steps of gene filtering when `data.type` is `"microarray"` or `"RNAseq-FPKM"`. The default is `c(0.3, 0.3)`. See Details.
`threshold`	a numeric value to specify the threshold when `data.type` is `"RNAseq-count"`. The default is `1`. See Details.

When `data.type` is `"microarray"` or `"RNAseq-FPKM"`, two sequential steps of gene filtering are performed. The first step removes genes with very low expression, identified by small average expression values across studies. Specifically, the mean intensity of each gene across all samples is calculated within each study, and the genes are ranked by these means within that study. The ranks of each gene are then summed across studies, and the genes in the lowest `del.perc[1]` proportion of rank sums (30% with the default of 0.3) are considered unexpressed (i.e. small expression intensities) and are filtered out. The second step removes non-informative genes (those with small variation) in the same way, with the standard deviation replacing the mean intensity: genes in the lowest `del.perc[2]` proportion of rank sums of standard deviations are filtered out.
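
The two rank-sum steps described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: `rank_sum_filter()` is a hypothetical helper, the toy matrices are made up, all studies are assumed to share the same genes in the same row order, and the rounding of the number of dropped genes is a guess.

```r
# Illustrative sketch only; rank_sum_filter() is a hypothetical helper,
# not a function exported by the preproc package.
rank_sum_filter <- function(datasets, stat, frac) {
  # Per-study statistic for every gene, then its rank within that study.
  # Assumes all matrices share the same genes in the same row order.
  ranks <- do.call(cbind, lapply(datasets, function(m) rank(apply(m, 1, stat))))
  rank_sum <- rowSums(ranks)
  # Assumption: the number of genes to drop is rounded to the nearest integer.
  n_drop <- round(frac * length(rank_sum))
  # Keep every gene whose rank sum is not among the n_drop smallest.
  keep <- rank(rank_sum, ties.method = "first") > n_drop
  lapply(datasets, function(m) m[keep, , drop = FALSE])
}

# Made-up toy data: 10 genes, two studies with 4 and 5 samples.
set.seed(1)
genes <- paste0("g", 1:10)
toy <- list(
  study1 = matrix(rnorm(40, mean = 8), nrow = 10, dimnames = list(genes, NULL)),
  study2 = matrix(rnorm(50, mean = 8), nrow = 10, dimnames = list(genes, NULL))
)

step1 <- rank_sum_filter(toy, mean, frac = 0.3)     # drop low-mean genes
filtered <- rank_sum_filter(step1, sd, frac = 0.3)  # then drop low-variance genes
sapply(filtered, nrow)
```

Because the second step ranks only the genes that survived the first, the two proportions compound: with 10 genes and 0.3 in both steps, this sketch keeps 5 genes per study.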

When `data.type` is `"RNAseq-count"`, genes with very low counts are filtered out. For each gene, the mean count is calculated within each study, and the minimum of these means across studies is judged against `threshold`.
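
A minimal sketch of the count-based rule, again in base R. The helper name `min_mean_filter()` and the toy counts are hypothetical, and the exact comparison is an assumption: the page does not say whether a gene whose minimum mean equals `threshold` is kept, so the sketch keeps it.

```r
# Illustrative sketch only; min_mean_filter() is a hypothetical helper,
# not a function exported by the preproc package.
min_mean_filter <- function(datasets, threshold = 1) {
  # Mean count of every gene within each study (genes x studies matrix).
  means <- do.call(cbind, lapply(datasets, rowMeans))
  # Smallest per-study mean for each gene.
  min_mean <- apply(means, 1, min)
  # Assumption: genes whose minimum mean is below the threshold are dropped.
  keep <- min_mean >= threshold
  lapply(datasets, function(m) m[keep, , drop = FALSE])
}

# Made-up toy counts: 3 genes, 3 samples per study.
genes <- c("g1", "g2", "g3")
counts <- list(
  study1 = matrix(c(0, 0, 1,  5, 6, 7,  10, 12, 11), nrow = 3, byrow = TRUE,
                  dimnames = list(genes, NULL)),
  study2 = matrix(c(4, 5, 6,  0, 1, 0,   9,  9,  9), nrow = 3, byrow = TRUE,
                  dimnames = list(genes, NULL))
)

kept <- min_mean_filter(counts, threshold = 1)
rownames(kept$study1)
```

Here `g1` is nearly silent in the first study and `g2` in the second, so taking the minimum across studies removes both even though each is well expressed in the other study.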

A list of gene expression matrices after filtering, one matrix per study. Each row of a matrix is one gene and each column is one sample. The row names are gene symbols.

Lin Wang, Schwannden Kuo

data(datasets.eg)
data(preproc.option)
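# Preprocess one study: annotate probes, impute missing values, pool replicates
SinglePreproc <- function(x) {
  x <- Annotate(dataset=x, id.type = "ProbeID", platform=PLATFORM.hgu133plus2)
  x <- Impute(dataset=x)
  x <- PoolReplicate(dataset=x)
  x  # return the preprocessed dataset explicitly
}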
datasets.eg <- lapply(datasets.eg, SinglePreproc)
datasets.eg <- Merge(datasets=datasets.eg)
# Filter a list of matrices
res <- Filter(datasets=datasets.eg, data.type=DTYPE.microarray, del.perc=c(0.3, 0.2))
# Filter a Study object
study <- new("Study", name="test", dtype=DTYPE.microarray, datasets=datasets.eg)
res <- Filter(datasets=study, data.type=DTYPE.microarray, del.perc=c(0.3, 0.2))