keepMaxStatProbe: Filter multiple probesets matching to the same gene by...

View source: R/filter.R

keepMaxStatProbe    R Documentation

Filter multiple probesets matching to the same gene by keeping the one with the maximum statistic (by default the mean).

Description

The function filters features (commonly probesets) in an ExpressionSet object. It does not affect genes with only one feature present, or genes without a valid annotation (see details below). For genes with multiple probesets, the function calculates the statistic of each probeset across all samples and keeps only the probeset with the maximum statistic. The ExpressionSet returned by the function therefore has only one probeset matching each gene.
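The idea can be sketched in base R on a plain matrix. This is only an illustration under stated assumptions, not the package's implementation: keep_max_stat_rows is a hypothetical helper, it works on a matrix rather than an ExpressionSet, it assumes no statistic is NA, and rows whose gene index is NA are simply dropped by split().

```r
## Hypothetical helper, not part of ribiosExpression: for each gene, keep
## the matrix row (probeset) with the largest statistic.
keep_max_stat_rows <- function(mat, gene, stat, ...) {
  stopifnot(nrow(mat) == length(gene))
  ## one statistic per row, computed across all columns (samples)
  row_stat <- apply(mat, 1, stat, ...)
  ## row indices grouped by gene index
  rows_by_gene <- split(seq_len(nrow(mat)), gene)
  ## within each gene, the index of the row with the largest statistic
  keep <- vapply(rows_by_gene,
                 function(i) i[which.max(row_stat[i])],
                 integer(1))
  mat[sort(unname(keep)), , drop = FALSE]
}

mat <- matrix(c(1, 1, 3, 4,
                2, 2, 3, 3,
                4, 5, 6, 7,
                7, 8, 9, 10),
              ncol = 4, byrow = TRUE,
              dimnames = list(c("1a", "1b", "2", "3"), NULL))
gene <- c(1, 1, 2, 3)

kept <- keep_max_stat_rows(mat, gene, stat = var)
rownames(kept)
```

Rows "1a" and "1b" share gene 1; "1a" has the larger variance (2.25 versus about 0.33), so the result contains "1a", "2" and "3".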

Usage

keepMaxStatProbe(
  eset,
  probe.index.name,
  keepNAprobes = TRUE,
  stat = function(x) mean(x, na.rm = TRUE),
  ...
)

Arguments

eset

An ExpressionSet

probe.index.name

The column name of the fData(eset) data.frame that is used as the gene index to determine which features are matched to the same gene.

keepNAprobes

Logical, determines whether genes without a valid index name should be kept or left out. See details below.

stat

Function or character: a function (or the name of one) that takes a vector of numerical values and returns a single value as the statistic, e.g. sd for the standard deviation.

...

Parameters passed to the stat function. One of the most frequently used options is probably na.rm=TRUE; see details and examples.
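As a rough illustration of the contract for stat and of how extra arguments reach it, the base R sketch below uses apply(), which forwards additional arguments to the applied function. value_range is a hypothetical statistic, and whether keepMaxStatProbe uses apply() internally is an assumption.

```r
## 'value_range' is a hypothetical statistic, not part of the package:
## it maps a numeric vector to a single number.
value_range <- function(x, na.rm = FALSE) diff(range(x, na.rm = na.rm))

mat <- rbind(p1 = c(1, 1, 3, 4),
             p2 = c(2, 2, 3, 3))

## apply() forwards extra arguments to the function, much like '...'
range_stat   <- apply(mat, 1, value_range, na.rm = TRUE)
trimmed_mean <- apply(mat, 1, mean, trim = 0.25)

range_stat
trimmed_mean
```

Any argument accepted by the statistic can be forwarded this way, not only na.rm; here trim = 0.25 drops the smallest and largest value of each row before averaging.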

Details

Names of probesets are determined by the featureNames(eset) function.

The probe.index.name column of the fData(eset) data.frame holds the gene index, for example the Entrez GeneID, to which probesets are matched. Genes without a valid index, that is, whose index is either an empty string or NA, are left out when keepNAprobes=FALSE. If the option is set to TRUE, these genes are kept in the returned object.
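The validity rule described above (neither NA nor an empty string) can be expressed in base R as follows. The vector geneid and its values are made up for illustration, and the package may implement the check differently.

```r
## Made-up gene index values: two valid IDs, one empty string, one NA
geneid <- c("1017", "", NA, "1017", "7157")

## An index is valid when it is neither NA nor the empty string
has_valid_index <- !is.na(geneid) & geneid != ""
has_valid_index
```

The second and third entries are flagged as invalid; these correspond to the features that keepNAprobes decides to keep or drop.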

The stat function should take a vector of numerical values and return exactly one statistic, preferably not NA. Most statistics can be calculated in a robust way by setting na.rm=TRUE, and this option should be used whenever possible. Otherwise, when a probeset has one or more missing values, its statistic will probably be NA and the probeset will be discarded. Even worse, when all probesets matching a gene have NAs, the gene will be filtered out completely, which is usually not desired. Therefore, set na.rm=TRUE through the ... option (see examples below) whenever possible.
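The effect of a single missing value can be seen in base R alone. The sketch below uses two rows for the same gene and compares row means with and without na.rm=TRUE; it illustrates the general behaviour of mean() and which.max(), not the internals of keepMaxStatProbe.

```r
## Two probesets of one gene; the first has a missing value
mat <- rbind("1a" = c(NA, 1, 3, 4),
             "1b" = c(2, 2, 3, 3))

mean_plain <- apply(mat, 1, mean)                # "1a" becomes NA
mean_narm  <- apply(mat, 1, mean, na.rm = TRUE)  # "1a" is about 2.67

is.na(mean_plain)
## which.max() skips NA, so the winner changes with na.rm
names(which.max(mean_plain))
names(which.max(mean_narm))
```

Without na.rm=TRUE the statistic of "1a" is NA and "1b" wins by default; with na.rm=TRUE, "1a" has the larger mean and is selected.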

Value

A filtered ExpressionSet.

Note

Note that when the statistics of two or more probesets tie (have the same value), the choice among them is arbitrary: the probeset kept is the one whose name ranks first when the names are converted into a factor vector.
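To see which name would rank first under this rule, one can inspect the factor levels in base R. The probe names below are made up, and the exact tie-breaking code inside the package is not shown here, so treat this only as an illustration of factor ordering.

```r
## Made-up names of two probesets with tied statistics
tied <- c("probeB", "probeA")

## factor() sorts the unique values to build its levels
first_ranked <- levels(factor(tied))[1]
first_ranked
```

Because factor levels are sorted, the order in which the probesets appear in the data does not matter; note that sorting of character strings can depend on the locale for less regular names.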

Author(s)

Jitao David Zhang <jitao_david.zhang@roche.com>

Examples


library("Biobase")

example.mat <- matrix(c(1,1,3,4, 2,2,3,3, 4,5,6,7, 7,8,9,10), ncol=4, byrow=TRUE)
example.eset <- new("ExpressionSet", exprs=example.mat)

featureNames(example.eset) <- c("1a","1b","2","3")
fData(example.eset)$geneid <- c(1,1,2,3)

## keep probesets with the maximal standard deviation
example.sd <- keepMaxStatProbe(example.eset, probe.index.name="geneid", stat=sd)
featureNames(example.sd)

## keep probesets with the maximal Median Absolute Deviation (MAD)
example.mad <- keepMaxStatProbe(example.eset, probe.index.name="geneid", stat=mad)
featureNames(example.mad)

## keep probesets with the maximal mean value
example.mean <- keepMaxStatProbe(example.eset,
                                 probe.index.name="geneid", stat=mean)
featureNames(example.mean)

## note that NA values may cause problems; it is good practice to make
## the stat function robust to NA
na.eset <- example.eset
exprs(na.eset)[1,1] <- NA

## Not run: 
## prone to error
na.mean <- keepMaxStatProbe(na.eset,
                            probe.index.name="geneid", stat=mean)
featureNames(na.mean)
## better
na.mean.narm <- keepMaxStatProbe(na.eset,
                                 probe.index.name="geneid", stat=mean, na.rm=TRUE)
featureNames(na.mean.narm)

## End(Not run)



bedapub/ribiosExpression documentation built on Sept. 2, 2023, 4:37 a.m.