collapseDataset: collapseDataset
In stefanavey/aveytoolkit: Toolkit of Helper Functions


View source: R/aveytoolkit_collapseDataset.R

Collapses a dataset from probes to gene symbols.

collapseDataset(
  exprsVals,
  platform = NULL,
  mapVector = NULL,
  oper = max,
  prefer = c("none", "up", "down"),
  singleProbeset = FALSE,
  returnProbes = FALSE,
  deProbes = NULL,
  debug = FALSE
)

`exprsVals`	a matrix or data.frame of numeric values with rownames denoting the identifiers.
`platform`	the microarray platform the data comes from for extracting the gene symbols
`mapVector`	a uniquely named character vector whose names specify the current identifiers (probes matching the rownames of exprsVals) and whose values specify the gene symbols (or other identifiers to collapse to).
`oper`	the operation used to choose among probes when multiple probes map to the same gene. Default is max, which takes the maximum of the average.
`prefer`	one of "none", "up", or "down"; can be abbreviated. Controls which probe in `deProbes` is preferred when several map to the same gene.
`singleProbeset`	If `TRUE`, the operation applies to the average over all conditions and all values for a gene will come from one probeset. Otherwise, if `FALSE`, the operation applies to the probesets over all conditions and the values for a gene may come from different probesets. Default is `FALSE` for compatibility reasons but `TRUE` is recommended.
`returnProbes`	if `TRUE`, a list containing both the collapsed expression matrix and the chosen probes is returned (see the return value described below).
`deProbes`	a list with named vectors "up" and "down" giving the names of up- and downregulated probes.
`debug`	When `TRUE`, diagnostic messages are printed to help debug errors.
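
As an illustration of the expected shapes of `mapVector` and `deProbes`, here is a minimal sketch. The annotation table `annot`, its column names, and the probe names are hypothetical and only serve the example.

```r
# Hypothetical annotation table; column names are assumptions for illustration
annot <- data.frame(probe_id = paste("probe", 1:4, sep = "_"),
                    symbol   = c("Gene_A", "Gene_A", "Gene_B", "Gene_B"),
                    stringsAsFactors = FALSE)

# mapVector: names are the probe identifiers, values are the gene symbols
mv <- setNames(annot$symbol, annot$probe_id)

# The names must be unique, since each probe maps to exactly one symbol
stopifnot(is.character(mv), anyDuplicated(names(mv)) == 0)

# deProbes: a list with elements "up" and "down" holding probe names
de <- list(up = "probe_2", down = c("probe_3", "probe_4"))
```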

This function is designed for microarray data but works for any numeric matrix in which multiple rows need to be collapsed. Base R's `aggregate` function would probably be faster; this code takes the slow, brute-force approach.
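
For comparison, a collapse with base R's `aggregate` might look like the sketch below. It is not the package's implementation: it takes the per-sample maximum within each gene, which roughly mirrors the behavior described for `singleProbeset = FALSE` with `oper = max`. The toy matrix `exprs` and the mapping `genes` are made up for the example.

```r
# Hypothetical toy data: 4 probes (rows) by 3 samples (columns)
exprs <- rbind(probe_1 = c(1, 4, 7),
               probe_2 = c(5, 3, 2),
               probe_3 = c(2, 9, 6),
               probe_4 = c(8, 1, 6))
colnames(exprs) <- paste("sample", LETTERS[1:3], sep = "_")

# Probe-to-gene mapping, in the same form as mapVector
genes <- c(probe_1 = "Gene_A", probe_2 = "Gene_A",
           probe_3 = "Gene_B", probe_4 = "Gene_B")

# Group rows by gene and take the maximum of each sample column
agg <- aggregate(as.data.frame(exprs),
                 by = list(gene = unname(genes[rownames(exprs)])),
                 FUN = max)

# Convert back to a numeric matrix with gene symbols as rownames
collapsed <- as.matrix(agg[, -1])
rownames(collapsed) <- as.character(agg$gene)
collapsed
```

Note that this approach cannot report which probe was chosen, because each gene's row may mix values from several probes.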

If singleProbeset is FALSE (the default for compatibility reasons, but untested and not recommended), the value for each sample is the maximum (under the default oper) across all probes that map to that gene. This means that a gene's expression values may be a composite of values from different probes rather than from a single probe. Most users will not need the 'prefer' argument. If prefer is "up", then when multiple deProbes map to the same gene, the upregulated probe is chosen; similarly for "down". The default is "none", in which case the probe selected by 'oper' (default max) is chosen.
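
The difference between the two settings of singleProbeset can be illustrated in plain R for a single gene. This is a sketch of the behavior as described above, not the function's internal code, and the values are hypothetical.

```r
# Hypothetical: two probes for one gene, measured in three samples
probes <- rbind(probe_1 = c(sample_A = 1, sample_B = 4, sample_C = 7),
                probe_2 = c(sample_A = 5, sample_B = 3, sample_C = 2))

# singleProbeset = TRUE, as described: apply oper (max) to each probe's
# average, then take every value from the winning probe
probe_means <- rowMeans(probes)
single <- probes[which.max(probe_means), ]

# singleProbeset = FALSE, as described: apply oper (max) per sample,
# so the result can mix values from different probes
mixed <- apply(probes, 2, max)

single  # all three values come from probe_1
mixed   # sample_A comes from probe_2, the others from probe_1
```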

Note that multiple probes can yield the same value of the operation (oper) over all conditions; in this case, I've arbitrarily decided to choose the first one.
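
A "first one wins" tie-break can be sketched as follows. The scores are hypothetical, and the selection expression is an assumption about how such a rule might be written, not the function's actual code.

```r
# Hypothetical per-probe scores (e.g. averages across samples) with a tie
scores <- c(probe_1 = 8.2, probe_2 = 8.2, probe_3 = 7.9)

# Keep the first probe whose score equals oper(scores)
oper <- max
chosen <- names(scores)[which(scores == oper(scores))[1]]
chosen
```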

If returnProbes is TRUE, a list is returned containing the collapsed dataset in $exprsVals and the chosen probes in $probeSets. Otherwise, if returnProbes is FALSE, only the collapsed expression matrix is returned.

Christopher Bolen, Modified by Stefan Avey

## Trivial Example showing basic functionality
fakeExpr <- matrix(rnorm(50, mean=8, sd=1), ncol=5, nrow=10,
                   dimnames=list(probes=paste("probe", 1:10, sep='_'),
                     samples=paste("sample", LETTERS[1:5], sep='_')))
mv <- rep(paste("Gene", LETTERS[1:5], sep='_'), each=2) # mapVector
names(mv) <- rownames(fakeExpr)
res <- collapseDataset(fakeExpr, mapVector=mv, oper=max,
                       singleProbeset=TRUE, # recommend setting singleProbeset to TRUE
                       returnProbes=TRUE) 
res$probes
## between probe_1 and probe_2, probe_2 was chosen for Gene_A
## between probe_3 and probe_4, probe_4 was chosen for Gene_B
## etc.

res$exprsVals                           # collapsed expression values

## only difference is in rownames, numbers are identical
all.equal(res$exprsVals, fakeExpr[res$probes,])