filterRedundant: This function removes redundant features from a data.frame

Description

Prior to computing the proportion of overlap between ranked vectors of features, it is necessary to remove the redundant features. This can be accomplished using a number of methods implemented in the filterRedundant function, as explained below.

Usage

filterRedundant(object,
    method=c("maxORmin", "geoMean", "mean", "median","random"),
    idCol=1, byCol=2, absolute=TRUE, decreasing=TRUE, trim=0, ...)

Arguments

object

a data.frame from which redundant features (rows) must be removed.

method

character. The method used for removing redundancy. Currently available methods are: maxORmin, geoMean, mean, median, and random (see Details below).

idCol

character or numeric. Name or index of the column containing redundant identifiers (e.g. ENTREZID, SYMBOLS, ...).

byCol

character or numeric. Name or index of the column containing the ranking statistic (used only with the maxORmin method).

absolute

logical. Indicates whether the absolute value of the statistic, as defined by byCol, should be used when reordering (used only with the maxORmin method).

decreasing

logical. Indicates whether reordering should be decreasing or not (used only with the maxORmin method).

trim

numeric. The amount of trimming to apply when computing the mean; the default, 0, gives the ordinary mean (used only with the mean method).

...

further arguments to be passed (not currently implemented).

Details

The maxORmin method removes redundant features by selecting the rows that correspond to the maximum or minimum value of a selected statistic. With this approach, redundant features are first ranked in increasing or decreasing order, as defined by the decreasing argument, using the ranking statistic defined by byCol, either on its original or absolute scale, as defined by the absolute argument. Subsequently, the data.frame rows corresponding to redundant identifiers are removed, after these have been identified in the column defined by idCol using the duplicated function.
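The ranking-then-deduplication idea described above can be sketched in base R. This is only an illustration of the approach, not the package's actual implementation; the toy data.frame, its column names, and its values are made up for the example.

```r
# Toy data (hypothetical values): repeated gene symbols with a t-statistic
df <- data.frame(SYMBOL = c("A", "B", "A", "C", "B"),
                 t = c(1.5, -3.2, -2.7, 0.4, 2.1),
                 stringsAsFactors = FALSE)

# Rank rows by the absolute statistic, largest first
# (mirrors absolute=TRUE, decreasing=TRUE)
ranked <- df[order(abs(df$t), decreasing = TRUE), ]

# duplicated() flags every occurrence of an identifier after the first,
# so negating it keeps only the top-ranked row for each identifier
filtered <- ranked[!duplicated(ranked$SYMBOL), ]
filtered
```

Because the rows are sorted before duplicated is applied, the row that survives for each identifier is the one with the most extreme statistic.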

The mean, median, geoMean, and random methods provide alternative ways of summarizing the numerical values corresponding to redundant features, as defined by the idCol argument: mean takes the average, median takes the median, geoMean takes the geometric mean, and random selects a random value.
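As a rough illustration of these summaries, the sketch below collapses redundant identifiers with base R's aggregate. It is not the package's code: the helpers geo_mean and pick_one are hypothetical, the data are made up, and the geometric mean shown here assumes strictly positive values.

```r
# Toy data (hypothetical values): repeated symbols with a positive measurement
df <- data.frame(SYMBOL = c("A", "B", "A", "C", "B"),
                 value = c(2, 4, 8, 5, 16),
                 stringsAsFactors = FALSE)

# Hypothetical helper: geometric mean, valid for strictly positive values only
geo_mean <- function(x) exp(mean(log(x)))

# Hypothetical helper: pick one value at random
# (indexing avoids the sample(x, 1) pitfall when x has length 1)
pick_one <- function(x) x[sample.int(length(x), 1)]

# One summary row per identifier, for each summarizing function
by_mean   <- aggregate(value ~ SYMBOL, data = df, FUN = mean)
by_median <- aggregate(value ~ SYMBOL, data = df, FUN = median)
by_geo    <- aggregate(value ~ SYMBOL, data = df, FUN = geo_mean)

set.seed(1)  # make the random choice reproducible
by_random <- aggregate(value ~ SYMBOL, data = df, FUN = pick_one)
```

With only two values per identifier, the mean and median coincide, while the geometric mean is pulled toward the smaller value.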

Value

A data.frame, typically with fewer rows than the input, unique by the identifier specified by the idCol argument.

Note

filterRedundant is a utility function providing various methods to remove redundant rows from a data.frame. The choice of method depends on the nature of the values and on the final goal. Caution should therefore be used when taking the mean or the median across few values, or when passing arguments to the maxORmin method (for instance, it would make no sense to use a decreasing ordering if the ranking statistic is a p-value).
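The p-value caveat can be made concrete with a small base R sketch. The helper keep_first, the column name p, and the values are hypothetical; the sketch mimics the ordering step and does not call the package itself.

```r
# Toy data (hypothetical values): two rows per symbol, ranked by p-value
df <- data.frame(SYMBOL = c("A", "A", "B", "B"),
                 p = c(0.001, 0.20, 0.04, 0.50),
                 stringsAsFactors = FALSE)

# Hypothetical helper: order by p, then keep the first row per identifier
keep_first <- function(d, decreasing) {
  ranked <- d[order(d$p, decreasing = decreasing), ]
  ranked[!duplicated(ranked$SYMBOL), ]
}

# Decreasing order retains the largest (least significant) p-value per symbol
wrong <- keep_first(df, decreasing = TRUE)

# Increasing order retains the smallest (most significant) p-value per symbol
right <- keep_first(df, decreasing = FALSE)
```

With filterRedundant itself, the analogous choice for a p-value column would presumably be absolute=FALSE and decreasing=FALSE.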

Author(s)

Luigi Marchionni <marchion@jhu.edu>

See Also

See duplicated.

Examples

###load data
data(matchBoxExpression)

###count the rows in each data.frame before removing redundant identifiers
sapply(matchBoxExpression, nrow)

###the column name for the identifiers
idCol <- "SYMBOL"

###the column name for the ranking statistics
byCol <- "t"

###use lapply to remove redundancy from all data.frames
###default method is "maxORmin"
newMatchBoxExpression <- lapply(matchBoxExpression, filterRedundant, idCol=idCol, byCol=byCol)

###recheck number of rows
sapply(newMatchBoxExpression, nrow)

marchion/matchBox documentation built on May 9, 2019, 4:07 p.m.