centerGeneData: Center gene data
In jmw86069/jamma: MA-plots for omics data

Description

Performs per-row centering on a numeric matrix.

Usage

centerGeneData(
  x,
  centerGroups = NULL,
  na.rm = TRUE,
  controlSamples = NULL,
  useMedian = TRUE,
  rmOutliers = FALSE,
  madFactor = 5,
  controlFloor = NA,
  naControlAction = c("na", "row", "floor", "min"),
  naControlFloor = 0,
  rowStatsFunc = NULL,
  returnGroupedValues = FALSE,
  returnGroups = FALSE,
  mean = NULL,
  verbose = FALSE,
  ...
)

Arguments

`x`	`numeric` matrix of input data. The data is assumed to be log2-transformed, or otherwise appropriately transformed; see the assumptions listed in Details.
`centerGroups`	`character` vector of group names, or `NULL` if there are no groups.
`na.rm`	`logical` indicating whether NA values should be ignored for summary statistics. This argument is passed to the corresponding row stats function. Frankly, this value should be `na.rm=TRUE` for all stat functions by default, for example `mean(..., na.rm=TRUE)` should be default.
`controlSamples`	`character` vector of values in `colnames(x)` which defines the columns to use when calculating group summary values.
`useMedian`	`logical` indicating whether to use group median values (`TRUE`) or group means (`FALSE`) when calculating summary statistics. In either case, when `rowStatsFunc` is provided, it is used instead.
`rmOutliers`	`logical` indicating whether to perform outlier detection and removal prior to row group stats. This argument is passed to `jamba::rowGroupMeans()`. Note that outliers are only removed during the row group summary step, and not in the centered data.
`madFactor`	`numeric` value passed to `jamba::rowGroupMeans()`, indicating the MAD factor threshold to use when `rmOutliers=TRUE`. The MAD of each row group is computed, the overall group median MAD is used to define 1x MAD factor, and any MAD more than `madFactor` times the group median MAD is considered an outlier and is removed. The remaining data is used to compute row group values.
`controlFloor`	`numeric` value used as a minimum for any control summary value during centering. Use `NA` to skip this behavior. When defined, all control group summary values are calculated, then any values below `controlFloor` are set to the `controlFloor` for the purpose of data centering. By default `controlFloor=NA` which imposes no such floor value. However, `controlFloor=0` would be appropriate when zero is defined as effective noise floor after something like background subtraction during the upstream processing or upstream normalization. Using a value above zero would be appropriate when the effective noise floor of a platform is above zero, so that values are not centered relative to noise. For example, if the effective noise floor is 5, then centering should not "amplify" differences from any value less than 5, since in this scenario a value of 5 or less is effectively the same as a value of 5. It has the effect of returning fold changes relative to the effective platform minimum detectable signal.
`naControlAction`	`character` string indicating how to handle the specific scenario when the control group summary value is `NA` for a particular centering operation. `"na"`: default is to return `NA` since 15 - NA = NA. `"row"`: use the summary value across all relevant samples, so the centering is against all non-NA values within the center group. `"floor"`: use the numeric value defined by `naControlFloor`, to indicate a practical noise floor for the centering operation. When `naControlFloor=0` (default) this option effectively keeps non-NA values without centering these values. `"min"`: use the minimum control value as the floor, which effectively defines the floor by the lowest observed summary value across all rows. It assumes rows are generally on the same range of detection, even if not all rows have the same observed range. For example, microarray probes have reasonably similar theoretical range of detection, even if some probes to highly-expressed genes are commonly observed with higher signal. The lowest observed signal effectively sets the minimum detected value.
`rowStatsFunc`	optional `function` used to calculate row group summary values. This function should take a numeric matrix as input, and return a one-column numeric matrix as output, or a numeric vector with length `nrow(x)`. The function should also accept `na.rm` as an argument.
`returnGroupedValues`	`logical` indicating whether to include the numeric matrix of row group values used during centering, returned in the attributes with name `"x_group"`.
`returnGroups`	`logical` indicating whether to return the centering summary `data.frame` in the attributes with name `"center_df"`.
`verbose`	`logical` indicating whether to print verbose output.
`...`	additional arguments are passed to `jamba::rowGroupMeans()`.
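
The effect of `controlFloor` described above can be illustrated with a small base R sketch. This is only an approximation of the described behavior, not the package's implementation; the helper name `center_with_floor_sketch` and the example values are hypothetical.

```r
# Hypothetical helper: apply a floor to control summary values before centering.
center_with_floor_sketch <- function(x, controlSamples, controlFloor=NA) {
   ctrl_median <- apply(x[, controlSamples, drop=FALSE], 1, median, na.rm=TRUE)
   if (!is.na(controlFloor)) {
      # control summaries below the floor are raised to the floor
      ctrl_median <- pmax(ctrl_median, controlFloor)
   }
   # the vector recycles down columns, so element i applies to row i
   x - ctrl_median
}

# gene1 controls are (2, 3), gene2 controls are (8, 9)
x <- matrix(c(2, 8,
              3, 9,
              9, 12),
   nrow=2,
   dimnames=list(c("gene1", "gene2"), c("ctrl1", "ctrl2", "trt1")))

# without a floor, gene1 is centered versus its control median 2.5
x_nofloor <- center_with_floor_sketch(x,
   controlSamples=c("ctrl1", "ctrl2"))
# with controlFloor=5, gene1 is centered versus 5; gene2 is unaffected
x_floor <- center_with_floor_sketch(x,
   controlSamples=c("ctrl1", "ctrl2"),
   controlFloor=5)
```

In this sketch the gene1 treatment value drops from 6.5 to 4 when the floor is applied, because the difference is measured from the floor and not from a control value below it.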

Details

This function centers data by subtracting the median or mean for each row.
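
As a rough illustration of this default behavior (no groups, no control samples), the per-row subtraction can be sketched in base R. This is a minimal sketch and not the package's implementation; the helper name `center_rows_sketch` is hypothetical.

```r
# Hypothetical helper: subtract each row's median (or mean) from that row.
center_rows_sketch <- function(x, useMedian=TRUE, na.rm=TRUE) {
   row_stat <- if (useMedian) {
      apply(x, 1, median, na.rm=na.rm)
   } else {
      rowMeans(x, na.rm=na.rm)
   }
   # the vector recycles down columns, so element i is subtracted from row i
   x - row_stat
}

x <- matrix(1:100, ncol=10)
colnames(x) <- letters[1:10]
x_ctr <- center_rows_sketch(x)
# each row of the centered result has median zero
apply(x_ctr, 1, median)
```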

Columns can be grouped using argument centerGroups. Each group of columns defined by centerGroups is centered independently.

Data can be centered relative to specific control columns using argument controlSamples. When controlSamples is not supplied, the default behavior is to use all columns. This process is consistent with typical MA-plots.

It may be preferred to define controlSamples in cases where there are known reference samples, against which other samples should be compared.

The controlSamples logic is applied independently to each group defined in centerGroups.
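
The combination of centerGroups and controlSamples can be sketched in base R as follows. This is an illustrative approximation only; the helper name `center_groups_sketch` is hypothetical, and it assumes every group contains at least one control column.

```r
# Hypothetical helper: center each group of columns against its own controls.
center_groups_sketch <- function(x, centerGroups, controlSamples=colnames(x)) {
   x_ctr <- x * 1.0   # numeric copy that keeps dimnames
   for (grp in unique(centerGroups)) {
      grp_cols <- colnames(x)[centerGroups == grp]
      # controls are restricted to the columns of the current group
      ctrl_cols <- intersect(grp_cols, controlSamples)
      ctrl_median <- apply(x[, ctrl_cols, drop=FALSE], 1, median, na.rm=TRUE)
      x_ctr[, grp_cols] <- x[, grp_cols, drop=FALSE] - ctrl_median
   }
   x_ctr
}

x <- matrix(1:100, ncol=10)
colnames(x) <- letters[1:10]
x_ctr <- center_groups_sketch(x,
   centerGroups=rep(c("A", "B"), c(5, 5)),
   controlSamples=letters[c(1:3, 6:8)])
# within each group, the middle control column (b, g) is centered to zero
x_ctr[1:3, c("b", "g")]
```

Because the controls are intersected with each group, columns a, b, c serve as controls only for group A, and columns f, g, h only for group B.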

You can confirm the centerGroups and controlSamples are correct in the result data, by accessing the attribute "center_df", see examples below.

Note: This function assumes input data is suitable for centering by subtraction. This data requirement is true for:

most log-transformed gene expression data
quantitative PCR (QPCR) cycle threshold (CT) values
other numeric data that has been suitably transformed to meet reasonable parametric assumption of normality
rank-transformed data which results in difference in rank
generally speaking, any data where the difference between 5 and 7 (2) is reasonably similar to the difference between 15 and 17 (2).
it may be feasible to perform background subtraction on straight count data, for example sequence coverage at a particular location in a genome.

The data requirement is not true for:

most gene expression data in normal space (hint: if any value is above 100, it is generally not log-transformed)
numeric data that is strongly skewed
generally speaking, any data where the difference between 5 and 7 is not reasonably similar to the difference between 15 and 17. If the percent difference is more likely to be the interesting measure, data may be log-transformed for analysis.
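
The hint above can be turned into a quick check before centering. The sketch below uses hypothetical values in normal space with no zeros, so plain `log2()` is sufficient; real data may need an offset or another transformation, which is an assumption left to the analyst.

```r
# Hypothetical data in normal (non-log) space.
# gene2 is 20 times gene1 in every sample, so the ratios across samples match.
x_raw <- matrix(c(10, 200, 40, 800, 20, 400),
   nrow=2,
   dimnames=list(c("gene1", "gene2"), c("s1", "s2", "s3")))

# heuristic from the hint above: values above 100 suggest non-log data
if (max(x_raw, na.rm=TRUE) > 100) {
   x_log <- log2(x_raw)
} else {
   x_log <- x_raw
}

# after log2 transformation, centering by subtraction gives log2 fold changes
x_ctr <- x_log - apply(x_log, 1, median)
x_ctr
```

After transformation both genes show the same centered values, because equal ratios in normal space become equal differences in log2 space.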

For special cases, rowStatsFunc can be supplied to perform specific group summary calculations per row.
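
A custom summary function might look like the sketch below, which follows the stated contract for rowStatsFunc: numeric matrix in, numeric vector of length `nrow(x)` out, with an `na.rm` argument. The function name `rowTrimmedMeans` and the choice of a 20% trimmed mean are hypothetical.

```r
# Hypothetical custom summary: 20% trimmed mean per row.
# '...' is included defensively in case other arguments are passed along.
rowTrimmedMeans <- function(x, na.rm=TRUE, trim=0.2, ...) {
   apply(x, 1, mean, trim=trim, na.rm=na.rm)
}

x <- matrix(1:100, ncol=10)
colnames(x) <- letters[1:10]
x[1, 10] <- 1000   # one extreme value in row 1
row_stats <- rowTrimmedMeans(x)
# one summary value per row; the extreme value is trimmed away in row 1
length(row_stats) == nrow(x)

# Intended usage, assuming the jamma package is installed:
# centerGeneData(x, rowStatsFunc=rowTrimmedMeans)
```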

Control groups with NA values (since version 0.0.28.900)

When controlSamples is supplied, and all controlSamples values are NA for a given row of data within the relevant centerGroups subset, the behavior is defined by naControlAction, with default naControlAction="na":

naControlAction="na": values are centered versus NA which results in all values NA (current behavior, default).
naControlAction="row": values are centered versus the row, using all samples in the same center group. This action effectively "centers to what we have".
naControlAction="floor": values are centered versus a numeric floor defined by argument naControlFloor. When naControlFloor=0 then values are effectively not centered. However, naControlFloor=10 could for example be used to center values versus a practical noise floor, if the range of detection for a particular experiment starts at 10 as a low value.
naControlAction="min": values are centered versus the minimum observed summary value in the data, which effectively uses the data to define a value for naControlFloor.

The motivation to center versus something other than controlSamples when all measurements for controlSamples are NA is to have a numeric value to indicate that a measurement was detected in non-control columns. This situation occurs in technologies when control samples have very low signal, and in some cases report NA when no measurement is detected within the instrument range of detection.
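
The four actions can be compared side by side with a small base R sketch. It approximates the behavior described above for a single center group and is not the package's implementation; the helper name `na_control_sketch` and the example values are hypothetical.

```r
# Hypothetical helper: choose a replacement when the control summary is NA.
na_control_sketch <- function(x, controlSamples,
   naControlAction=c("na", "row", "floor", "min"), naControlFloor=0) {
   naControlAction <- match.arg(naControlAction)
   ctrl_median <- apply(x[, controlSamples, drop=FALSE], 1, median, na.rm=TRUE)
   is_na <- is.na(ctrl_median)
   if (any(is_na)) {
      ctrl_median[is_na] <- switch(naControlAction,
         na=NA_real_,
         row=apply(x[is_na, , drop=FALSE], 1, median, na.rm=TRUE),
         floor=naControlFloor,
         min=min(ctrl_median, na.rm=TRUE))
   }
   x - ctrl_median
}

# gene1 has no control measurements; gene2 has control median 5
x <- matrix(c(NA, 4, NA, 6, 7, 9),
   nrow=2,
   dimnames=list(c("gene1", "gene2"), c("ctrl1", "ctrl2", "trt1")))

# centered gene1 value in trt1 under each action
res <- sapply(c("na", "row", "floor", "min"), function(a) {
   na_control_sketch(x, c("ctrl1", "ctrl2"), naControlAction=a)["gene1", "trt1"]
})
res
# na=NA, row=0, floor=7, min=2
```

Only the "na" action loses the gene1 measurement; the other three keep a numeric value that indicates signal was detected in the non-control column.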

Examples

x <- matrix(1:100, ncol=10);
colnames(x) <- letters[1:10];
# basic centering
centerGeneData(x);

# grouped centering
centerGeneData(x,
   centerGroups=rep(c("A","B"), c(5,5)));

# centering versus specific control columns
centerGeneData(x,
   controlSamples=letters[c(1:3)]);

# grouped centering versus specific control columns
centerGeneData(x,
   centerGroups=rep(c("A","B"), c(5,5)),
   controlSamples=letters[c(1:3, 6:8)]);

# confirm the centerGroups and controlSamples
x_ctr <- centerGeneData(x,
   centerGroups=rep(c("A","B"), c(5,5)),
   controlSamples=letters[c(1:3, 6:8)],
   returnGroups=TRUE);
attr(x_ctr, "center_df");

jmw86069/jamma documentation built on June 13, 2025, 3:58 p.m.