jammanorm | R Documentation |
Normalize data using MA-plot logic
jammanorm(
x,
controlGenes = NULL,
minimum_mean = 0,
controlSamples = NULL,
centerGroups = NULL,
useMedian = TRUE,
useMean = NULL,
noise_floor = -Inf,
noise_floor_value = noise_floor,
verbose = FALSE,
...
)
x |
|
minimum_mean |
|
controlSamples |
|
centerGroups |
|
useMedian |
|
useMean |
(deprecated) |
noise_floor , noise_floor_value |
|
verbose |
logical indicating whether to print verbose output. |
... |
additional parameters sent to downstream functions,
|
This function normalizes data using an approach analogous to data viewed in MA-plots. Normalization is applied by shifting data along the y-axis so the mean (or median) expression among control genes is zero, indicated on an MA-plot as y=0.
Note: This method should be performed only after reviewing the MA-plots, to ensure the assumptions are met. Similarly, data can also be viewed in MA-plots after normalization to confirm and review the effect of normalization.
It is useful to run jammaplot()
data after jammanorm()
to visualize the effect of this normalization. If data is not
centered at y=0, the parameters should be adjusted.
This method effectively reinforces the assumption that the mean log fold
change for control genes is expected to be zero. When useMedian=TRUE
it reinforces the assumption that the log fold change for the
majority of control genes should be zero.
Therefore, the assumptions may be summarized as follows:
The principle assumption is that the set of controlGenes
,
whose mean expression is at or above the minimum_mean
value,
are unchanged within the respective centerGroups
. For typical
whole genome transcript microarray and RNA-seq experiments,
this assumption is typically valid when using useMedian=TRUE
.
For experiments with specific reference genes, or housekeeper
genes, this assumption may only be true for those specific genes.
The data signal is assumed to be a roughly linear representation
of the relative abundance of each measured entity, which is usually
true for log-transformed microarray intensity, or log-transformed
RNA-seq read counts. For QPCR or TaqMan, this assumption is valid
for the direct CT cycle threshold values, or after log-transformation
of the exponentiated CT values, for example 2^(40-CT).
All that said, a straightforward way to visualize this assumption
is with MA-plots, to confirm that signal is horizontal across
the full signal range - either for the majority of all genes,
or for the specific controlGenes
used for normalization.
The variability among control genes should not be more than twice
the median absolute deviation across other samples within the
relevant centerGroups
. Effectively this assumption means the
control genes on the MA-plot should not show wide spread along the
y-axis.
In cases where some samples show non-horizontal signal across the
MA-plot, the data is not conforming to a consistent and proportional
signal across the response range of the experiment. In effect, it
means signal is being compressed, or expanded along the response
as compared to other samples in the same centerGroups
. In this
scenario, the best normalization method may be
limma::normalizeQuantiles()
, limma::normalizeCyclicLoess()
,
or vsn::vsn()
. These methods adjust the distribution of signal
to enforce consistency across samples.
In general, the signal distribution itself should not be adjusted
unless necessary, in order to retain as much information from the
underlying technology as possible. This method jammanorm()
is
intended to apply linear normalization, which effectively shifts
the entire signal for a sample up or down relative to other samples.
This scenario is effective for technologies such as QPCR, TaqMan, Nanostring counts, and RNA-seq counts or RNA-seq pseudocounts.
When the MA-plot demonstrates non-horizontal signal, it is most often the result one or both of these influences:
batch effect, imposed either by different upstream sample processing steps among the samples being tested, or
platform technology that tends to produce relative signal strength but not absolute quantitative signal, commonly seen with microarray hybridization technologies such as Affymetrix, Illumina, Agilent, SomaLogic, or Myriad RBM.
Note that any upstream sample amplification technique may also impose non-linear effects on the molecules being measured.
One method to test for a batch effect is to define centerGroups
to include batch, so the data will be centered for each batch
independently. If this centering resolves the non-horizontal
signal, then batch is very likely to be a component to be modeled
in the experiment. See limma::removeBatchEffect()
. The batch effect
adjustment by limma::removeBatchEffect()
and sva::ComBat()
almost exactly subtract the batch component from the signal.
That said, it may or may not be ideal to apply batch adjustment
prior to running downstream statistical tests, as opposed to
including batch as a covariate term in the statistical
model used for testing, example when using DESeq2
.
The main benefit of applying batch adjustment at this step is to
visualize data downstream consistent with the method used by
those statistical tests, or when running a clustering technique
that does not have the capability of applying appropriate
batch effect modeling.
This normalization is a "linear normalization" in that it uniformly shifts data up or down relative to other samples, without modifying the relative distribution of signal. It is very similar to housekeeper normalization, geometric mean normalization, and global signal scaling, which are all also "linear normalization" methods. An example of non-linear normalization is quantile or VSN normalization.
The control genes can be defined upfront with controlGenes
,
which can be housekeeper genes, or a custom subset of genes.
The controlGenes
are filtered to require the mean or median
expression at or above minimum_mean
in order to be used
during normalization.
The minimum_mean
threshold is useful and important to match
the variability seen in the MA-plots. For example data below
a certain x-axis value may have very high variability, and
should usually not be used for normalization.
When the MA-plot after normalization does not show signal centered
at y=0, the most common and effective adjustment is to apply
minimum_mean
to require controGenes
to have expression
at or above this threshold. The next most effective option is
useMedian=TRUE
which will center the majority of genes at
y=0 instead of the overall mean at y=0.
The end result is very similar to other housekeeper normalization methods which typically define normalization factors by calculating the geometric mean of log-transformed housekeeper gene abundances. Such approaches usually work in part because higher expressed housekeeper genes usually have lower variability, which keeps the influence on geometric mean reasonably consistent across a broad range of expression. That said, genes with higher expression have more influence on the geometric mean than genes with much lower expression.
However, the strategy with jammanorm()
is to assert that
the mean difference from average expression for the controlGenes
should be equal to zero. The effect is applied evenly across
control genes by evaluating the mean or median difference from y=0
for each sample.
Note that some platform technologies generate a noise threshold, below which data may be skewed up or down relative to other samples. It is recommended to ignore this type of skew below the threshold when determining whether data is horizontal on MA-plots.
For example Nanostring data includes a series of positive and negative control probes, and a suitable noise threshold is either the midpoint between the lowest positive and highest negative probe, or the lowest positive probe. When this noise threshold is applied, data above the noise threshold is typically horizontal, although data below the threshold may be skewed up or down depending upon the effective input RNA concentration.
numeric
matrix
whose columns have been normalized,
with the following named attributes
:
"nf"
numeric vector of normalization factors;
"hk"
a character
vector of controlGenes for each sample;
"hk_count"
the integer
number of controlGenes for each sample.
Other jam matrix functions:
centerGeneData()
,
jammacalc()
,
matrix_to_column_rank()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.