matrix_normalize | R Documentation |
Normalize a numeric data matrix
matrix_normalize(
x,
method = c("quantile", "jammanorm", "limma_batch_adjust", "TMM", "TMMwsp", "RLE"),
apply_log2 = c("ifneeded", "no", "always"),
floor = 0,
floor_value = floor,
enforce_norm_floor = TRUE,
params = list(quantile = list(ties = TRUE), jammanorm = list(controlGenes = NULL,
minimum_mean = 0, controlSamples = NULL, centerGroups = NULL, useMedian = FALSE,
noise_floor = NULL, noise_floor_value = NULL), limma_batch_adjust = list(batch =
NULL, group = NULL), TMM = list(refColumn = NULL, logratioTrim = 0.3, sumTrim = 0.05,
doWeighting = TRUE, Acutoff = NULL), TMMwsp = list(refColumn = NULL, logratioTrim =
0.3, sumTrim = 0.05, doWeighting = TRUE, Acutoff = NULL)),
normgroup = NULL,
subset_columns = NULL,
debug = FALSE,
verbose = TRUE,
...
)
x |
|
method |
|
apply_log2 |
|
floor |
|
floor_value |
|
enforce_norm_floor |
|
params |
|
normgroup |
|
subset_columns |
|
verbose |
|
This function is a wrapper for several relevant normalization methods that operate on a numeric matrix.
numeric
matrix with the same dimensions as the
input matrix x
.
Additional information may be returned as attributes(x)
:
"norm_method"
: a character
string with the method used.
"nf"
: a numeric
vector with normalization factors, returned
only by "jammanorm"
, "TMM"
, and "TMMwsp"
.
"hk"
: a character
vector of rownames(x)
used as housekeeper
controlGenes
by "jammanorm"
.
Quantile-normalization performed by
limma::normalizeQuantiles()
. This method has one
parameter "ties"
passed to limma::normalizeQuantiles()
,
the default here ties=TRUE
which handles tied numeric
expression values in a robust way to avoid unpredictability
otherwise. This option is especially relevant with expression
count data, where integer counts cause a large number
of values to be represented multiple times.
Median-normalization performed by
jamma::jammanorm()
. This method shifts expression
data as shown on MA-plots, so the median expression
is zero across all samples, using only the rows that
meet the relevant criteria.
Some relevant criteria to define rows used for normalization:
controlGenes
defines specific genes to use for
normalization, such as housekeeper genes. It may also
be useful to use detected genes here, so the normalization
only considers those genes defined as detected by
the protocol.
minimum_mean
sets a numeric threshold and requires
the mean expression (shown on the x-axis of the MA-plot)
to be at least this value.
Note that when both controlGenes
and minimum_mean
are defined, both criteria are enforced. So the controlGenes
are also required to have expression of at least minimum_mean
.
Also note that all rows of data are normalized by this method,
only the subset of rows defined by controlGenes
and minimum_mean
are used to compute the normalization factor.
Batch adjustment performed by
limma::removeBatchEffect()
which is intended to apply
batch-adjustment as a form of normalization, but which
does not represent full normalization itself. There are
two relevant parameters: "batch"
which is a vector of
batch values in order of colnames(x)
, and "group"
which is a vector of sample groups in order of colnames(x)
.
normgroup
The normgroup
argument is intended as a convenient method to
apply a normalization method to each independent normgroup
.
This situation is especially useful when a study contains
multiple tissue types, or multiple data types, that may not be
appropriate to normalize directly relative to one another.
For example, one could normalize total RNA-seq and nascent 4sU-seq
data independently, without expectation that the two would ever
have a common frame of reference to normalize one relative to another.
However, both may be amenable to "quantile"
or
"jammanorm"
median normalization.
Similarly, one could normalize each tissue type independently, which may be appropriate when analyzing data that contains very different mammalian tissue organ samples, such as muscle and brain.
It would generally not be appropriate to use quantile normalization across muscle and brain samples, since the overall pattern and distribution of expression values is not expected to be similar. Quantile normalize assumes (and imposes) a common distribution, by adjusting mean expression signal at each quantile to a common mean expression across all samples.
For a rough approximation of cross-tissue normalization, one
could apply "quantile"
normalization within each normgroup
defined
by tissue type, then apply "jammanorm"
median normalization to
apply a linear adjustment of signal across tissue types. The median
normalization does not affect distribution, thus will not affect
intra-tissue contrasts, except by adjusting its overall signal
which may change downstream assumptions regarding signal thresholds.
It is recommended not to compare directly across tissue types. In some cases a two-way contrast may be appropriate, where fold change within one tissue type is compared to the fold change within another tissue type. However, even in that case the two tissue types do not need to be normalized relative to each other upfront - the within-tissue fold change serves as one method of normalizing the observations across tissue types.
Note the floor
and enforce_norm_floor
have recommended
default values floor=0
and enforce_norm_floor=TRUE
.
floor
is applied prior to normalization, typically
to minimize effects of low, noisy signal on the normalization
process itself. Specifically, this floor is used to remove negative
values, which may be by-products of upstream signal processing.
"A measured signal at or below the noise floor of a platform
technology is effectively the same as a signal at the noise floor."
enforce_norm_floor
is applied after normalization, typically
as a convenience, also to prevent low, noisy signal from
contributing to downstream analysis steps.
These defaults will set any assay value at or below 0
to 0
,
and after normalization any values whose input values were
at or below 0
will also be set to 0
to prevent normalizing
a value of 0
to non-zero. Any normalized value at or
below 0
will also be set to 0
to prevent results from
containing negative normalized values.
The assumption for this default is that a value of zero
is not a measurement but represents the lack of a measurement.
Similarly, the intent of floor
is a numeric threshold at or
below there is no confidence in the reported measurement, therefore
values at or below this threshold are treated as equivalent
to the threshold for the purpose of downstream analyses.
Some platforms like QPCR for example, have substantially lower
confidence at high CT values, where expression values
using the equation 2^(40-CT)
might impose a noise threshold
at expression 32 or lower. This noise threshold for QPCR
means any expression measurement of 32 or lower is as likely
to be 32
as it is to be 2
, and therefore any differences
between reported expression of 32
and 2
should not be
considered relevant. Applying floor=32
in this case
accomplishes this goal by setting all values at or below
32
to 32
. Of course when using this method matrix_normalize()
the data should be log2 transformed, which means the floor
should also be log2 transformed, e.g. floor=log2(32)
which is floor=5
.
One alternative might be to set values at or below zero to NA
prior to normalization, and before calling matrix_normalize()
.
In this case, only non-NA values will be used during
normalization according to the method
being used.
Other jamses utilities:
choose_annotation_colnames()
,
contrast2comp_dev()
,
fold_to_log2fold()
,
intercalate()
,
list2im_opt()
,
log2fold_to_fold()
,
make_block_arrow_polygon()
,
mark_stat_hits()
,
point_handedness()
,
point_slope_intercept()
,
shortest_unique_abbreviation()
,
shrinkDataFrame()
,
shrink_df()
,
shrink_matrix()
,
sort_samples()
,
strsplitOrdered()
,
sub_split_vector()
,
update_function_params()
,
update_list_elements()
# use farrisdata real world data if available
if (jamba::check_pkg_installed("farrisdata")) {
suppressPackageStartupMessages(library(SummarizedExperiment))
# test matrix_normalize()
GeneSE <- farrisdata::farrisGeneSE;
imatrix <- assays(GeneSE)$raw_counts;
genes <- rownames(imatrix);
samples <- colnames(imatrix);
head(imatrix);
# matrix_normalize()
# normalize the numeric matrix directly
imatrix_norm <- matrix_normalize(imatrix,
genes=genes,
samples=samples,
method="jammanorm",
params=list(minimum_mean=5))
names(attributes(imatrix_norm))
# review normalization factors
round(digits=3, attr(imatrix_norm, "nf"));
# example for quantile normalization
imatrix_quant <- matrix_normalize(imatrix,
genes=genes,
samples=samples,
method="quantile")
names(attributes(imatrix_quant))
}
# simulate reasonably common expression matrix
set.seed(123);
x <- matrix(rnorm(9000)/4, ncol=9);
colnames(x) <- paste0("sample", LETTERS[1:9]);
rownames(x) <- paste0("gene", jamba::padInteger(seq_len(nrow(x))))
rowmeans <- rbeta(nrow(x), shape1=2, shape2=5)*14+2;
x <- x + rowmeans;
for (i in 1:9) {
x[,i] <- x[,i] + rnorm(1);
}
# display MA-plot with jamma::jammaplot()
jamma::jammaplot(x)
# normalize by jammanorm
xnorm <- matrix_normalize(x, method="jammanorm")
jamma::jammaplot(xnorm, maintitle="method='jammanorm'")
# normalize by jammanorm with housekeeper genes
hk_genes <- sample(rownames(x), 10);
xnormhk <- matrix_normalize(x,
method="jammanorm",
params=list(jammanorm=list(controlGenes=hk_genes)))
jamma::jammaplot(xnormhk,
maintitle="method='jammanorm' with housekeeper genes",
highlightPoints=list(housekeepers=hk_genes),
highlightColor="green");
xnormhk6 <- matrix_normalize(x,
method="jammanorm",
params=list(jammanorm=list(
controlGenes=hk_genes,
minimum_mean=6)))
hk_used <- attr(xnormhk6, "hk")[[1]];
jamma::jammaplot(xnormhk6,
maintitle="method='jammanorm' with housekeeper genes, minimum_mean=6",
highlightPoints=list(housekeepers=hk_genes,
hk_used=hk_used),
highlightColor=c("red", "green"));
# normalize by quantile
xquant <- matrix_normalize(x, method="quantile")
jamma::jammaplot(xquant,
maintitle="method='quantile'")
# simulate higher noise for lower signal
rownoise <- rnorm(prod(dim(x))) * (3 / ((rowmeans*1.5) - 1.5));
xnoise <- x;
xnoise <- xnoise + rownoise;
jamma::jammaplot(xnoise,
maintitle="simulated higher noise at lower signal");
# simulate non-linearity across signal
# sin(seq(from=pi*4/10, to=pi*7/10, length.out=100))-0.8
rowadjust <- (sin(pi * jamba::normScale(rowmeans, from=3.5/10, to=5.5/10)) -0.9) * 20;
xwarp <- xnoise;
xwarp[,3] <- xnoise[,3] + rowadjust;
jamma::jammaplot(xwarp,
maintitle="signal-dependent noise and non-linear effects");
# quantile-normalization is indicated for this scenario
xwarpnorm <- matrix_normalize(xwarp,
method="quantile");
jp <- jamma::jammaplot(xwarpnorm,
maintitle="quantile-normalized: signal-dependent noise and non-linear effects");
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.