#' Normalize SummarizedExperiment data
#'
#' Normalize SummarizedExperiment data
#'
#' This function applies one or more data normalization methods
#' to an input `SummarizedExperiment` object. The normalization is
#' applied to one or more matrix data stored in `assays(se)`,
#' each one is run independently.
#'
#' Note that supplying `genes` and `samples` will apply normalization
#' to only those `genes` and `samples`, and this data will be
#' stored in the full `SummarizedExperiment` object `se` with
#' `NA` values used to fill any values not present in `genes`
#' or `samples`.
#'
#' For example if `assay_names` contains two assay names,
#' and `method` contains two methods, the output will include
#' four normalizations, where each assay name is normalized two ways.
#' The output assay names will be something like `"assay1_method1"`,
#' `"assay1_method2"`, `"assay2_method1"`, `"assay2_method2"`.
#' It is not always necessary to normalize data by multiple different
#' methods, however when two methods are similar and need to be
#' compared, the `SummarizedExperiment` object is a convenient
#' place to store different normalization results for downstream
#' comparison. Further, the method `se_contrast_stats()` is able
#' to apply equivalent statistical contrasts to each normalization,
#' and returns an array of statistical hits which is convenient
#' for direct comparison of results.
#'
#' This method calls `matrix_normalize()` to perform each normalization
#' step, see that function description for details on each method.
#'
#' @family jamses SE utilities
#'
#' @return `SummarizedExperiment` object where the normalized output
#' is added to `assays(se)` using the naming format `method_assayname`.
#'
#' @param se `SummarizedExperiment` object
#' @param method `character` vector indicating which normalization method(s)
#' to apply.
#' * `"quantile"`: quantile normalization via `limma::normalizeQuantiles()`
#' * `"jammanorm"`: log-ratio normalization via `jamma::jammanorm()`
#' * `"limma_batch_adjust"`: batch adjustment via
#' `limma::removeBatchEffect()`, recommended for data visualization,
#' but not recommended for downstream statistical comparisons.
#' * `"TMM"`: trimmed mean of M-values via `edgeR::calcNormFactors()`
#' * `"TMMwsp"`: TMM with singleton pairing via `edgeR::calcNormFactors()`
#' * `"RLE"`: relative log expression via `edgeR::calcNormFactors()`
#' @param assay_names `character` vector or one or more `names(assays(se))`
#' that indicates which numeric matrix to use during normalization. When
#' multiple values are provided, each matrix is normalized independently
#' by each `method`.
#' @param output_method_prefix `character` vector (optional) with custom
#' method prefix values to use when creating the new `assay_name` for
#' each normalization. It must have length equal to `length(method)`,
#' to be applied to each method in order.
#' Note that `output_assay_names` takes priority, and when it is defined
#' the `output_method_prefix` entries are ignored.
#'
#' Consider these arguments:
#' ```R
#' assay_name="counts",
#' method="limma_batch_adjust",
#' output_method_prefix="lba"
#' ```
#' The assay_name created during normalization will be `"lba_counts"`.
#' @param output_assay_names `character` vector (optional) which overrides
#' the default method for defining assay names for normalized data.
#' This vector length must equal `length(method) * length(assay_names)`,
#' and will be applied in the order data is normalized:
#' 1. `assay_names` are iterated.
#' 2. For each value in `assay_names`, each normalization in `method`
#' is applied.
#'
#' Therefore the order of `output_assay_names` could follow this order:
#' `method1_assay1`, `method1_assay2`, `method2_assay1`, `method2_assay2`.
#' @param genes `character` vector (optional) used to define a subset of
#' gene rows in `se` to use for normalization.
#' Values must match `rownames(se)`.
#' @param samples `character` vector (optional) used to define a subset of
#' sample columns in `se` to use for normalization.
#' Values must match `colnames(se)`.
#' @param params `list` (optional) parameters specific to each
#' normalization method, passed to `matrix_normalize()`. Any
#' value which is not defined in the `params` provided will use
#' the default value in `matrix_normalize()`, for example
#' `params=list(jammanorm=list(minimum_mean=2))` will use
#' `minimum_mean=2` then use other default values relevant
#' to the `jammanorm` normalization method.
#' @param normgroup `character` or equivalent vector that defines subgroups
#' of `samples` to be normalized indendently of each normgroup. When
#' `NULL` then all data is normalized together as default.
#' The `normgroup` vector is expected to be in the same order as
#' `samples`, or `names(normgroup)` must contain all `samples`.
#' @param output_sep `character` string used as a delimited between the
#' `method` and the `assay_names` to define the output assay name,
#' for example when `assay_name="counts"`, `method="quantile"`,
#' and `output_sep="_"` the new assay name will be `"quantile_counts"`.
#' @param override `logical` indicating whether to override any pre-existing
#' matrix values with the same output assay name. When `override=FALSE`
#' and the output assay name already exists, the normalization will
#' not be performed.
#' @param populate_mcols `logical` indicating whether to populate
#' normalization details into `mcols(assays(se))`, including
#' the normalization `method`,
#' the source `assay_name` used during normalization, and
#' values from `params`.
#' @param verbose `logical` indicating whether to print verbose output.
#' @param ... additional arguments are passed to `matrix_normalize()`.
#'
#' @examples
#' if (jamba::check_pkg_installed("farrisdata")) {
#'
#' # se_normalize
#' # suppressPackageStartupMessages(library(SummarizedExperiment))
#' GeneSE <- farrisdata::farrisGeneSE;
#' samples <- colnames(GeneSE);
#' genes <- rownames(GeneSE);
#'
#' GeneSE <- se_normalize(GeneSE,
#' genes=genes,
#' samples=samples,
#' assay_names=c("raw_counts", "counts"),
#' method="jammanorm",
#' params=list(jammanorm=list(minimum_mean=5)))
#' SummarizedExperiment::mcols(SummarizedExperiment::assays(GeneSE))
#' names(SummarizedExperiment::assays(GeneSE))
#'
#' # review normalization factor values
#' round(digits=3, attr(
#' SummarizedExperiment::assays(GeneSE)$jammanorm_raw_counts, "nf"))
#'
#' # the data in "counts" was already normalized
#' # so the normalization factors are very near 0 as expected
#' round(digits=3,
#' attr(SummarizedExperiment::assays(GeneSE)$jammanorm_counts, "nf"))
#'
#'
#' # note that housekeeper genes are supplied in params
#' # also this demonstrates output_method_prefix
#' set.seed(123);
#' hkgenes <- sample(rownames(GeneSE), 1000)
#' GeneSE <- se_normalize(GeneSE,
#' genes=genes,
#' samples=samples,
#' assay_names=c("raw_counts"),
#' method="jammanorm",
#' output_method_prefix="hkjammanorm",
#' params=list(jammanorm=list(minimum_mean=5,
#' controlGenes=hkgenes)))
#' SummarizedExperiment::mcols(SummarizedExperiment::assays(GeneSE))
#'
#' # example showing quantile normalization
#' GeneSE <- se_normalize(GeneSE,
#' assay_names=c("raw_counts"),
#' method="quantile")
#' SummarizedExperiment::mcols(SummarizedExperiment::assays(GeneSE))
#'
#' # example showing quantile normalization with custom output_assay_names
#' GeneSE <- se_normalize(GeneSE,
#' assay_names=c("raw_counts"),
#' method="quantile",
#' output_assay_names="newquantile_raw_counts")
#' SummarizedExperiment::mcols(SummarizedExperiment::assays(GeneSE))
#' }
#'
#'
#' @export
se_normalize <- function
(se,
method=c("quantile",
"jammanorm",
"limma_batch_adjust",
"TMM",
"TMMwsp",
"RLE"),
assay_names=NULL,
output_method_prefix=NULL,
output_assay_names=NULL,
genes=NULL,
samples=NULL,
params=list(
`quantile`=list(
ties=TRUE),
`jammanorm`=list(controlGenes=NULL,
minimum_mean=0,
controlSamples=NULL,
centerGroups=NULL,
useMedian=FALSE,
noise_floor=NULL,
noise_floor_value=NULL),
`limma_batch_adjust`=list(
batch=NULL,
group=NULL),
`TMM`=list(
refColumn=NULL,
logratioTrim=0.3,
sumTrim=0.05,
doWeighting=TRUE,
Acutoff=NULL),
`TMMwsp`=list(
refColumn=NULL,
logratioTrim=0.3,
sumTrim=0.05,
doWeighting=TRUE,
Acutoff=NULL),
`RLE`=list(
refColumn=NULL,
logratioTrim=0.3,
sumTrim=0.05,
doWeighting=TRUE,
Acutoff=NULL)),
normgroup=NULL,
floor=0,
enforce_norm_floor=TRUE,
output_sep="_",
override=TRUE,
populate_mcols=TRUE,
verbose=FALSE,
...)
{
assay_names <- intersect(assay_names,
names(SummarizedExperiment::assays(se)));
if (length(assay_names) == 0) {
cli::cli_abort(paste0(
"{.var assay_names} must be supplied."));
stop("assay_names must be supplied.");
}
method <- match.arg(method,
several.ok=TRUE);
if (length(output_method_prefix) > 0) {
if (!length(output_method_prefix) == length(method)) {
cli::cli_abort(paste0(
"{.var length(method)}={length(method)}",
" must equal {.var length(output_method_prefix)}=",
"{length(output_method_prefix)}. ",
"Cannot continue."))
}
} else {
output_method_prefix <- method;
}
# validate output_sep
if (length(output_sep) == 0) {
output_sep <- "";
}
output_sep <- head(output_sep, 1);
allgenes <- rownames(se);
allsamples <- colnames(se);
if (length(genes) == 0) {
genes <- allgenes;
} else {
genes <- intersect(genes, allgenes);
}
if (length(samples) == 0) {
samples <- allsamples;
} else {
samples <- intersect(samples, allsamples);
}
if (length(genes) == 0 || length(samples) == 0) {
cli::cli_abort(paste0(
"Only recognized {.var length(genes)}={length(genes)}",
" and {.var length(samples)}={length(samples)} ",
"in the input {.var se}. ",
"Cannot continue."))
stop(paste0("Only recognized ",
length(genes),
" genes, and ",
length(samples),
" samples in the input se."));
}
# optional normgroup
if (length(normgroup) > 0) {
if (length(names(normgroup)) > 0) {
if (!all(samples %in% names(normgroup))) {
cli::cli_abort(paste0(
"When {.var names(normgroup)} is defined ",
"all {.var samples} must be present in ",
"{.var names(normgroup)}, ",
"and this requirement was not met."));
stop(paste(
"When names(normgroup) is defined,",
"all samples must be present in names(normgroup).",
"This requirement was not met."));
}
normgroup <- normgroup[match(samples, names(normgroup))];
} else {
if (length(normgroup) != length(samples)) {
cli::cli_abort(paste0(
"When {.var names(normgroup)} is not defined ",
"length(normgroup)=",
"{length(normgroup)}",
", must equal ",
"length(samples)=",
"{length(samples)}."))
stop(paste(
"When names(normgroup) are not defined,",
"length(normgroup) must equal length(samples).",
"This requirement was not met."));
}
}
}
# define output_assay_names
expected_output_length <- length(assay_names) * length(method);
# define output_assay_names when none are supplied
if (length(output_assay_names) == 0) {
output_assay_names <- paste0(
rep(output_method_prefix, length(assay_names)),
output_sep,
rep(assay_names, each=length(method)));
}
# define names(output_assay_names)
if (length(output_assay_names) == expected_output_length) {
names(output_assay_names) <- paste0(
rep(assay_names, each=length(method)),
"_",
rep(method, length(assay_names)));
} else {
if (length(output_assay_names) > 0) {
cli::cli_abort(paste0(
"length(output_assay_names)=",
"{length(output_assay_names)}",
", must equal ",
"( length(assay_names) * length(method) )=",
"{length(assay_names) * length(method)}."))
stop(paste0("length(output_assay_names) must equal ",
"length(assay_names) * length(method)"))
}
}
# iterate assay_names, within which iterate method
for (assay_name in assay_names) {
for (imethod in method) {
output_assay_name <- unname(output_assay_names[
paste0(assay_name, "_", imethod)]);
if (output_assay_name %in% names(SummarizedExperiment::assays(se)) &&
!TRUE %in% override) {
if (verbose) {
jamba::printDebug("se_normalize(): ",
sep="",
c("Skipped, normalized data already exists for '",
output_assay_name,
"'"));
}
next;
}
if (verbose) {
jamba::printDebug("se_normalize(): ",
c("Applying '", imethod, "' to '",
assay_name, "' to store in '",
output_assay_name,
"'"),
sep="");
}
imatrix <- SummarizedExperiment::assays(
se[genes, samples])[[assay_name]];
inorm <- matrix_normalize(imatrix,
method=imethod,
params=params,
normgroup=normgroup,
verbose=(verbose - 1) > 0,
floor=floor,
enforce_norm_floor=enforce_norm_floor,
...);
# generate matrix of NA values to fill for normalized genes, samples
# this way we can add attributes to the full matrix
# when normalizing a smaller matrix
namatrix <- matrix(data=NA,
ncol=length(allsamples),
nrow=length(allgenes),
dimnames=list(allgenes,
allsamples));
# assign normalized data for genes, samples
namatrix[genes, samples] <- inorm;
# re-assign any missing attributes
# potentially useful information from the normalization method
na_attrnames <- names(attributes(namatrix));
new_attrs <- setdiff(names(attributes(inorm)), na_attrnames);
if (length(new_attrs) > 0) {
if (verbose > 1) {
jamba::printDebug("se_normalize(): ",
"re-assigning new_attrs: ",
sep=", ",
new_attrs);
}
for (new_attr in new_attrs) {
attr(namatrix, new_attr) <- attr(inorm, new_attr);
}
}
# assign namatrix to se
SummarizedExperiment::assays(se)[[output_assay_name]] <- namatrix;
# Bonus points: populate mcols() with normalization details
if (TRUE %in% populate_mcols) {
# create default DataFrame if it does not exist
if (length(SummarizedExperiment::mcols(
SummarizedExperiment::assays(se))) == 0) {
SummarizedExperiment::mcols(
SummarizedExperiment::assays(se)) <- S4Vectors::DataFrame(
assay_name=SummarizedExperiment::assayNames(se),
row.names=SummarizedExperiment::assayNames(se))
}
# define list to update mcols()
mcol_list <- jamba::rmNULL(c(
list(
assay_name=output_assay_name,
normalization_method=imethod,
source_assay_name=assay_name),
params[[imethod]]));
if (verbose > 1) {
jamba::printDebug("se_normalize(): ",
"Populating mcol_list into mcols(assays(se)):");
for (mname in names(mcol_list)) {
jamba::printDebug(paste0(mname, ": "),
indent=6,
mcol_list[[mname]])
}
}
mcolnames <- colnames(SummarizedExperiment::mcols(
SummarizedExperiment::assays(se)));
for (newcolname in names(mcol_list)) {
mvalues <- mcol_list[[newcolname]];
mna <- head(c(NA, mvalues), 1);
if (!newcolname %in% mcolnames) {
SummarizedExperiment::mcols(
SummarizedExperiment::assays(se))[,newcolname] <- mna;
}
if (length(mvalues) == 1) {
SummarizedExperiment::mcols(
SummarizedExperiment::assays(se))[
output_assay_name, newcolname] <- mvalues;
} else if (length(mvalues) > 1) {
SummarizedExperiment::mcols(
SummarizedExperiment::assays(se))[
output_assay_name, newcolname] <- I(list(mvalues));
}
}
# end names(mcol_list)
}
# end mcols(assays(se))
}
}
return(se);
}
#' Normalize a numeric data matrix
#'
#' Normalize a numeric data matrix
#'
#' This function is a wrapper for several relevant normalization
#' methods that operate on a numeric matrix.
#'
#' # Normalization Methods Implemented:
#'
#' ## method='quantile'
#'
#' Quantile-normalization performed by
#' `limma::normalizeQuantiles()`. This method has one
#' parameter `"ties"` passed to `limma::normalizeQuantiles()`,
#' the default here `ties=TRUE` which handles tied numeric
#' expression values in a robust way to avoid unpredictability
#' otherwise. This option is especially relevant with expression
#' count data, where integer counts cause a large number
#' of values to be represented multiple times.
#'
#' ## method='jammanorm'
#'
#' Median-normalization performed by
#' `jamma::jammanorm()`. This method shifts expression
#' data as shown on MA-plots, so the median expression
#' is zero across all samples, using only the rows that
#' meet the relevant criteria.
#'
#' Some relevant criteria to
#' define rows used for normalization:
#'
#' * `controlGenes` defines specific genes to use for
#' normalization, such as housekeeper genes. It may also
#' be useful to use detected genes here, so the normalization
#' only considers those genes defined as detected by
#' the protocol.
#' * `minimum_mean` sets a numeric threshold and requires
#' the mean expression (shown on the x-axis of the MA-plot)
#' to be at least this value.
#'
#' Note that when both `controlGenes` and `minimum_mean`
#' are defined, both criteria are enforced. So the `controlGenes`
#' are also required to have expression of at least `minimum_mean`.
#'
#' Also note that all rows of data are normalized by this method,
#' only the subset of rows defined by `controlGenes` and `minimum_mean`
#' are used to compute the normalization factor.
#'
#'
#' ## method='limma_batch_adjust'
#'
#' Batch adjustment performed by
#' `limma::removeBatchEffect()` which is intended to apply
#' batch-adjustment as a form of normalization, but which
#' does not represent full normalization itself. There are
#' two relevant parameters: `"batch"` which is a vector of
#' batch values in order of `colnames(x)`, and `"group"`
#' which is a vector of sample groups in order of `colnames(x)`.
#'
#' # Normalization groups via `normgroup`
#'
#' The `normgroup` argument is intended as a convenient method to
#' apply a normalization method to each independent `normgroup`.
#' This situation is especially useful when a study contains
#' multiple tissue types, or multiple data types, that may not be
#' appropriate to normalize directly relative to one another.
#'
#' For example, one could normalize total RNA-seq and nascent 4sU-seq
#' data independently, without expectation that the two would ever
#' have a common frame of reference to normalize one relative to another.
#' However, both may be amenable to `"quantile"` or
#' `"jammanorm"` median normalization.
#'
#' Similarly, one could normalize each tissue type independently,
#' which may be appropriate when analyzing data that contains very
#' different mammalian tissue organ samples, such as muscle and brain.
#'
#' It would generally not be appropriate to use quantile normalization
#' across muscle and brain samples, since the overall pattern and
#' distribution of expression values is not expected to be similar.
#' Quantile normalize assumes (and imposes) a common distribution,
#' by adjusting mean expression signal at each quantile to a common
#' mean expression across all samples.
#'
#' For a rough approximation of cross-tissue normalization, one
#' could apply `"quantile"` normalization within each `normgroup` defined
#' by tissue type, then apply `"jammanorm"` median normalization to
#' apply a linear adjustment of signal across tissue types. The median
#' normalization does not affect distribution, thus will not affect
#' intra-tissue contrasts, except by adjusting its overall signal
#' which may change downstream assumptions regarding signal thresholds.
#'
#' It is recommended not to compare directly across tissue types.
#' In some cases a two-way contrast may be appropriate, where
#' fold change within one tissue type is compared to the
#' fold change within another tissue type. However, even in that
#' case the two tissue types do not need to be normalized relative to
#' each other upfront - the within-tissue fold change serves as
#' one method of normalizing the observations across tissue types.
#'
#' # Other useful parameters
#'
#' Note the `floor` and `enforce_norm_floor` have recommended
#' default values `floor=0` and `enforce_norm_floor=TRUE`.
#'
#' * `floor` is applied prior to normalization, typically
#' to minimize effects of low, noisy signal on the normalization
#' process itself. Specifically, this floor is used to remove negative
#' values, which may be by-products of upstream signal processing.
#' "A measured signal at or below the noise floor of a platform
#' technology is effectively the same as a signal at the noise floor."
#' * `enforce_norm_floor` is applied after normalization, typically
#' as a convenience, also to prevent low, noisy signal from
#' contributing to downstream analysis steps.
#'
#' These defaults will set any assay value at or below `0` to `0`,
#' and after normalization any values whose input values were
#' at or below `0` will also be set to `0` to prevent normalizing
#' a value of `0` to non-zero. Any normalized value at or
#' below `0` will also be set to `0` to prevent results from
#' containing negative normalized values.
#'
#' The assumption for this default is that a value of zero
#' is not a measurement but represents the lack of a measurement.
#' Similarly, the intent of `floor` is a numeric threshold at or
#' below there is no confidence in the reported measurement, therefore
#' values at or below this threshold are treated as equivalent
#' to the threshold for the purpose of downstream analyses.
#'
#' Some platforms like QPCR for example, have substantially lower
#' confidence at high CT values, where expression values
#' using the equation `2^(40-CT)` might impose a noise threshold
#' at expression 32 or lower. This noise threshold for QPCR
#' means any expression measurement of 32 or lower is as likely
#' to be `32` as it is to be `2`, and therefore any differences
#' between reported expression of `32` and `2` should not be
#' considered relevant. Applying `floor=32` in this case
#' accomplishes this goal by setting all values at or below
#' `32` to `32`. Of course when using this method `matrix_normalize()`
#' the data should be log2 transformed, which means the `floor`
#' should also be log2 transformed, e.g. `floor=log2(32)`
#' which is `floor=5`.
#'
#' One alternative might be to set values at or below zero to `NA`
#' prior to normalization, and before calling `matrix_normalize()`.
#' In this case, only non-NA values will be used during
#' normalization according to the `method` being used.
#'
#' @returns `numeric` matrix with the same dimensions as the
#' input matrix `x`.
#' Additional information may be returned as `attributes(x)`:
#' * `"norm_method"`: a `character` string with the method used.
#' * `"nf"`: a `numeric` vector with normalization factors, returned
#' only by `"jammanorm"`, `"TMM"`, and `"TMMwsp"`.
#' * `"hk"`: a `character` vector of `rownames(x)` used as housekeeper
#' `controlGenes` by `"jammanorm"`.
#'
#' @family jamses utilities
#'
#' @param x `numeric` matrix with sample columns, and typically
#' gene rows, but any measured assay row will meet the assumptions
#' of the method.
#' @param method `character` string indicating which normalization
#' method to apply.
#' * `"quantile"`: quantile normalization via `limma::normalizeQuantiles()`
#' * `"jammanorm"`: log-ratio normalization via `jamma::jammanorm()`
#' * `"limma_batch_adjust"`: batch adjustment via
#' `limma::removeBatchEffect()`, recommended for data visualization,
#' but not recommended for downstream statistical comparisons.
#' * `"TMM"`: trimmed mean of M-values via `edgeR::calcNormFactors()`
#' * `"TMMwsp"`: TMM with singleton pairing via `edgeR::calcNormFactors()`
#' * `"RLE"`: relative log expression via `edgeR::calcNormFactors()`
#' @param apply_log2 `character` string indicating whether to apply
#' log2 transformation: `"ifneeded"` will apply log2 transform
#' when any absolute value is greater than 40; `"no"` will not
#' apply log2 transformation; `"always"` will apply log2 transform.
#' Note the log2 transform is applied with `jamba::log2signed(x, offset=1)`
#' which is equivalent to `log(1 + x)` except that negative values
#' are also transformed using the absolute value, then multiplied
#' by their original sign.
#' @param floor `numeric` value indicating the lowest accepted numeric
#' value, below which values are assigned to this floor. The default
#' `floor=0` requires all values are `0`, and any values below `0` are
#' assigned `0`. Note that the `floor` is applied after log2 transform,
#' when the log2 transform is performed.
#' @param floor_value `numeric` or `NA` used to replace values in `x`
#' when they are at or below `floor`, default `floor_value=floor`,
#' however it can be useful to assign `NA` to replace zero in
#' circumstances when that is preferable.
#' @param enforce_norm_floor `logical` indicating whether to enforce the
#' `floor` for the normalized results, default is `TRUE`. For example,
#' when `floor=0` any values at or below `0` are set to `0` before
#' normalization. After normalization some of these values will be
#' above or below `0`. When `enforce_norm_floor=TRUE` these values
#' will again be set to `0` because they are considered to be
#' below the noise threshold of the protocol, and adjustments
#' are not relevant; also any normalized values below the `floor`
#' will also be set to `floor`.
#' @param params `list` of parameters relevant to the `method` of
#' normalization. The `params` should be a `list` named by the `method`,
#' whose values are a list named by the relevant method parameter.
#' See examples.
#' @param normgroup `character` or equivalent vector that defines subgroups
#' of `samples` to be normalized indendently of each normgroup. When
#' `NULL` then all data is normalized together as default.
#' The `normgroup` vector is expected to be in the order of
#' `colnames(x)` in the same order.
#' @param subset_columns `integer` intended for internal use when
#' `normgroups` is provided. This argument is used to instruct
#' each normalization method to use an appropriate subset of
#' `params` based upon the subset of columns being analyzed.
#' @param verbose `logical` indicating whether to print verbose output.
#'
#' @examples
#' # use farrisdata real world data if available
#' if (jamba::check_pkg_installed("farrisdata")) {
#'
#' suppressPackageStartupMessages(library(SummarizedExperiment))
#'
#' # test matrix_normalize()
#' GeneSE <- farrisdata::farrisGeneSE;
#' imatrix <- assays(GeneSE)$raw_counts;
#' genes <- rownames(imatrix);
#' samples <- colnames(imatrix);
#' head(imatrix);
#'
#' # matrix_normalize()
#' # normalize the numeric matrix directly
#' imatrix_norm <- matrix_normalize(imatrix,
#' genes=genes,
#' samples=samples,
#' method="jammanorm",
#' params=list(minimum_mean=5))
#' names(attributes(imatrix_norm))
#'
#' # review normalization factors
#' round(digits=3, attr(imatrix_norm, "nf"));
#'
#' # example for quantile normalization
#' imatrix_quant <- matrix_normalize(imatrix,
#' genes=genes,
#' samples=samples,
#' method="quantile")
#' names(attributes(imatrix_quant))
#' }
#'
#'
#' # simulate reasonably common expression matrix
#' set.seed(123);
#' x <- matrix(rnorm(9000)/4, ncol=9);
#' colnames(x) <- paste0("sample", LETTERS[1:9]);
#' rownames(x) <- paste0("gene", jamba::padInteger(seq_len(nrow(x))))
#' rowmeans <- rbeta(nrow(x), shape1=2, shape2=5)*14+2;
#' x <- x + rowmeans;
#' for (i in 1:9) {
#' x[,i] <- x[,i] + rnorm(1);
#' }
#'
#' # display MA-plot with jamma::jammaplot()
#' jamma::jammaplot(x)
#'
#' # normalize by jammanorm
#' xnorm <- matrix_normalize(x, method="jammanorm")
#' jamma::jammaplot(xnorm, maintitle="method='jammanorm'")
#'
#' # normalize by jammanorm with housekeeper genes
#' hk_genes <- sample(rownames(x), 10);
#' xnormhk <- matrix_normalize(x,
#' method="jammanorm",
#' params=list(jammanorm=list(controlGenes=hk_genes)))
#'
#' jamma::jammaplot(xnormhk,
#' maintitle="method='jammanorm' with housekeeper genes",
#' highlightPoints=list(housekeepers=hk_genes),
#' highlightColor="green");
#'
#' xnormhk6 <- matrix_normalize(x,
#' method="jammanorm",
#' params=list(jammanorm=list(
#' controlGenes=hk_genes,
#' minimum_mean=6)))
#' hk_used <- attr(xnormhk6, "hk")[[1]];
#' jamma::jammaplot(xnormhk6,
#' maintitle="method='jammanorm' with housekeeper genes, minimum_mean=6",
#' highlightPoints=list(housekeepers=hk_genes,
#' hk_used=hk_used),
#' highlightColor=c("red", "green"));
#'
#' # normalize by quantile
#' xquant <- matrix_normalize(x, method="quantile")
#' jamma::jammaplot(xquant,
#' maintitle="method='quantile'")
#'
#' # simulate higher noise for lower signal
#' rownoise <- rnorm(prod(dim(x))) * (3 / ((rowmeans*1.5) - 1.5));
#' xnoise <- x;
#' xnoise <- xnoise + rownoise;
#' jamma::jammaplot(xnoise,
#' maintitle="simulated higher noise at lower signal");
#'
#' # simulate non-linearity across signal
#' # sin(seq(from=pi*4/10, to=pi*7/10, length.out=100))-0.8
#' rowadjust <- (sin(pi * jamba::normScale(rowmeans, from=3.5/10, to=5.5/10)) -0.9) * 20;
#' xwarp <- xnoise;
#' xwarp[,3] <- xnoise[,3] + rowadjust;
#' jamma::jammaplot(xwarp,
#' maintitle="signal-dependent noise and non-linear effects");
#'
#' # quantile-normalization is indicated for this scenario
#' xwarpnorm <- matrix_normalize(xwarp,
#' method="quantile");
#' jp <- jamma::jammaplot(xwarpnorm,
#' maintitle="quantile-normalized: signal-dependent noise and non-linear effects");
#'
#' @export
matrix_normalize <- function
(x,
method=c("quantile",
"jammanorm",
"limma_batch_adjust",
"TMM",
"TMMwsp",
"RLE"),
apply_log2=c("ifneeded",
"no",
"always"),
floor=0,
floor_value=floor,
enforce_norm_floor=TRUE,
params=list(
#`vsn`=list(lts.quantile=0.5),
#`rsn`=list(excludeFold=2,
# span=0.03),
#cyclicLoess=list(method="default",
# span=0.8),
`quantile`=list(
ties=TRUE),
`jammanorm`=list(controlGenes=NULL,
minimum_mean=0,
controlSamples=NULL,
centerGroups=NULL,
useMedian=FALSE,
noise_floor=NULL,
noise_floor_value=NULL),
`limma_batch_adjust`=list(
batch=NULL,
group=NULL),
`TMM`=list(
refColumn=NULL,
logratioTrim=0.3,
sumTrim=0.05,
doWeighting=TRUE,
Acutoff=NULL),
`TMMwsp`=list(
refColumn=NULL,
logratioTrim=0.3,
sumTrim=0.05,
doWeighting=TRUE,
Acutoff=NULL)),
normgroup=NULL,
subset_columns=NULL,
debug=FALSE,
verbose=TRUE,
...)
{
method <- match.arg(method);
apply_log2 <- match.arg(apply_log2);
## Update the default methodParams without losing the original defaults if not overriden
params <- update_function_params("matrix_normalize",
"params",
params,
verbose=FALSE);
# optional normgroup
if (length(normgroup) > 0) {
if (length(normgroup) != ncol(x)) {
stop(paste("length(normgroup) was not equal to ncol(x)."));
}
x_colnames <- seq_len(ncol(x));
x_split <- split(x_colnames, normgroup);
x_split_v <- unlist(unname(x_split));
x_split_order <- order(x_split_v);
normgroup_out <- lapply(x_split, function(i_colnames){
matrix_normalize(x=x[, i_colnames, drop=FALSE],
method=method,
apply_log2=apply_log2,
floor=floor,
floor_value=floor_value,
enforce_norm_floor=enforce_norm_floor,
params=params,
normgroup=NULL,
subset_columns=i_colnames,
verbose=verbose,
...);
});
#return(normgroup_out);
# combine normalized matrices
normgroup_x <- do.call(cbind, normgroup_out)[, x_split_order, drop=FALSE];
# combine attributes
normgroup_attrnames <- unique(unlist(lapply(normgroup_out, function(inorm){
attr_names <- setdiff(names(attributes(inorm)),
c("dim", "names", "dimnames"))
})))
if (length(normgroup_attrnames) > 0) {
normgroup_attrs <- lapply(jamba::nameVector(normgroup_attrnames), function(iattr){
unlist(recursive=FALSE,
lapply(unname(normgroup_out), function(inorm){
attributes(inorm)[[iattr]]
}))[x_split_order]
})
for (iattr in normgroup_attrnames) {
attr(normgroup_x, iattr) <- normgroup_attrs[[iattr]];
}
}
if (debug) {
return(list(
normgroup_out=normgroup_out,
normgroup_attrnames=normgroup_attrnames,
normgroup_x=normgroup_x));
}
return(normgroup_x);
}
# apply log2 transform if needed
if ("ifneeded" %in% apply_log2) {
if (any(abs(x) > 40 & !is.na(x))) {
x <- jamba::log2signed(x,
offset=1);
if (verbose) {
jamba::printDebug("matrix_normalize(): ",
c("Applied ",
"jamba::log2signed(x, offset=1)",
" because ",
"any(abs(x) > 40)"),
sep="");
}
}
} else if ("always" %in% apply_log2) {
x <- jamba::log2signed(x,
offset=1);
if (verbose) {
jamba::printDebug("matrix_normalize(): ",
c("Applied ", "jamba::log2signed(x, offset=1)"),
sep="");
}
}
# apply optional floor
if (length(floor) > 0 && any(x <= floor & !is.na(x))) {
if (length(floor_value) == 0) {
floor_value <- floor;
}
x_floored <- (x <= floor & !is.na(x));
x[x_floored] <- floor_value;
if (verbose) {
jamba::printDebug("matrix_normalize(): ",
c("Applied floor:", floor,
", replacing with:", floor_value),
sep="");
}
} else {
x_floored <- FALSE
}
if ("quantile" %in% method) {
ties <- params$quantile$ties;
if (length(ties) == 0) {
ties <- TRUE;
}
inorm <- limma::normalizeQuantiles(A=x,
ties=ties);
attr(inorm, "norm_method") <- "quantile";
} else if ("jammanorm" %in% method) {
controlGenes <- params$jammanorm$controlGenes;
minimum_mean <- params$jammanorm$minimum_mean;
controlSamples <- params$jammanorm$controlSamples;
centerGroups <- params$jammanorm$centerGroups;
useMedian <- params$jammanorm$useMedian;
noise_floor <- params$jammanorm$noise_floor;
noise_floor_value <- params$jammanorm$noise_floor_value;
if (length(noise_floor) == 0) {
noise_floor <- -Inf;
}
if (length(noise_floor_value) == 0) {
noise_floor_value <- noise_floor;
}
if (verbose) {
jamba::printDebug("matrix_normalize(): ",
"Calling ", "jammanorm():",
c("\n minimum_mean:", minimum_mean,
"\n useMedian:", useMedian,
if (length(noise_floor) > 0 && noise_floor > -Inf) {
c("\n noise_floor:", noise_floor)},
if ((length(jamba::rmNA(noise_floor_value)) > 0 && jamba::rmNA(noise_floor_value) > -Inf) ||
any(is.na(noise_floor_value))) {
c("\n noise_floor_value:", noise_floor_value)}),
sep="");
}
# if no rownames(x) exist, use index integers
x_rownames <- rownames(x);
if (length(x_rownames) == 0) {
rownames(x) <- seq_len(nrow(x));
}
inorm <- jamma::jammanorm(x,
controlGenes=controlGenes,
minimum_mean=minimum_mean,
controlSamples=controlSamples,
centerGroups=centerGroups,
useMedian=useMedian,
useMean=NULL,
noise_floor=noise_floor,
noise_floor_value=noise_floor_value,
verbose=verbose);
# if (nrow(inorm) == nrow(x) && length(x_rownames) == 0) {
# rownames(inorm) <- x_rownames;
# }
attr(inorm, "norm_method") <- "jammanorm";
} else if ("limma_batch_adjust" %in% method) {
batch <- params$limma_batch_adjust$batch;
group <- params$limma_batch_adjust$group;
if (length(subset_columns) > 0) {
batch <- batch[subset_columns];
group <- group[subset_columns];
}
designBatchGroup <- model.matrix(~0+group);
rownames(designBatchGroup) <- colnames(x);
## limma::removeBatchEffect()
inorm <- limma::removeBatchEffect(x,
batch=batch,
design=designBatchGroup);
attr(inorm, "norm_method") <- "limma_batch_adjust";
} else if (any(c("TMMwsp", "TMM", "RLE") %in% method)) {
if (!jamba::check_pkg_installed("edgeR")) {
stop("The edgeR package is required to normalize by 'TMM'");
}
# method parameters
refColumn <- params$TMM$refColumn;
logratioTrim <- params$TMM$logratioTrim;
if (length(logratioTrim) == 0) {
logratioTrim <- 0.3;
}
sumTrim <- params$TMM$sumTrim;
if (length(sumTrim) == 0) {
sumTrim <- 0.05;
}
doWeighting <- params$TMM$doWeighting;
if (length(doWeighting) == 0) {
doWeighting <- TRUE;
}
Acutoff <- params$TMM$Acutoff;
if (length(Acutoff) == 0) {
Acutoff <- -1e10;
}
# TMM disallows NA values, so we must replace with zero
if (any(is.na(x))) {
x[is.na(x)] <- 0;
}
# obtain normalization factors
normFactors <- edgeR::calcNormFactors(
object=2^x - 1,
method=method,
refColumn=refColumn,
logratioTrim=logratioTrim,
sumTrim=sumTrim,
doWeighting=doWeighting,
Acutoff=Acutoff);
# apply normalization factors
nfs <- log2(normFactors)
inorm <- x + rep(nfs, each=nrow(x));
# inorm <- x / rep(normFactors, each=nrow(x));
# store normalization factors as attributes
attr(inorm, "nf") <- nfs;
attr(inorm, "normFactors") <- normFactors;
attr(inorm, "logratioTrim") <- logratioTrim;
attr(inorm, "sumTrim") <- sumTrim;
attr(inorm, "doWeighting") <- doWeighting;
attr(inorm, "Acutoff") <- Acutoff;
attr(inorm, "norm_method") <- method;
}
# enforce the floor for output matrix
if (enforce_norm_floor && any(x_floored)) {
inorm[(x_floored | inorm <= floor) & !is.na(inorm)] <- floor_value;
}
return(inorm);
}
#' Update function default parameters
#'
#' Update function default parameters
#'
#' This function is a minor extension to `update_list_elements()`
#' intended to help update function parameters which are defined
#' as a nested list.
#'
#' @family jamses utilities
#'
#' @export
update_function_params <- function
(function_name=NULL,
param_name=NULL,
new_values=NULL,
verbose=FALSE,
...)
{
## Purpose is to facilitate updating default parameters which are present in a list,
## but where defaults are defined in a function name. So if someone wants to change
## one of the default values, but keep the rest of the list of defaults, this function
## does it.
##
## The default values are taken from the function formals, using eval(formals(functionName)).
default_params <- eval(formals(function_name)[[param_name]]);
if (verbose) {
jamba::printDebug("update_function_params(): ",
"str(default_params)");
print(str(default_params));
}
default_params <- update_list_elements(default_params,
new_values,
verbose=verbose);
return(default_params);
}
#' Update a subset of list elements
#'
#' Update a subset of list elements
#'
#' This function is intended to help update a nested `source_list`,
#' a subset of whose values should be replaced with entries
#' in `update_list`, leaving any original entries in `source_list`
#' which were not defined in `update_list`.
#'
#' This function may be useful when manipulating lattice or ggplot2
#' graphical parameters, which are often stored in a nested
#' list structure.
#'
#' @family jamses utilities
#'
#' @export
update_list_elements <- function
(source_list,
update_list,
list_layer_num=1,
verbose=FALSE,
...)
{
## Purpose is to update elements in a list, allowing for multi-layered lists
## of lists. In case of a list-of-list, it will call this function again with
## each successive layer of listedness.
##
## Handy for updating lattice graphics settings, which are impossibly nested
## tangled ball of textual yarn. An example:
## tp2 <- updateListElements(trellis.par.get(), list(fontsize=list(points=6)));
## trellis.par.set(tp2);
##
## Or in one line:
## trellis.par.set(updateListElements(trellis.par.get(), list(fontsize=list(points=6))));
if (length(update_list) == 0) {
return(source_list);
}
if ("list" %in% class(update_list) && "list" %in% class(update_list[[1]])) {
for (update_list_name in names(update_list)) {
if (update_list_name %in% names(source_list)) {
## If the name already exists, we must update items within the list
source_list[[update_list_name]] <- update_list_elements(
source_list=source_list[[update_list_name]],
update_list=update_list[[update_list_name]],
list_layer_num=list_layer_num+1);
} else {
## If the name does not already exist, we can simply add it.
source_list[[update_list_name]] <- update_list[[update_list_name]];
}
}
} else {
if (!is.null(names(update_list))) {
source_list[names(update_list)] <- update_list;
} else {
source_list <- update_list;
}
}
return(source_list);
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.