upbm: Tools for the analysis of universal protein binding microarrays

Documented in normalizeWithinReplicates

#' @title Normalize within replicates
#'
#' @description
#' Universal PBM experiments are often performed with several conditions of interest,
#' e.g. allelic variants, assayed on separate arrays of the same plate with few replicates.
#' Within and across plates, probe intensities can vary for biologically
#' uninteresting reasons, such as concentration differences. To explicitly correct for these
#' differences, normalization is performed in two steps.
#'
#' First, normalization is performed within replicates (plates) with the assumption that
#' biologically uninteresting differences only affect probe intensities multiplicatively.
#' Normalization factors are estimated for each sample relative to a baseline condition on each
#' plate. The baseline should ideally be a replicate wild type or other natural reference condition
#' included in each replicate (plate). This function includes approaches for performing this step of
#' normalization.
#'
#' Second, normalization is performed across replicates (plates). More detail on this procedure 
#' can be found in the \code{\link{normalizeAcrossReplicates}} documentation.
#'
#' The approaches to normalization implemented in this function make a fundamental assumption
#' that lower-tail probe intensities are distributed similarly across the conditions being
#' normalized. This assumption is generally satisfied for allelic variants of the same transcription
#' factor or transcription factors with similar binding affinities. However, this assumption may
#' not always hold, e.g. if comparing proteins of completely different families. \emph{In these cases,
#' normalization should be performed with caution, and analyses and plots comparing the distributions
#' of lower-tail probe intensities should be explored.}
#'  
#' @param pe a SummarizedExperiment object containing GPR intensity information.
#' @param assay a string name of the assay to normalize.
#'        (default = \code{SummarizedExperiment::assayNames(pe)[1]})
#' @param method a string specifying the method to use for normalization. Must be one of
#'        \code{"tmm"} or \code{"quantile"}. Details on the methods are provided below.
#'        (default = \code{"tmm"})
#' @param q a percentile between 0 and 1 specifying either the upper quantile of probes to include 
#'        for normalization when \code{method = "tmm"} or the quantile to use for aligning samples
#'        when \code{method = "quantile"}. (default = 0.6)
#' @param qlower a percentile between 0 and 1-\code{q} specifying the lower quantile of
#'        probes to include for normalization when \code{method = "tmm"}. This parameter is
#'        ignored when \code{method = "quantile"}. (defalut = 0)
#' @param qdiff a percentile between 0 and 0.5 specifying the additional fraction of lower-tail
#'        probes to filter based on the deviation from the baseline condition when \code{method = "tmm"}.
#'        Probes with the \code{qdiff} smallest (most negative) and the \code{qdiff} largest (most positive)
#'        deviations from baseline condition will be filtered from normalization.
#'        This parameter is ignored when \code{method = "quantile"}. (default = 0.2)
#' @param group a character string specifying a column in \code{colData(pe)} to use for grouping replicates.
#'        If scans shouldn't be grouped, specify NULL. (default = \code{"id"})
#' @param stratify a character string specifying a column in \code{colData(pe)} to use for determining
#'        the unique baseline scan within each \code{group}. (default = \code{"condition"})
#' @param baseline a character string specifying the baseline condition in the \code{stratify} column to normalize
#'        other conditions against within each \code{group}. If not specified and set to NULL, the baseline
#'        value is guessed by looking for values in the \code{stratify} column ending in ``ref". If multiple
#'        unique matching values are found, one value is chosen arbitrarily. If the baseline condition is missing
#'        from any \code{group} with more than one scan, an error is thrown. (default = NULL)
#' @param verbose a logical value whether to print verbose output during analysis. (default = FALSE)
#'
#' @details
#' The trimmed mean of M-values (\code{"tmm"}) method implemented in this function for cross-sample normalization
#' within replicates is based on the popular TMM method for RNA-seq data included
#' in the \code{edgeR} package. Very simply, a normalization factor is estimated as the trimmed mean
#' of probe-level log-scale differences between the baseline condition and sample using the lower
#' \code{[qlower, q]} percentile probes. Probes are ordered by the log-scale average intensity across
#' the baseline condition and sample. The trimmed mean is calculated excluding the top and bottom \code{qdiff}
#' probes.
#'
#' Unlike RNA-seq expression estimates, PBM data show near-constant variance in log-scale differences
#' as a function of the log-scale mean intensities. Therefore, a simplified variant of the original
#' TMM method is used, where precision weights are not introduced.
#'
#' The quantile-based (\code{"quantile"}) method \emph{should not be confused with what is commonly referred to
#' as ``quantile normalization."} Here, quantile-based normalization computes scaling factors across 
#' 
#' @return
#' Original PBMExperiment object with assay containing within-replicate normalized intensities
#' (\code{"normalized"}) and a new column added to the colData, \code{"withinRepScale"},
#' containing the inverse of the scaling factors used to normalize intensities.
#' If an assay with the same name is already included in the object, it will be overwritten.
#'
#' @seealso \code{\link{normalizeAcrossReplicates}}
#' @importFrom stats quantile
#' @importFrom dplyr as_tibble tibble mutate group_by filter do select ungroup left_join
#' @importFrom tidyr pivot_longer pivot_wider nest
#' @importFrom rlang enquo quo_name
#' @export
#' @author Dongyuan Song, Patrick Kimes
normalizeWithinReplicates <- function(pe, assay = SummarizedExperiment::assayNames(pe)[1],
                                      method = c("tmm", "quantile"), q = 0.6, qlower = 0, qdiff = 0.2,
                                      group = "id", stratify = "condition", baseline = NULL,
                                      verbose = FALSE) {
    stopifnot(is(pe, "PBMExperiment"))
    method <- match.arg(method)

    stopifnot(q >= 0, q < 1)
    stopifnot(qlower < 1 - q)
    stopifnot(qdiff >= 0, qdiff <= 0.5)
    stopifnot(assay %in% SummarizedExperiment::assayNames(pe))

    ## check normalization groups
    stopifnot(is.null(group) || group %in% names(colData(pe)))
    stopifnot(stratify %in% names(colData(pe)))
    
    if (verbose) {
        cat("|| upbm::normalizeWithinReplicates \n")
        cat("|| - Starting within-replicate normalization for", ncol(pe), "PBM scans.\n")
    }

    if (verbose) {
        cat("|| - Filtering probes according to", length(pe@probeFilter),
            "probeFilter rule(s).\n")
        ntotal <- nrow(pe)
    }
    
    if (ncol(pe) == 1L) {
        SummarizedExperiment::assays(pe) <- c(S4Vectors::SimpleList(normalized = assay(pe, assay)),
                                              SummarizedExperiment::assays(pe))
        if (verbose) {
            cat("|| - Since only 1 scan, no normalization is performed.\n")
            cat("|| - Finished within-replicate normalization.\n")
            cat("|| - Returning PBMExperiment with", nrow(pe), "rows and", ncol(pe), "columns.\n")
        }
        return(pe)
    }
    
    ## filter probes - only for computing shift/scale factors (return original pe)
    fpe <- pbmFilterProbes(pe)
    rdat <- dplyr::as_tibble(as.data.frame(rowData(fpe)[, fpe@probeCols, drop = FALSE], optional = TRUE))

    if (verbose) {
        cat("|| - Data filtered from", ntotal, "probes to", nrow(fpe), "probes.\n")
    }

    ## separate out groups with only single scan
    if (is.null(group)) {
        isingletons <- FALSE
    } else {
        singletons <- table(colData(fpe)[[group]])
        singletons <- names(singletons)[singletons == 1]
        isingletons <- length(singletons) > 0
        if (isingletons) {
            singleton_cols <- colData(fpe)[[group]] %in% singletons
            singleton_fits <- dplyr::tibble(sample = colnames(fpe)[singleton_cols],
                                            Group = colData(fpe)[[group]][singleton_cols],
                                            Stratify = colData(fpe)[[stratify]][singleton_cols],
                                            isBaseline = TRUE, 
                                            withinRepScale = 1L)
            fpe <- fpe[, !singleton_cols]
        }
    }
    
    ## check normalization stratification settings
    strats <- .pbmCheckStratify(s = fpe, strat = stratify, bl = baseline, gp = group,
                                needbl = TRUE, verb = verbose)
    coldat <- strats$coldat
    baseline <- strats$baseline

    if (verbose) {
        cat("|| - Performing within-replicate normalization with:\n")
        cat("||     - replicate group column:", group, "\n")
        cat("||     - condition column:", stratify, "\n")
        cat("||     - baseline condition:", baseline, "\n")
    }

    ## tidy up data for computing factors
    scale_assay <- as.data.frame(assay(fpe, assay), optional = TRUE)
    scale_assay <- dplyr::as_tibble(scale_assay)
    scale_assay <- dplyr::bind_cols(scale_assay, rdat)

    ## need to merge with columns from row data 
    scale_assay <- tidyr::pivot_longer(scale_assay, names_to = "sample",
                                       values_to = "value", -(!! fpe@probeCols))
    scale_assay <- dplyr::left_join(scale_assay, coldat, by = "sample")
    
    assay_fits <- dplyr::filter(scale_assay, !is.na(value), value > 0)

    if (method == "tmm") {
        if (verbose) {
            cat("|| - Performing TMM normalization with:\n")
            cat("||     - q:", q, "\n")
            cat("||     - qlower:", qlower, "\n")
            cat("||     - qdiff:", qdiff, "\n")
        }
        
        bl_assay <- dplyr::filter(assay_fits, isBaseline)
        bl_assay <- dplyr::select_at(bl_assay, c("value", "Group", fpe@probeCols))
        assay_fits <- dplyr::left_join(assay_fits, bl_assay, by = c("Group", fpe@probeCols),
                                       suffix = c("", ".bl"))
        assay_fits <- dplyr::filter(assay_fits, !is.na(value.bl))
        assay_fits <- dplyr::mutate(assay_fits,
                                    M.value = (log2(value) - log2(value.bl)),
                                    A.value = (log2(value) + log2(value.bl))/2)  
        assay_fits <- tidyr::nest(assay_fits, data = c(-sample, -Stratify, -Group, -isBaseline))

        assay_fits <- dplyr::mutate(assay_fits,
                                    withinRepScale = vapply(data, function(x) {
                                        z <- dplyr::filter(x, A.value < quantile(A.value, q))
                                        z <- dplyr::filter(z, A.value > quantile(A.value, qlower))
                                        z <- dplyr::summarise(z, est_scale = 2^mean(M.value, trim = qdiff, na.rm = TRUE))
                                        as.numeric(z$est_scale)
                                    }, numeric(1L)))
        assay_fits <- dplyr::select(assay_fits, -data)
        
    } else if (method == "quantile") {
        if (verbose) {
            cat("|| - Performing quantile-based normalization with:\n")
            cat("||     - q:", q, "\n")
        }
        assay_fits <- dplyr::group_by(scale_assay, sample, Stratify, Group, isBaseline)
        assay_fits <- dplyr::summarize(assay_fits, ul = quantile(value, probs = q, na.rm = TRUE))
        assay_fits <- dplyr::ungroup(assay_fits)

        assay_fits <- dplyr::group_by(assay_fits, Group)
        assay_fits <- dplyr::mutate(assay_fits, withinRepScale = ul / ul[isBaseline])
        assay_fits <- dplyr::ungroup(assay_fits)
        assay_fits <- dplyr::select(assay_fits, -ul)

    } else {
        stop("Specified 'method=' parameter is invalid.")
    }

    ## add singletons back
    if (isingletons) {
        assay_fits <- dplyr::bind_rows(assay_fits, singleton_fits)
    }
    
    ## tidy up original data
    new_assay <- as.data.frame(assay(pe, assay), optional = TRUE)
    new_assay <- dplyr::as_tibble(new_assay)
    new_assay <- dplyr::mutate(new_assay,
                               Row = rowData(pe)[, "Row"],
                               Column = rowData(pe)[, "Column"])
    new_assay <- tidyr::pivot_longer(new_assay, names_to = "sample",
                                     values_to = "value", c(-Row, -Column))
    ##new_assay <- dplyr::left_join(new_assay, coldat, by = "sample")
    
    ## adjust to reference
    new_assay <- dplyr::left_join(new_assay, assay_fits, by = "sample")
    new_assay <- dplyr::mutate(new_assay, value = value / withinRepScale)
    
    ## return to square assay shape
    new_assay <- dplyr::select(new_assay, sample, value, Row, Column)
    new_assay <- tidyr::pivot_wider(new_assay, names_from = sample, values_from = value)
    
    ## match row order to rowData
    c_order <- paste(rowData(pe)$Row, rowData(pe)$Column, sep = "-")
    new_order <- match(c_order, paste(new_assay$Row, new_assay$Column, sep = "-"))
    stopifnot(!duplicated(new_order), length(new_order) == nrow(pe))
    new_assay <- new_assay[new_order, ]
    new_assay <- dplyr::select(new_assay, -Row, -Column)
    
    ## match column order to colData
    stopifnot(colnames(new_assay) %in% colnames(pe))
    new_assay <- new_assay[, colnames(pe)]
    new_assay <- DataFrame(new_assay, check.names = FALSE)
    
    ## add to input PBMExperiment
    if ("normalized" %in% assayNames(pe)) {
        SummarizedExperiment::assay(pe, "normalized") <- NULL
        if (verbose) {
            cat("|| - Original PBMExperiment object contains \"normalized\" assay.\n")
            cat("|| - Existing \"normalized\" assay will be overwritten.\n")
        }
    }
    SummarizedExperiment::assays(pe) <- c(S4Vectors::SimpleList(normalized = new_assay),
                                          SummarizedExperiment::assays(pe))
    
    if (verbose) {
        cat("|| - Adding within-replicate normalization factors to colData as \"withinRepScale\".\n")
    }

    ## add scaling factors to colData
    assay_fits <- dplyr::select(assay_fits, -Stratify, -Group, -isBaseline)
    if ("fits" %in% names(assay_fits)) {
        assay_fits <- dplyr::select(assay_fits, -fits)
    }

    ## merge by row names
    coldat <- merge(colData(pe), data.frame(assay_fits, row.names = "sample"),
                    by = 0, all = TRUE)
    rownames(coldat) <- coldat$Row.names
    coldat$Row.names <- NULL
    
    ## match colData row order with SE col order
    coldat <- coldat[match(colnames(pe), rownames(coldat)), , drop = FALSE]
    
    stopifnot(all(rownames(coldat) == colnames(pe)))
    colData(pe) <- coldat
    
    if (verbose) {
        cat("|| - Finished within-replicate normalization.\n")
        cat("|| - Returning PBMExperiment with", nrow(pe), "rows and", ncol(pe), "columns.\n")
    }

    return(pe)
}