R/ScpModel-ComponentAnalysis.R

Defines functions .formatParamsMapping .formatEigenVectors .addEigenArrows .scaleComponentsToUnity scpComponentBiplot .trimPlotLevels .pcaAxisLabels .plotPCA scpComponentPlot scpComponentAggregate .formatPcaList .svdWrapper .nipalsWrapper .runASCA.E .runASCA .runAPCA .checkPcaResults .componentsToTable .addPcaToList .getPcaFunction scpComponentAnalysis

Documented in scpComponentAggregate scpComponentAnalysis scpComponentBiplot scpComponentPlot

## ---- SCP Component Analysis ----

##' @name ScpModel-ComponentAnalysis
##'
##' @title Component analysis for single cell proteomics
##'
##' @description
##'
##' Component analysis is a powerful tool for exploring data. The
##' package implements the ANOVA-principal component analysis
##' extended to linear models (APCA+) and derivatives (suggested by
##' Thiel at al. 2017). This framework is based on principal component
##' analysis (PCA) and allows exploring the data captured by each
##' model variable individually.
##'
##' @section PCA - notation and algorithms:
##'
##' Given \eqn{A} a m x n matrix, PCA can be summarized as the
##' following decomposition:
##'
##' \deqn{AA^T / (n - 1) = VLV^T}
##'
##' Where \eqn{V} is a m x k orthogonal matrix, that is \eqn{VV^T = I},
##' with k the number of components. \eqn{V} is called the matrix of
##' eigenvectors. \eqn{L} is the k x k diagonal matrix of eigenvalues
##' that contains the variance associated to each component, ordered
##' from highest to lowest variance. The unscaled PC scores are given
##' by \eqn{S = A^TV}.
##'
##' There are 2 available algorithm to perform PCA:
##'
##' - `nipals`: The non-linear iterative partial least squares
##'   (NIPALS) algorithm **can handle missing values** and
##'   approximates classical PCA, although it does not explicitly
##'   maximize the variance. This is implemented in [nipals::nipals()].
##' - `svd`: The singular value decomposition (SVD) is used to perform
##'   an exact PCA, but it **cannot handle missing values**. This is
##'   implemented in [base::svd()].
##'
##' Which algorithm to use is controlled by the `pcaFUN` argument, by
##' default (`"auto"`), the function automatically uses `svd` when
##' there is no missing values and `nipals` when there is at least
##' one missing value.
##'
##' @section Component analysis methods:
##'
##' `scpComponentAnalysis()` performs a PCA on the modelling output.
##' What modelling output the function will use depends on the
##' `method`. The are 3 PCA approaches:
##'
##' - `ASCA` performs a PCA on the effect matrix, that is
##'   \eqn{A = \hat{M_f}} where \eqn{f} is one of the effects in the
##'   model. This PCA is useful to explore the modelled effects and
##'   the relationship between different levels of a factor.
##' - `ASCA.E`: perform PCA on the effect matrix, just like ASCA. The
##'   scores are then updated by projecting the effect matrix added to
##'   the residuals using the eigenvectors, that is
##'   \eqn{scores = (\hat{M_f} + \epsilon)^TV}. This PCA is useful
##'   to explore the modelled effects while blurring these effects
##'   with the unmodelled variability. Note however that for this
##'   approach, the scores are no longer guaranteed to be orthogonal
##'   and the eigenvalues are no longer meaningful. The percentage of
##'   variation should not be interpreted.
##' - `APCA` (default) performs PCA on the effect matrix plus the
##'   residuals, that is \eqn{A = \hat{M_f} + \epsilon}. This PCA
##'   is useful to explore the modelled effects in relation with the
##'   unmodelled variability that is remaining in the residuals.
##'
##' Available methods are listed in `scpModelComponentMethods`.
##' Note that for all three methods, a PCA on the residual matrix is
##' also performed when `residuals = TRUE`, that is
##' \eqn{A = \epsilon = Y - \hat{\beta}X^T}. A PCA on the residuals is
##' useful to explore residual effects that are not captured by any
##' effect in the model. Similarly, a PCA on the input data matrix,
##' that is on the data before modelling is also performed when
##' `unmodelled = TRUE`, that is \eqn{A = Y}.
##'
##' `scpComponentAnalysis()` always returns a list with 2 elements.
##' The first element, `bySample` is a list where each element
##' contains the PC scores for the desired model variable(s). The
##' second element, `byFeature` is a list where each element
##' contains the eigenvectors for the desired model variable(s).
##'
##' @section Exploring component analysis results:
##'
##' [scpAnnotateResults()] adds annotations to the component
##' analysis results. The annotations are added to all elements of the
##' list returned by `scpComponentAnalysis()`. See the associated man
##' page for more information.
##'
##' `scpComponentPlot()` takes one of the two elements of the list
##' generated by `scpComponentAnalysis()` and returns a list of
##' `ggplot2` scatter plots. Commonly, the first two components,
##' that bear most of the variance, are explored for visualization,
##' but other components can be explored as well thanks to the `comp`
##' argument. Each point represents either a sample or a feature,
##' depending on the provided component analysis results
##' (see examples). Change the point aesthetics by providing ggplot
##' arguments in a list (see examples).
##'
##' `scpComponentBiplot()` simultaneously explores the PC scores
##' (sample-space) and the eigenvectors (feature-space). Scores are
##' shown as points while eigenvectors are shown as arrows. Point
##' aesthetics and arrow aesthetics can be controlled with the
##' `pointParams` and the `arrowParams` arguments, respectively.
##' Moreover, arrows are also labelled and label aesthetics can be
##' controlled using `labelParams` and `textBy`. Plotting all
##' eigenvectors as arrows leads to overcrowded plots. You can limit the plotting to
##' the top longest arrows (default to the top 10) as defined by the
##' distance on the two selected PCs.
##'
##' `scpComponentAggregate()` offers functionality to aggregate the
##' results from multiple features. This can be used to obtain, for
##' example, component analysis results for proteins when modelling at
##' the peptide level. The approach is inspired from
##' [scuttle::aggregateAcrossCells()](https://bioconductor.org/packages/release/bioc/html/scuttle.html)
##' and combines, for each group, multiple values for each component
##' using [QFeatures::aggregateFeatures()]. By default, values are
##' aggregated using the median, but `QFeatures` offers other methods
##' as well. The annotation of the component results are automatically
##' aggregated as well. See the `aggregateFeatures()` man page for
##' more information on available methods and expected behavior.
##'
##' @seealso
##'
##' - [ScpModel-Workflow] to run a model on SCP data upstream of
##'   component analysis.
##' - The [nipals::nipals()] function and package for detailed
##'   information about the algorithm and associated parameters.
##' - The [ggplot2::ggplot()] functions and associated tutorials to
##'   manipulate and save the visualization output
##' - [scpAnnotateResults()] to annotate component analysis results.
##'
##' @references
##'
##' Thiel, Michel, Baptiste FĂ©raud, and Bernadette Govaerts. 2017.
##' "ASCA+ and APCA+: Extensions of ASCA and APCA in the Analysis of
##' Unbalanced Multifactorial Designs." Journal of Chemometrics 31
##' (6): e2895.
##'
##' @author Christophe Vanderaa, Laurent Gatto
##'
##' @example inst/examples/examples_ScpModel-ComponentAnalysis.R
##'
NULL

## ---- Analysis functions ----

##' @name ScpModel-ComponentAnalysis
##'
##' @param object An object that inherits from the
##'     `SingleCellExperiment` class. It must contain an estimated
##'     `ScpModel` in its metadata.
##'
##' @param method A `character()` indicating which approach(es) to use
##'     for principal component analysis (PCA). Are allowed:
##'     `"APCA"` (default), `"ASCA"` and/or `"ASCA.E"` (multiple
##'     values are allowed). `"ASCA"`, `"APCA"`, `"ASCA.E"` are
##'     iterated through each desired effects.
##'
##' @param effects A `character()` indicating on which model variables
##'     the component analysis should be performed. Default to all
##'     modelled variables.
##'
##' @param pcaFUN A `character(1)` indicating which function to use to
##'     perform PCA. "nipals" will use [nipals::nipals()] while "svd"
##'     will use [base::svd()]. If "auto", the function uses "nipals"
##'     if the data contain missing values and "svd" otherwise.
##'
##' @param residuals A `logical(1)`, if `TRUE`, PCA is performed on
##'     the residual matrix as well.
##'
##' @param unmodelled A `logical(1)`, if `TRUE`, PCA is performed on
##'     the input matrix as well.
##'
##' @param name A `character(1)` providing the name to use to retrieve
##'     the model results. When retrieving a model and `name` is
##'     missing, the name of the first model found in `object` is used.
##'
##' @param ... For `scpComponentAnalysis()`, further arguments passed
##'     to the PCA function. For `scpComponentAggregate()`, further
##'     arguments passed to [QFeatures::aggregateFeatures()].
##'
##' @export
scpComponentAnalysis <- function(object,
                                 method = NULL,
                                 effects = NULL,
                                 pcaFUN = "auto",
                                 residuals = TRUE,
                                 unmodelled = TRUE,
                                 name, ...) {
    if (is.null(effects)) effects <- scpModelEffectNames(object, name)
    if (any(mis <- !effects %in% scpModelEffectNames(object, name)))
        stop("'", paste0(effects[mis], collapse = "', '"),
             "' is/are not modelled effects.")
    if (any(mis <- !method %in% scpModelComponentMethods))
        stop("Allowed values for 'method' are '",
             paste0(scpModelComponentMethods, collapse = "', '"), "'.")
    pcaFUN <- .getPcaFunction(pcaFUN, object, name)
    out <- List()
    if (unmodelled) {
        pcaUnmod <- pcaFUN(scpModelInput(object, name), ...)
        out <- .addPcaToList(pcaUnmod, out, "unmodelled_")
    }
    if (residuals) {
        pcaRes <- pcaFUN(scpModelResiduals(object, name), ...)
        out <- .addPcaToList(pcaRes, out, "residuals_")
    }
    for (m in method) {
        runMethod <- get(paste0(".run", m))
        effectMats <- scpModelEffects(object, name)
        for (effect in effects) {
            pcaTables <- runMethod(
                effectMats[[effect]],
                scpModelResiduals(object, name), pcaFUN, ...
            )
            out <- .addPcaToList(
                pcaTables, out, paste0(m, "_", effect, "_")
            )
        }
    }
    if (!length(out))
        message("No components were computed. Make sure to provide ",
                "'method' or set 'unmodelled = TRUE' or 'residuals = ",
                "TRUE'.")
    .formatPcaList(out)
}

## Internal function that returns the desired PCA algorithm function
## or that automatically selects a suitable algorithm function.
##
## @param pcaFUN A `character` indicating which function to use to
##     perform PCA. "nipals" will use [nipals::nipals()] while "svd"
##     will use [base::svd()]. If "auto", the function uses "nipals"
##     if the data contain missing values and "svd" otherwise.
## @param object An object that inherits from the
##     `SingleCellExperiment` class. It must contain an estimated
##     `ScpModel` in its metadata. Ignored when pcaFun != "auto".
## @param name A `character(1)` providing the name to use to retrieve
##     the model results. When retrieving a model and `name` is
##     missing, the name of the first model found in `object` is used.
##     Ignored when pcaFun != "auto".
##
.getPcaFunction <- function(pcaFun, object, name) {
    if (pcaFun == "auto") {
        pcaFun <- ifelse(anyNA(scpModelInput(object, name)), "nipals", "svd")
    }
    if (pcaFun == "nipals") {
        .nipalsWrapper
    } else if (pcaFun == "svd") {
        .svdWrapper
    } else {
        stop("Available PCA functions are: 'nipals' or 'svd'")
    }
}

## Internal function that takes PCA results provided as a list,
## converts these results into tables and appends the tables to a
## list-like object. A prefix can be provided to facilitate naming the
## newly appended elements in the returned lists.
##
## @param pcaResults A list containing PCA results. The elements
##     should at least contain the following elements: "scores",
##     "eigenvectors", "eigenvalues", "proportionVariance". Any
##     additional element in the list is ignored.
## @param pcaList A list to which the PCA result tables should be
##     appended.
## @param prefix A character(1) providing a prefix to add to the new
##     element names. By default, the new elements are named
##     "bySample" and "byFeatures". Providing another prefix will lead
##     to naming elements `paste0(prefix, "bySample")` and
##     `paste0(prefix, "byFeature")`
##
.addPcaToList <- function(pcaResults, pcaList, prefix = "") {
    newElements <- .componentsToTable(pcaResults)
    names(newElements) <- paste0(prefix, names(newElements))
    c(pcaList, newElements)
}

## Internal function to convert a PCA result list into a DataFrameList.
## The function extracts the scores and the eigenvectors, converts
## them to a DataFrame, and adds the proportion of explained variance
## in the metadata so it can be easily retrieved for plotting. The
## two tables are returned in a List.
##
## @param x A list containing PCA results. The elements
##     should at least contain the following elements: "scores",
##     "eigenvectors", "eigenvalues", "proportionVariance". Any
##     additional element in the list is ignored.
##
##' @importFrom S4Vectors metadata<- List
##' @importFrom methods as
.componentsToTable <- function(x) {
    .checkPcaResults(x)
    scores <- as(x$scores, "DataFrame")
    eigenvectors <- as(x$eigenvectors, "DataFrame")
    propVariance <- x$proportionVariance
    names(propVariance) <- colnames(scores)
    scores$cell <- rownames(scores)
    eigenvectors$feature <- rownames(eigenvectors)
    metad <- list(proportionVariance = propVariance)
    metadata(scores) <- metadata(eigenvectors) <- metad
    List(bySample = scores, byFeature = eigenvectors)
}

## Internal function that checks that PCA results comply to the
## expected data representation. We expect the PCA results to be a
## list with at least the following elements:
## 1. `scores`: a matrix with computed scores with rows corresponding
##    to columns (samples) in the input data and the columns
##    corresponding to the latent components.
## 2. `eigenvectors`: a matrix with eigenvectors with rows corresponding
##    to rows (features) in the input data and the columns
##    corresponding to the latent components.
## 3. `eigenvalues`: a vector containing the eigenvalues, that is the
##    variance associated to each principal component.
## 4. `proportionVariance`: a vector containing the proportion of the
##    variance explained by each component.
##
## @param x A list containing PCA results. The elements
##     should at least contain the following elements: "scores",
##     "eigenvectors", "eigenvalues", "proportionVariance". Any
##     additional element in the list is ignored.
##
.checkPcaResults <- function(x) {
    if (!is.list(x)) stop("PCA results must be a list.")
    expNames <- c("scores", "eigenvectors", "eigenvalues", "proportionVariance")
    if (!all(expNames %in% names(x)))
        stop(
            "PCA results must at least contain the following ",
            "elements: ", paste(expNames, collapse = ", "), "."
        )
    if (is.null(names(x$eigenvalues)))
        stop("Components must be named.")
    if (!identical(colnames(x$scores), names(x$eigenvalues)))
        stop("Components in scores do not match the eigenvalues.")
    if (!identical(colnames(x$eigenvectors), names(x$eigenvalues)))
        stop("Components in eigenvectors do not match the eigenvalues.")
    if (!identical(names(x$proportionVariance), names(x$eigenvalues)))
        stop("The proportions of variance explained do not match the eigenvalues.")
    NULL
}

## Internal functions to perform APCA. APCA takes the pure effect
## matrix augmented with the residuals.
.runAPCA <- function(M, residuals, fun, ...) {
    fun(M + residuals, ...)
}

## Internal functions to perform ASCA. ASCA takes the pure effect
## matrix. the 'residuals' argument is not used but included to match
## the function signature of .runAPCA() and .runASCA.E
.runASCA <- function(M, residuals, fun, ...) {
    fun(M, ...)
}

## Internal functions to perform ASCA-E. ASCA-E takes the pure effect
## matrices and projects the eigenvectors on the augmented effect
## matrix. Note that scores are no longer orthogonal, hence the
## eigenvalues are no longer meaningful.
.runASCA.E <- function(M, residuals, fun, ...) {
    out <- fun(M, ...)
    M <- M + residuals
    M[is.na(M)] <- 0
    out$scores <- crossprod(M, out$eigenvectors)
    out$eigenvalues <- rep(NA, length(out$eigenvalues))
    out$proportionVariance <- rep(NA, length(out$proportionVariance))
    names(out$eigenvalues) <- names(out$proportionVariance) <- colnames(out$scores)
    out
}

## Internal wrapper around the nipals::nipals() function. The wrapper
## is necessary to format the output consistently, that is it makes
## sure the output is a list with 4 elements (detailed below) and that
## each element has expected dimension names.
##
## @param mat A data matrix where rows represent features and columns
##     represent samples. `NA`s are allowed.
## @param ncomp An `integer(1)` providing the number of principal
##     components to fit. Cannot exceed the smallest dimension of mat.
## @param center If `TRUE`, subtract the mean from each row of
##     `mat` before PCA. Declared to overwrite default values.
## @param startcol Same argument as in `nipals::nipals()` to control
##     the initial values at the first iteration for each component.
##     Declared to overwrite default values. By default,
##     `nipals` uses the column of mat that has maximum absolute sum,
##     but we noticed that it is better to with the column that has
##     the most observed values.
## @param scale if `TRUE`, divide the standard deviation from each
##     row of `mat` before PCA. Declared to overwrite default values.
## @param ... Further arguments passed to `nipals::nipals`.
##
## Returns a list with 4 elements:
##
## 1. `scores`: a matrix with computed scores with rows corresponding
##    to columns (samples) in the input data and the columns
##    corresponding to the latent components.
## 2. `eigenvectors`: a matrix with eigenvectors with rows corresponding
##    to rows (features) in the input data and the columns
##    corresponding to the latent components.
## 3. `eigenvalues`: a vector containing the eigenvalues, that is the
##    variance associated to each principal component.
## 4. `proportionVariance`: a vector containing the proportion of the
##    variance explained by each component.
##
## There are some inconsistencies in the terminology used by NIPALS:
## - The 'scores' output is in fact the standardized scores, we scale
##   those with the eigenvalues to obtain the scores.
## - The 'eig' output seems to refer to eigenvalues (ie variance
##   assiociated to each components), but in fact those are the
##   square root of the sums of squares.
## - The 'loadings' output does not contain the PC loadings but the
##   unscaled eigenvectors. loadings = eigenvectors * sqrt(eigenvalue)
##
## This is inspired from
## https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca
##
##' @importFrom nipals nipals
.nipalsWrapper <- function(mat, ncomp = 2L, center = TRUE,
                           startcol = function(x) sum(!is.na(x)),
                           scale = FALSE, ...) {
    if (ncomp > min(nrow(mat), ncol(mat)))
        stop("'ncomp' cannot exceeded number of features or samples, ",
             "whichever is smallest.")
    mat <- mat[rowSums(!is.na(mat)) > 0, ]
    res <- nipals(t(mat), ncomp = ncomp, center = center,
                  startcol = startcol,
                  scale = scale, ...)
    scores <- res$scores %*% diag(res$eig)
    colnames(scores) <- paste0("PC", 1:ncomp)
    names(res$R2) <- paste0("PC", 1:ncomp)
    names(res$eig) <- paste0("PC", 1:ncomp)
    list(
        scores = scores,
        eigenvectors = res$loadings,
        eigenvalues = res$eig^2 / (nrow(mat) - 1),
        proportionVariance = res$R2
    )
}

## Internal wrapper around base::svd(). Same as .nipalsWrapper().
##
## @param mat A data matrix where rows represent features and columns
##     represent samples. `NA`s are allowed.
## @param ncomp An `integer(1)` providing the number of principal
##     components to fit. Cannot exceed the smallest dimension of mat.
## @param center If `TRUE`, subtract the mean from each row of
##     `mat` before PCA. Declared to overwrite default values.
## @param scale if `TRUE`, divide the standard deviation from each
##     row of `mat` before PCA. Declared to overwrite default values.
## @param ... Further arguments passed to `base::svd()`.
##
## Note: eigenvalues = singular values^2 / (n - 1)
##
## Useful resource:
## https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca
##
.svdWrapper <- function(mat, ncomp = 2, center = TRUE,
                        scale = FALSE, ...) {
    if (any(is.na(mat)))
        stop("svd cannot deal with missing values. Use 'algorithm = ",
             "\"nipals\"' instead.")
    if (ncomp > min(nrow(mat), ncol(mat)))
        stop("'ncomp' cannot exceeded number of features or samples, ",
             "whichever is smallest.")
    res <- svd(scale(t(mat), center = center, scale = scale))
    scores <- (res$u %*% diag(res$d))[, 1:ncomp]
    eigenvectors <- res$v[, 1:ncomp]
    dimnames(scores) <- list(colnames(mat), paste0("PC", 1:ncomp))
    dimnames(eigenvectors) <- list(rownames(mat), paste0("PC", 1:ncomp))
    names(res$d) <- paste0("PC", 1:ncomp)
    eigenvalues <- res$d^2 / (nrow(mat) - 1)
    proportionVariance <- eigenvalues / sum(eigenvalues)
    list(
        scores = scores,
        eigenvectors = eigenvectors,
        eigenvalues = eigenvalues[1:ncomp],
        proportionVariance = proportionVariance[1:ncomp]
    )
}

## Internal function that take a named list and formats it in a nested
## list. The element names must end either with "_bySample" or
## "_byFeature". The functions then nests the list so that the output
## list contains 2 elements named "bySample" and "byFeatures", each
## containing the corresponding elements from the input list.
##
## @param pcaList A List
##
.formatPcaList <- function(pcaList) {
    if (!length(pcaList)) return(List())
    if (any(!grepl("_bySample$|_byFeature$", names(pcaList))))
        stop("Unexpected names in PCA result list.")
    bySample <- pcaList[grepl("bySample", names(pcaList))]
    names(bySample) <- sub("_bySample$", "", names(bySample))
    byFeature <- pcaList[grepl("byFeature", names(pcaList))]
    names(byFeature) <- sub("_byFeature$", "", names(byFeature))
    List(bySample = bySample, byFeature = byFeature)
}

## ---- Aggregation functions ----

##' @name ScpModel-ComponentAnalysis
##'
##' @param componentList A list of components analysis results. This
##'     is typically the `bySample` or `byFeature` element of the list
##'     returned by `scpComponentAnalysis()`.
##'
##' @param fcol A `character(1)` providing the name of the column
##'     to use to group features.
##'
##' @param fun A `function` that summarizes the values for each group.
##'     See [QFeatures::aggregateFeatures()] for a list of available
##'     functions.
##'
##' @importFrom matrixStats colMedians
##' @importFrom QFeatures readSummarizedExperiment aggregateFeatures
##' @importFrom SummarizedExperiment assays rowData
##' @importFrom S4Vectors metadata
##'
##' @export
scpComponentAggregate <- function(componentList, fcol,
                                  fun = colMedians, ...) {
    if (any(mis <- sapply(componentList, function(x) !fcol %in% colnames(x))))
        stop("'", fcol, "' not found in list element(s): ",
             paste0(names(componentList)[mis], collapse = ", "), ".")
    message("Components may no longer be orthogonal after aggregation.")
    endoapply(componentList, function(x) {
        md <- metadata(x)
        out <- readSummarizedExperiment(
            data.frame(x), grepl("^PC\\d*$", colnames(x))
        )
        out <- suppressMessages(aggregateFeatures(out, fcol, fun, ...))
        out <- DataFrame(assays(out)[[1]], rowData(out))
        metadata(out) <- md
        out
    })
}

## ---- Plotting functions ----

##' @name ScpModel-ComponentAnalysis
##'
##' @param componentList A list of components analysis results. This
##'     is typically the `bySample` or `byFeature` element of the list
##'     returned by `scpComponentAnalysis()`.
##'
##' @param comp An `integer(2)` pointing to which components to fit.
##'     The values of `comp` are not allowed to exceed the number of
##'     computed components in `componentList`.
##'
##' @param pointParams A `list` where each element is an argument that
##'     is provided to [ggplot2::geom_point()]. This is useful to
##'     change point size, transparency, or assign colour based on an
##'     annotation (see [ggplot2::aes()]).
##'
##' @param maxLevels An `integer(1)` indicating how many colour levels
##'     should be shown on the legend when colours are derived from a
##'     discrete factor. If `maxLevels = NULL`, all levels are shown.
##'     This parameters is useful to colour points based on a factor
##'     with many levels that would otherwise overcrowd the legend.
##'
##' @export
scpComponentPlot <- function(componentList,
                             comp = 1:2,
                             pointParams = list(),
                             maxLevels = NULL) {
    pl <- lapply(names(componentList), function(i) {
        propVar <- metadata(componentList[[i]])$proportionVariance
        if (any(comp > length(propVar))) stop("'comp' is out of bounds.")
        .plotPCA(componentList[[i]], comp, propVar, pointParams) +
            ggtitle(sub("_", " on ", i))
    })
    names(pl) <- names(componentList)
    if (!is.null(maxLevels)) {
        pl <- lapply(pl, .trimPlotLevels, maxLevels)
    }
    pl
}

## Internal function that generates a ggplot from PCA results as
## generated by scpComponentAnalysis()
##
## @param pcs A DataFrame table as generated by scpComponentAnalysis()
## @param comp A numeric(2) indicating which 2 components to show, it
##     cannot exceed the number of available components
## @param proportionVariance A numeric(2) providing the associated
##     proportion of explained by each of the 2 components. NA allowed.
##     Used for generating PC axis labels.
## @pointParams A list of geom_point() parameters.
##
.plotPCA <- function(pcs, comp, proportionVariance, pointParams) {
    stopifnot(length(comp) == 2)
    axislabels <- .pcaAxisLabels(proportionVariance, comp)
    ggplot(data.frame(pcs)) +
        aes(x = .data[[paste0("PC", comp[[1]])]],
            y = .data[[paste0("PC", comp[[2]])]]) +
        do.call(geom_point, pointParams) +
        labs(x = axislabels$x, y = axislabels$y) +
        theme_minimal()
}

## Internal function that generates the PCA axis labels, namely it formats
## the percentage of variance explained.
##
## @param proportionVariance A numeric(2) providing the associated
##     proportion of explained by each of the 2 components. NA allowed.
## @param comp A numeric(2) indicating which 2 components to show, it
##     cannot exceed the number of available components
#
.pcaAxisLabels <- function(proportionVariance, comp) {
    xlabel <- paste0("PC", comp[[1]])
    ylabel <- paste0("PC", comp[[2]])
    if (all(!is.na(proportionVariance))) {
        percentVar <- round(proportionVariance[comp] * 100, 1)
        xlabel <- paste0(xlabel, " (", percentVar[[1]],"%)")
        ylabel <- paste0(ylabel, " (", percentVar[[2]],"%)")
    }
    list(x = xlabel, y = ylabel)
}

## Internal function that reduces the number of colours shown on a
## ggplot legend when the colour is determined by a categorical
## variable. The colour variable is automatically retrieve from the
## ggplot object. The function only affects the legend.
##
## @param pl A ggplot object from which to trim the colour levels.
## @param maxLevels A numeric(1) indicating up to how many levels are
##     allowed on the colour legend.
.trimPlotLevels <- function(pl, maxLevels) {
    what <- pl$labels$colour
    if (is.null(what)) return(pl)
    col <- pl$data[[what]]
    if ((is.factor(col) | is.character(col)) &
        length(unique(col)) > maxLevels) {
        levels <- unique(col)
        sel <- round(seq(1, length(levels), length.out = maxLevels))
        breaks <- as.character(levels[sel])
        pl <- pl + scale_color_discrete(breaks = breaks)
    }
    pl
}

##' @name ScpModel-ComponentAnalysis
##'
##' @param scoreList A list of components analysis results. This
##'     is typically the `bySample` element in the list returned by
##'     `scpComponentAnalysis()`.
##'
##' @param eigenvectorList A list of components analysis results. This
##'     is typically the `byFeature` element in the list returned by
##'     `scpComponentAnalysis()`.
##'
##' @param arrowParams A `list` where each element is an argument that
##'     is provided to [ggplot2::geom_segment()]. This is useful to
##'     change arrow head style, line width, transparency, or assign
##'     colour based on an annotation (see [ggplot2::aes()]). Note
##'     that changing the 'x', 'y', 'xend', and 'yend' aesthetics is
##'     not allowed.
##'
##' @param labelParams A `list` where each element is an argument that
##'     is provided to [ggrepel::geom_label_repel()]. This is useful
##'     to change  label size, transparency, or assign
##'     colour based on an annotation (see [ggplot2::aes()]). Note
##'     that changing the 'x', 'y', 'xend', and 'yend' aesthetics is
##'     not allowed.
##'
##' @param textBy A `character(1)` indicating the name of the column
##'     to use to label arrow heads.
##'
##' @param top An `integer(1)` indicating how many arrows should be
##'     drawn. The arrows are sorted based on their size as determined
##'     by the euclidean distance in the principal component space.
##'
##' @export
scpComponentBiplot <- function(scoreList,
                               eigenvectorList,
                               comp = 1:2,
                               pointParams = list(),
                               arrowParams = list(
                                   arrow = arrow(length = unit(0.2, "cm"))
                               ),
                               labelParams = list(
                                   size = 2,
                                   max.overlaps = 10
                               ),
                               textBy = "feature",
                               top = 10,
                               maxLevels = NULL) {
    pcn <- paste0("PC", comp)
    scoreList <- .scaleComponentsToUnity(scoreList, pcn)
    eigenvectorList <- .scaleComponentsToUnity(eigenvectorList, pcn)
    pl <- scpComponentPlot(
        scoreList, comp = comp, pointParams = pointParams,
        maxLevels = maxLevels
    )
    for (i in names(pl)) {
        pl[[i]] <- .addEigenArrows(
            pl[[i]], eigenvectorList[[i]], comp, textBy, top,
            arrowParams, labelParams
        )
    }
    pl
}

## Internal function that  makes sure that the principal components for features
## and samples are on the same scale. This is performed by scaling the
## data so that the largest observation has length one.
##
## @param compList A List(), typically the bySample or byFeature table
##     list generated by scpComponentAnalysis()
## @param compNames A character() indicating the column names of the
##     principal components to scale.
##
##' @importFrom S4Vectors endoapply
.scaleComponentsToUnity <- function(compList, compNames) {
    endoapply(compList, function(components) {
        x <- as.matrix(components[, compNames])
        maxLength <- max(sqrt(rowSums(x^2)))
        scales <- 1 / rep(maxLength, ncol(x))
        components[, compNames] <- x %*% diag(scales)
        components
    })
}

## Internal function that takes a score plot and adds eigenvector information as
## labelled arrows.
##
## @param pl A ggplot object to add eigenvectors as arrows
## @param eigenvectors A DataFrame table of eigenvector data. This is typically
##     one of the `byFeature` tables in the list returned by
##     `scpComponentAnalysis()`.
## @param comp A numeric(2) indicating which 2 components to show, it
##     cannot exceed the number of available components
## @param textBy A `character(1)` indicating the name of the column
##     to use to label arrow heads.
## @param top An `integer(1)` indicating how many arrows should be
##     drawn. The arrows are sorted based on their size as determined
##     by the euclidean distance in the principal component space.
## @param arrowParams A `list` where each element is an argument that
##     is provided to [ggplot2::geom_segment()]. This is useful to
##     change arrow head style, line width, transparency, or assign
##     colour based on an annotation (see [ggplot2::aes()]).
## @param labelParams A `list` where each element is an argument that
##     is provided to [ggrepel::geom_label_repel()]. This is useful
##     to change  label size, transparency, or assign
##     colour based on an annotation (see [ggplot2::aes()]).
##
##' @importFrom ggrepel geom_label_repel
.addEigenArrows <- function(pl, eigenvectors, comp, textBy, top,
                            arrowParams, labelParams) {
    evData <- .formatEigenVectors(
        eigenvectors, comp, textBy, top
    )
    if (!nrow(evData)) return(pl)
    arrowParams <- .formatParamsMapping(arrowParams, comp, isArrow = TRUE)
    labelParams <- .formatParamsMapping(labelParams, comp, isArrow = FALSE)
    labelParams$data <- arrowParams$data <- evData
    pl + ylim(-1, 1) + xlim(-1, 1) +
        do.call(geom_segment, arrowParams) +
        do.call(geom_label_repel, labelParams)
}

## Internal function that takes the eigenvector data, sorts the rows by the
## arrow length and takes the top rows
.formatEigenVectors <- function(eigenvectors, comp, textBy, top) {
    if (!textBy %in% colnames(eigenvectors))
        stop("'", textBy, "' not found in component tables. Use ",
             "'scpAnnotateResults()' to add custom annotations.")
    if (top == 0) return(data.frame(eigenvectors[0, , drop = FALSE]))
    if (top >= nrow(eigenvectors))
        return(data.frame(eigenvectors, label = eigenvectors[[textBy]]))
    eigenvectors$label <- eigenvectors[[textBy]]
    x <- as.matrix(eigenvectors[, paste0("PC", comp)])
    vectorLength <- sqrt(rowSums(x^2))
    sel <- order(vectorLength, decreasing = TRUE)[1:top]
    data.frame(eigenvectors[sel, , drop = FALSE])
}

## Internal function that combines mapping information (= aes())
## needed for plotting eigenvector arrows and labels with the
## information provided by the user (eg color, size, linewidth, ...).
## If the list is not named or if some elements do not have a name,
## the function assumes that the mapping is the first unnnamed
## element.
##
## The function return a list where the mapping parameters are formatted
## so to contained the imposed 'x', 'y', 'xend', and 'yend' aesthetics.
## These aesthetics cannot be used provided.
##
## @param aesParams A list of ggplot aesthetic parameters.
## @param comp A numeric(2) indicating which 2 components to show, it
##     cannot exceed the number of available components. It is used to
##     determine the 'xend' and 'yend' aesthetics.
## @param isArrow A logical(1) indicating wheter aesParams are related
##     to arrow parameters (TRUE) or labelling parameters (FALSE).
##
.formatParamsMapping <- function(aesParams, comp, isArrow) {
    if (!inherits(aesParams, "list"))
        stop("'labelParams' and 'arrowParams' must be a list.")
    if (is.null(names(aesParams)))
        names(aesParams) <- rep("", length(aesParams))
    names(aesParams)[which(names(aesParams) == "")[1]] <- "mapping"
    notAllowed <- c("x", "xend", "y", "yend")
    if (any(sel <- notAllowed %in% names(aesParams) |
            notAllowed %in% names(aesParams$mapping)))
        stop("'", paste(notAllowed[sel], collapse = "', '"), "' cannot be ",
             "user-provided.")
    posX <- quo(.data[[paste0("PC", comp[[1]])]])
    posY <- quo(.data[[paste0("PC", comp[[2]])]])
    if (isArrow) {
        imposedMapping <- list(x = 0, y = 0, xend = posX, yend = posY)
    } else {
        imposedMapping <- list(x = posX, y = posY, label = quo(.data$label))
    }
    aesParams$mapping <- do.call(
        aes, c(imposedMapping, aesParams$mapping)
    )
    aesParams
}
UCLouvain-CBIO/scp documentation built on Aug. 3, 2024, 3:38 a.m.