##' @title Aggregate an assay's quantitative features
##'
##' @description
##'
##' This function aggregates the quantitative features of an assay,
##' applying a summarisation function (`fun`) to sets of features.
##' The `fcol` variable name points to a rowData column that defines
##' how to group the features during aggregate. This variable can
##' eigher be a vector (we then refer to an *aggregation by vector*)
##' or an adjacency matrix (*aggregation by matrix*).
##'
##' The rowData of the aggregated `SummarizedExperiment` assay
##' contains a `.n` variable that provides the number of parent
##' features that were aggregated.
##'
##' When aggregating with a vector, the newly aggregated
##' `SummarizedExperiment` assay also contains a new `aggcounts` assay
##' containing the aggregation counts matrix, i.e. the number of
##' features that were aggregated for each sample, which can be
##' accessed with the `aggcounts()` accessor.
##'
##' @param object An instance of class [QFeatures] or [SummarizedExperiment].
##'
##' @param i The index or name of the assay which features will be
##' aggregated the create the new assay.
##'
##' @param fcol A `character(1)` naming a rowdata variable (of assay
##' `i` in case of a `QFeatures`) defining how to aggregate the
##' features of the assay. This variable is either a `character`
##' or a (possibly sparse) matrix. See below for details.
##'
##' @param name A `character(1)` naming the new assay. Default is
##' `newAssay`. Note that the function will fail if there's
##' already an assay with `name`.
##'
##' @param fun A function used for quantitative feature
##' aggregation. See Details for examples.
##'
##' @param ... Additional parameters passed the `fun`.
##'
##' @return A `QFeatures` object with an additional assay or a
##' `SummarizedExperiment` object (or subclass thereof).
##'
##' @details
##'
##' Aggregation is performed by a function that takes a matrix as
##' input and returns a vector of length equal to `ncol(x)`. Examples
##' thereof are
##'
##' - [MsCoreUtils::medianPolish()] to fits an additive model (two way
##' decomposition) using Tukey's median polish_ procedure using
##' [stats::medpolish()];
##'
##' - [MsCoreUtils::robustSummary()] to calculate a robust aggregation
##' using [MASS::rlm()] (default);
##'
##' - [base::colMeans()] to use the mean of each column;
##'
##' - `colMeansMat(x, MAT)` to aggregate feature by the calculating
##' the mean of peptide intensities via an adjacency matrix. Shared
##' peptides are re-used multiple times.
##'
##' - [matrixStats::colMedians()] to use the median of each column.
##'
##' - [base::colSums()] to use the sum of each column;
##'
##' - `colSumsMat(x, MAT)` to aggregate feature by the summing the
##' peptide intensities for each protein via an adjacency
##' matrix. Shared peptides are re-used multiple times.
##'
##' See [MsCoreUtils::aggregate_by_vector()] for more aggregation functions.
##'
##' @section Missing quantitative values:
##'
##' Missing quantitative values have different effects based on the
##' aggregation method employed:
##'
##' - The aggregation functions should be able to deal with missing
##' values by either ignoring or propagating them. This is often
##' done with an `na.rm` argument, that can be passed with
##' `...`. For example, `rowSums`, `rowMeans`, `rowMedians`,
##' ... will ignore `NA` values with `na.rm = TRUE`, as illustrated
##' below.
##'
##' - Missing values will result in an error when using `medpolish`,
##' unless `na.rm = TRUE` is used. Note that this option relies on
##' implicit assumptions and/or performes an implicit imputation:
##' when summing, the values are implicitly imputed by 0, assuming
##' that the `NA` represent a trully absent features; when
##' averaging, the assumption is that the `NA` represented a
##' genuinely missing value.
##'
##' - When using robust summarisation, individual missing values are
##' excluded prior to fitting the linear model by robust
##' regression. To remove all values in the feature containing the
##' missing values, use [filterNA()].
##'
##' More generally, missing values often need dedicated handling such
##' as filtering (see [filterNA()]) or imputation (see [impute()]).
##'
##' @section Missing values in the row data:
##'
##' Missing values in the row data of an assay will also impact the
##' resulting (aggregated) assay row data, as illustrated in the
##' example below. Any feature variables (a column in the row data)
##' containing `NA` values will be dropped from the aggregated row
##' data. The reasons underlying this drop are detailed in the
##' `reduceDataFrame()` manual page: only invariant aggregated rows,
##' i.e. rows resulting from the aggregation from identical variables,
##' are preserved during aggregations.
##'
##' The situation illustrated below should however only happen in rare
##' cases and should often be imputable using the value of the other
##' aggregation rows before aggregation to preserve the invariant
##' nature of that column. In cases where an `NA` is present in an
##' otherwise variant column, the column would be dropped anyway.
##'
##' @section Using an adjacency matrix:
##'
##' When considering non-unique peptides explicitly, i.e. peptides
##' that map to multiple proteins rather than as a protein group, it
##' is convenient to encode this ambiguity explicitly using a
##' peptide-by-proteins (sparse) adjacency matrix. This matrix is
##' typically stored in the rowdata and set/retrieved with the
##' [adjacencyMatrix()] function. It can be created manually (as
##' illustrated below) or using `PSMatch::makeAdjacencyMatrix()`.
##'
##' @seealso The *QFeatures* vignette provides an extended example and
##' the *Processing* vignette, for a complete quantitative
##' proteomics data processing pipeline. The
##' [MsCoreUtils::aggregate_by_vector()] manual page provides
##' further details.
##'
##' @aliases aggregateFeatures aggregateFeatures,QFeatures-method
##' aggcounts aggcounts,SummarizedExperiment-method
##' adjacencyMatrix,SummarizedExperiment-method
##' adjacencyMatrix,QFeatures-method
##'
##' @name aggregateFeatures
##'
##' @rdname QFeatures-aggregate
##'
##' @importFrom MsCoreUtils aggregate_by_vector aggregate_by_matrix robustSummary colCounts
##'
##' @importFrom methods as
##'
##' @examples
##'
##' ## ---------------------------------------
##' ## An example QFeatures with PSM-level data
##' ## ---------------------------------------
##' data(feat1)
##' feat1
##'
##' ## Aggregate PSMs into peptides
##' feat1 <- aggregateFeatures(feat1, "psms", "Sequence", name = "peptides")
##' feat1
##'
##' ## Aggregate peptides into proteins
##' feat1 <- aggregateFeatures(feat1, "peptides", "Protein", name = "proteins")
##' feat1
##'
##' assay(feat1[[1]])
##' assay(feat1[[2]])
##' aggcounts(feat1[[2]])
##' assay(feat1[[3]])
##' aggcounts(feat1[[3]])
##'
##' ## --------------------------------------------
##' ## Aggregation with missing quantitative values
##' ## --------------------------------------------
##' data(ft_na)
##' ft_na
##'
##' assay(ft_na[[1]])
##' rowData(ft_na[[1]])
##'
##' ## By default, missing values are propagated
##' ft2 <- aggregateFeatures(ft_na, 1, fcol = "X", fun = colSums)
##' assay(ft2[[2]])
##' aggcounts(ft2[[2]])
##'
##' ## The rowData .n variable tallies number of initial rows that
##' ## were aggregated (irrespective of NAs) for all the samples.
##' rowData(ft2[[2]])
##'
##' ## Ignored when setting na.rm = TRUE
##' ft3 <- aggregateFeatures(ft_na, 1, fcol = "X", fun = colSums, na.rm = TRUE)
##' assay(ft3[[2]])
##' aggcounts(ft3[[2]])
##'
##' ## -----------------------------------------------
##' ## Aggregation with missing values in the row data
##' ## -----------------------------------------------
##' ## Row data results without any NAs, which includes the
##' ## Y variables
##' rowData(ft2[[2]])
##'
##' ## Missing value in the Y feature variable
##' rowData(ft_na[[1]])[1, "Y"] <- NA
##' rowData(ft_na[[1]])
##'
##' ft3 <- aggregateFeatures(ft_na, 1, fcol = "X", fun = colSums)
##' ## The Y feature variable has been dropped!
##' assay(ft3[[2]])
##' rowData(ft3[[2]])
##'
##' ## --------------------------------------------
##' ## Using a peptide-by-proteins adjacency matrix
##' ## --------------------------------------------
##'
##' ## Let's use assay peptides from object feat1 and
##' ## define that peptide SYGFNAAR maps to proteins
##' ## Prot A and B
##'
##' se <- feat1[["peptides"]]
##' rowData(se)$Protein[3] <- c("ProtA;ProtB")
##' rowData(se)
##'
##' ## This can also be defined using anadjacency matrix, manual
##' ## encoding here. See PSMatch::makeAdjacencyMatrix() for a
##' ## function that does it automatically.
##' adj <- matrix(0, nrow = 3, ncol = 2,
##' dimnames = list(rownames(se),
##' c("ProtA", "ProtB")))
##' adj[1, 1] <- adj[2, 2] <- adj[3, 1:2] <- 1
##' adj
##'
##' adjacencyMatrix(se) <- adj
##' rowData(se)
##' adjacencyMatrix(se)
##'
##' ## Aggregation using the adjacency matrix
##' se2 <- aggregateFeatures(se, fcol = "adjacencyMatrix",
##' fun = MsCoreUtils::colMeansMat)
##'
##' ## Peptide SYGFNAAR was taken into account in both ProtA and ProtB
##' ## aggregations.
##' assay(se2)
##'
##'
##' ## Aggregation by matrix on a QFeature object works as with a
##' ## vector
##' ft <- QFeatures(list(peps = se))
##' ft <- aggregateFeatures(ft, "peps", "adjacencyMatrix", name = "protsByMat",
##' fun = MsCoreUtils::colMeansMat)
##' assay(ft[[2]])
##' rowData(ft[[2]])
NULL
##' @exportMethod aggregateFeatures
##' @rdname QFeatures-aggregate
setMethod("aggregateFeatures", "QFeatures",
function(object, i, fcol, name = "newAssay",
fun = MsCoreUtils::robustSummary, ...) {
if (isEmpty(object))
return(object)
## Check arguments
if (any(present <- name %in% names(object)))
stop("There's already one or more assays named: '",
paste0(name[present], collapse = "', '"), "'.")
i <- .normIndex(object, i)
if (length(i) != length(name)) stop("'i' and 'name' must have same length")
if (length(fcol) == 1) fcol <- rep(fcol, length(i))
if (length(i) != length(fcol)) stop("'i' and 'fcol' must have same length")
## Aggregate each assay
for (j in seq_along(i)) {
from <- i[[j]]
to <- name[[j]]
by <- fcol[[j]]
## Create the aggregated assay
aggAssay <- .aggregateQFeatures(object[[from]],
by, fun, ...)
## Add the assay to the QFeatures object
object <- addAssay(object, aggAssay, name = to)
## Link the input assay to the aggregated assay
object <- addAssayLink(object, from = from,
to = to, varFrom = by,
varTo = by)
}
object
})
##' @exportMethod aggregateFeatures
##' @rdname QFeatures-aggregate
setMethod("aggregateFeatures", "SummarizedExperiment",
function(object, fcol, fun = MsCoreUtils::robustSummary, ...)
.aggregateQFeatures(object, fcol, fun, ...))
.aggregateQFeatures <- function(object, fcol, fun, ...) {
## Copied from the PSMatch package, given that it is not available
## on Bioconductor yet.
.makePeptideProteinVector <- function(m, collapse = ";") {
stopifnot(inherits(m, "Matrix"))
vec <- rep(NA_character_, nrow(m))
for (i in seq_len(nrow(m)))
vec[i] <- paste(names(which(m[i, ] != 0)), collapse = collapse)
names(vec) <- rownames(m)
vec
}
if (missing(fcol))
stop("'fcol' is required.")
m <- assay(object, 1)
rd <- rowData(object)
if (!fcol %in% names(rd))
stop("'fcol' not found in the assay's rowData.")
groupBy <- rd[[fcol]]
## Store class of assay i in case it is not a SummarizedExperiment
## so that the aggregated assay can be reverted to that class
.class <- class(object)
## Message about NA values is quant/row data
has_na <- character()
if (anyNA(m))
has_na <- c(has_na, "quantitative")
if (anyNA(rd, recursive = TRUE))
has_na <- c(has_na, "row")
if (length(has_na)) {
msg <- paste(paste("Your", paste(has_na, collapse = " and "),
" data contain missing values."),
"Please read the relevant section(s) in the",
"aggregateFeatures manual page regarding the",
"effects of missing values on data aggregation.")
message(paste(strwrap(msg), collapse = "\n"))
}
if (is.vector(groupBy) & !is.list(groupBy)) { ## atomic vectors
aggregated_assay <- aggregate_by_vector(m, groupBy, fun, ...)
aggcount_assay <- aggregate_by_vector(m, groupBy, colCounts)
aggregated_rowdata <- QFeatures::reduceDataFrame(rd, rd[[fcol]],
simplify = TRUE,
drop = TRUE,
count = TRUE)
assays <- SimpleList(assay = aggregated_assay, aggcounts = aggcount_assay)
rowdata <- aggregated_rowdata[rownames(aggregated_assay), , drop = FALSE]
} else if (is(groupBy, "Matrix")) {
aggregated_assay <- aggregate_by_matrix(m, groupBy, fun, ...)
## Remove the adjacency matrix that should be dropped anyway
rd[[fcol]] <- NULL
## Temp variable for unfolding and reducing - removed later
rd[["._vec_"]] <- .makePeptideProteinVector(groupBy)
rd <- unfoldDataFrame(rd, "._vec_")
aggregated_rowdata <- reduceDataFrame(rd, rd[["._vec_"]], drop = TRUE)
aggregated_rowdata[["._vec_"]] <- NULL
## Count the number of peptides per protein
.n <- apply(groupBy != 0, 2, sum)
aggregated_rowdata[[".n"]] <- .n[rownames(aggregated_rowdata)]
assays <- SimpleList(assay = as.matrix(aggregated_assay)) ## to discuss
rowdata <- aggregated_rowdata[rownames(aggregated_assay), , drop = FALSE]
} else stop("'fcol' must refer to an atomic vector or a sparse matrix.")
se <- SummarizedExperiment(assays = assays,
colData = colData(object),
rowData = rowdata)
## If the input objects weren't SummarizedExperiments, then try to
## convert the merged assay into that class. If the conversion
## fails, keep the SummarizedExperiment, otherwise use the
## converted object (see issue #78).
if (.class != "SummarizedExperiment")
se <- tryCatch(as(se, .class),
error = function(e) se)
return(se)
}
##' @export
##'
##' @importFrom ProtGenerics adjacencyMatrix
##'
##' @rdname QFeatures-aggregate
##'
##' @param object An instance of class `SummarizedExperiment` or
##' `QFeatures`.
##'
##' @param adjName `character(1)` with the variable name containing
##' the adjacency matrix. Default is `"adjacencyMatrix"`.
##'
##' @param i The index or name of the assays to extract the advaceny
##' matrix from. All must have a rowdata variable named `adjName`.
setMethod("adjacencyMatrix", "QFeatures",
function(object, i, adjName = "adjacencyMatrix")
List(lapply(experiments(object)[i],
.adjacencyMatrix,
adjName = adjName)))
setMethod("adjacencyMatrix", "SummarizedExperiment",
function(object, adjName = "adjacencyMatrix")
.adjacencyMatrix(object, adjName))
##' @export
##'
##' @rdname QFeatures-aggregate
##'
##' @param i When adding an adjacency matrix to an assay of a
##' `QFeatures` object, the index or name of the assay the
##' adjacency matrix will be added to. Ignored when `x` is an
##' `SummarizedExperiment`.
##'
##' @param value An adjacency matrix with row and column names. The
##' matrix will be coerced to compressed, column-oriented sparse
##' matrix (class `dgCMatrix`) as defined in the `Matrix` package,
##' as generaled by the [sparseMatrix()] constructor.
"adjacencyMatrix<-" <- function(object, i, adjName = "adjacencyMatrix", value) {
validAdjacencyMatrix(value)
## Coerse to a sparse matrix
value <- as(value, "sparseMatrix")
if (inherits(object, "SummarizedExperiment")) {
if (!identical(rownames(value), rownames(object)))
stop("Row names of the SummarizedExperiment and the adjacency matrix must match.")
if (adjName %in% colnames(rowData(object)))
stop("Found an existing variable ", adjName, ".")
rowData(object)[[adjName]] <- value
return(object)
}
stopifnot(inherits(object, "QFeatures"))
if (length(i) != 1)
stop("'i' must be of length one. Repeat the call to add a matrix to multiple assays.")
if (is.numeric(i) && i > length(object))
stop("Subscript is out of bounds.")
if (is.character(i) && !(i %in% names(object)))
stop("Assay '", i, "' not found.")
se <- object[[i]]
adjacencyMatrix(se, adjName = adjName) <- value
object[[i]] <- se
return(object)
}
.adjacencyMatrix <- function(x, adjName = "adjacencyMatrix") {
stopifnot(adjName %in% names(rowData(x)))
ans <- rowData(x)[[adjName]]
if (is.null(colnames(ans)) | is.null(rownames(ans)))
warning("The adjacency matrix should have row and column names.")
if (!is(ans, "sparseMatrix"))
warning("The adjacency matrix should ideally be sparse.")
ans
}
validAdjacencyMatrix <- function(x) {
if (is.null(colnames(x)) | is.null(rownames(x)))
stop("The matrix must have row and column names.")
if (any(Matrix::rowSums(x) == 0))
stop("rowSums() == 0 detected: peptides must belong to at least one protein.")
if (any(Matrix::colSums(x) == 0))
stop("colSums() == 0 detected: proteins must be identified by at least one peptide.")
TRUE
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.