#se_collapse_by_row <- function
#' Collapse SummarizedExperiment data by row
#'
#' Collapse SummarizedExperiment data by row
#'
#' Purpose is to collapse rows of a `SummarizedExperiment` object,
#' where measurements for a given entity, usually a gene, are split
#' across multiple rows in the source data. The output of this function
#' should be measurements appropriately summarized to the gene level.
#'
#' The key arguments are `group_func_name`, and `data_transform`.
#' Note that data is inverse-transformed based upon `data_transform`,
#' prior to calculating group summary values defined by `group_func_name`.
#' The reason is to enable using `group_func_name="sum"` on normal
#' space abundance values, when input data has already been
#' transformed with `log2(1 + x)` for example. In this case it is most
#' appropriate to take the `sum` of normal space abundance values,
#' then to re-apply the transformation afterwards.
#'
#' However, when using `group_func_name="mean"` it is usually
#' recommended to use `data_transform="none"` so that data is maintained
#' in appropriately transformed state.
#'
#' The driving use case is proteomics mass spectrometry data,
#' where measurements are described in terms of peptide sequences,
#' with or without optional post-translational modification (PTM),
#' and the peptide sequences are annotated to a source protein or gene.
#' This function can be used to:
#'
#' * collapse peptide-PTM data to the peptide level
#' * collapse peptide data to the protein level
#'
#' In future it may be used to collapse multiple microarray probe
#' measurements to the gene level, although that process is more likely
#' to be useful and recommended after performing probe-level statistical
#' analysis.
#'
#' ### Proteomics mass spectrometry analysis
#'
#' For proteomics mass spectrometry data, proteins are inconsistently
#' fragmented into smaller peptides of varying sizes. The peptides are
#' usually separated on a chromatography column, from which aliquot
#' fractions are taken and measured by mass spectrometry. The total
#' signal derived from the original protein is therefore some combination
#' of the measured peptide parts.
#'
#' In some upstream data processing tools, such as Proteomics Discoverer,
#' and PEAKS, the peptide data may be annotated with observed
#' modification events (PTM). In this scenario, peptide measurements
#' are split across multiple rows of data, where each
#' row represents an observed combination of peptide and PTMs.
#'
#' ### Collapse methods
#'
#' It is fairly straightforward to observe peptide-PTM measurement
#' data is correlated with overall protein quantification, and that
#' the specific combination of peptide fragments may be inconsistent
#' across samples. That is, one may observe five peptides of protein A
#' in one sample, and may observe seven peptides of protein A in
#' another sample. The quantities of each peptide may be inconsistent,
#' due to variability in protein fragmentation across samples. However,
#' the general sum of peptide measurements is typically fairly stable
#' across samples, especially for proteins of moderate to high abundance
#' which are known to have stable abundance per cell.
#'
#' Choice of method to collapse measurements is not trivial, and is
#' therefore configurable. In general, proteomics abundances are
#' analyzed after `log2( 1 + x )` transformation. However, measurements
#' cannot be summed in log2 form, which would be equivalent to
#' multiplying measurements in normal form. Measurements can be summed
#' but only after exponentiating the data, for example the reciprocal
#' `( 2 ^ x ) - 1` is sufficient.
#'
#' @family jamses SE utilities
#'
#' @return `SummarizedExperiment` object with these changes:
#' * rows will be collapsed by `row_groups`, for each `assays(se)`
#' `numeric` matrix defined by `assay_names`. The collapse may
#' optionally apply a data transformation defined in
#' `data_transform` in order to apply an appropriate `numeric` summary
#' calculation.
#' * `rowData(se)` will also be collapsed by `shrinkDataFrame()` to
#' combine unique values from each row annotation.
#'
#' @param se `SummarizedExperiment`
#' @param rows `character` vector of `rows(se)` to use for analysis. When
#' `rows=NULL` the default is to use all `rows(se)`.
#' @param row_groups `character` vector representing groups of rows to
#' be combined.
#' @param assay_names `character` vector of `names(assays(se))` to use for
#' the collapse operation. When `assay_names=NULL` the default is to
#' use all `assays(se)`.
#' @param group_func_name `character` name of function used to aggregate
#' measurement data within `row_groups`.
#' * `sum` - takes the `sum()` of each value in the group. This option
#' should be used together with `data_transform` when there has been
#' any data transformation, so that the data is inverse-transformed
#' prior to calculating the `sum()`, after which data is re-transformed
#' to its original state. This method is appropriate for log2p
#' `log2(1 + x)` transformed abundance measurements for example.
#' * `mean` - calculates the mean value per group. Note that in this
#' case is it usually recommended not to define `data_transform`
#' so that values are averaged in the appropriately transformed
#' numeric space.
#' * `weighted.mean` - calculates `weighted.mean()` where weights `w`
#' are defined by the values used. This method may be appropriate and
#' effective with normal space abundance values derived from
#' proteomics mass spec quantitation.
#' * `geomean` - calculates geometric mean of values in each group.
#' * `none` -
#' @param rowStatsFunc `function` optional function used instead of
#' `group_func_name`.
#' @param rowDataColnames `character` subset of colnames in `rowData(se)`
#' to be retained in the output data. Multiple values are combined
#' usually by comma-delimited concatenation within `row_groups`,
#' therefore it may be beneficial to include only relevant columns
#' in that output.
#' @param keepNULLlevels `logical` indicating whether to drop unused
#' factor levels in `row_groups`, this argument is passed to
#' `jamba::rowGroupMeans()`.
#' @param delim `character` string indicating a delimiter.
#' @param data_transform `character` string indicating which
#' transformation was used when preparing the assay data. The
#' assumption is that all assays were transformed by this method.
#' During processing, data is inverse-transformed prior to applying
#' the `group_func_name` or `rowStatsFunc` if supplied. After that
#' function is applied, data is transformed using this function.
#' The purpose is to enable taking the `sum()` in proper measured
#' absolute units (in normal space for example) where relevant,
#' after which is original numeric transformation is re-applied.
#' @param verbose `logical` indicating whether to print verbose output.
#' @param ... additional arguments are passed to `jamba::rowGroupMeans()`.
#'
#' @export
se_collapse_by_row <- function
(se,
rows=rownames(se),
row_groups,
assay_names=NULL,
group_func_name=c("sum",
"mean",
"weighted.mean",
"geomean",
"none"),
rowStatsFunc=NULL,
rowDataColnames=NULL,
keepNULLlevels=FALSE,
delim="[ ]*[;,]+[ ]*",
data_transform=c("none",
"log2p+sqrt",
"log2+sqrt",
"log2p",
"log2"),
verbose=TRUE,
...)
{
## Purpose is to collapse rows of a SummarizedExperiment object, intended
## for proteomics data to combine per-peptide rows into per-protein rows
##
## SE is a SummarizedExperiment object
##
## rows is a vector of rownames(SE) or a subset of rows to include
##
## row_groups is a vector of length(rows) whose names are rownames(SE)
## that represents the row grouping. For example:
## rows <- c("Apoe_pep1", "Apoe_pep2", "Gapdh_pep1")
## row_groups <- c(Apoe_pep1="Apoe", Apoe_pep2="Apoe", Gapdh_pep1="Gapdh")
##
## signal a vector of names(assays(SE)) or subset to include in the
## operation, any names not included will be dropped from the resulting
## SummarizedExperiment object.
##
## group_func_name is a vector of character names of row stats functions,
## provided for convenience. If group_func_name is length=1, it is used
## for all signal values. Otherwise, group_func_name is expected to be
## a vector named by signal, allowing each signal to use a different
## row stats function appropriate to that signal type.
##
## rowStatsFunc is an optional custom function, or list of custom
## functions named by signal, intended to produce a single numeric
## value per row, for example rowMeans() or rowMedians(). It
## is sent to rowGroupMeans(...,rowStatsFunc=rowStatsFunc) and must
## accept matrix input, and perform row-wise operation to produce a
## single numeric value per row.
## Note that if a list is supplied, any names matching signal will
## cause that function to be used instead of the corresponding
## group_func_name function.
##
## keepNULLlevels=TRUE, if groups is a factor, then all factor levels
## will be maintained even if no values exist with that factor level.
## keepNULLlevels=FALSE, if groups is a factor, then only factor levels
## containing values will be returned.
##
## rowDataColnames is a vector of colnames(rowData(SE)) or subset, which
## should be included in the resulting collapsed rowData data.frame. It
## is useful to omit any column no longer necessary because it saves
## substantial space and time in this operation.
##
## data_transform describes how input expression values
## were transformed, so that they can be un-transformed
## before applying the math operation. Use "none" to perform
## no additional transformations before applying the math.
##
## data_transform "log2+sqrt" means data was transformed log2(1+sqrt(x))
## which means data was sqrt() transformed, then log2(1+x)
## transformed. Therefore, data will be exponentiated with
## (2^(x)-1)^2 which reverses the log2 transform with "(2^x)-1"
## then reverses the sqrt by raising to the power of 2 with "x^2".
##
if (!jamba::check_pkg_installed("SummarizedExperiment")) {
stop("se_collapse_by_row() requires SummarizedExperiment");
}
## Parameter validation
##
if (length(assay_names) == 0) {
assay_names <- names(SummarizedExperiment::assays(se));
} else {
assay_names <- intersect(assay_names, names(SummarizedExperiment::assays(se)));
}
if (length(assay_names) == 0) {
stop("se_collapse_by_row() needs assay_names in names(assays(se)).");
}
if (length(rowDataColnames) == 0) {
rowDataColnames <- colnames(SummarizedExperiment::rowData(se));
}
## group_func_name will become a vector of text or NA values, named by signal
group_func_args <- c("sum",
"mean",
"weighted.mean",
"geomean",
"none");
if (identical(group_func_args, group_func_name)) {
group_func_name <- head(group_func_name, 1);
}
if (length(group_func_name) > 0) {
if (length(names(group_func_name)) == 0) {
## If given group_func_name without names, assign names by signal
group_func_name <- rep(group_func_name, length.out=length(assay_names));
names(group_func_name) <- assay_names;
}
} else {
group_func_name <- NA;
}
group_func_name <- jamba::nameVector(group_func_name[assay_names],
assay_names);
group_func_name[group_func_name %in% c("NA","","none","custom")] <- NA;
## rowStatsFunc will become a list named by signal after this section
if (length(rowStatsFunc) > 0) {
if (jamba::igrepHas("function", class(rowStatsFunc))) {
if (length(group_func_name) == 0) {
customSignal <- assay_names;
} else {
customSignal <- assay_names[group_func_name %in%
c(NA, "", "none", "NA")];
}
rowStatsFunc <- lapply(jamba::nameVector(customSignal), function(i){
rowStatsFunc
});
} else if (!jamba::igrepHas("list", class(rowStatsFunc))) {
stop(paste0("rowStatsFunc must be NULL, a function,",
" or a list of functions"));
}
rowStatsFunc <- rowStatsFunc[assay_names];
names(rowStatsFunc) <- assay_names;
rowStatsFunc;
} else {
rowStatsFunc <- lapply(jamba::nameVector(assay_names), function(i)NULL);
}
## data.frame summary of which stats function to use per signal
statsFuncDF <- as.data.frame(jamba::rmNULL(
list(group_func_name=group_func_name,
rowStatsFunc=sapply(rowStatsFunc[assay_names], class))));
if (verbose) {
jamba::printDebug("se_collapse_by_row(): ",
"Performing these operations by assay_names:");
print(statsFuncDF);
}
###############################################################
## if supplied group_func_name, define the relevant function
## sum
rowSum <- function(x, ..., na.rm=TRUE){
rowSums(x,
...,
na.rm=na.rm)
}
## mean
rowMean <- function(x, ..., na.rm=TRUE){
rowMeans(x,
...,
na.rm=na.rm)
}
## weighted.mean
rowWeightedMean <- function(x, w=x, ..., na.rm=TRUE){
#isNA <- is.na(x);
#x <- x[!isNA];
#w[is.na(w)] <- 0;
apply(x, 1, function(x1){
weighted.mean(x=x1,
w=x1,
na.rm=TRUE,
...);
})
#rowWeightedMeans(x=x,
# w=w,
# ...,
# na.rm=TRUE);
}
# custom function geomean()
geomean <- function
(x,
na.rm=TRUE,
naValue=NA,
offset=0,
...)
{
## Purpose is to calculate the classical geometric mean
if (na.rm && any(is.na(x))) {
x <- jamba::rmNA(x,
naValue=naValue);
}
2^mean(log2(x + offset)) - offset;
}
# custom function rowGeomeans()
rowGeomeans <- function
(x,
na.rm=TRUE,
verbose=FALSE,
...)
{
## Purpose is to provide a wrapper to geomean(x, matrixBy="row", ...)
geomean(x=x,
matrixBy="row",
na.rm=na.rm,
verbose=verbose,
...)
}
###############################################################
## populate rowStatsFunc
for (iSignal in assay_names) {
if (statsFuncDF[iSignal,2] %in% "NULL") {
if (is.na(statsFuncDF[iSignal,1])) {
if (verbose) {
jamba::printDebug("se_collapse_by_row(): ",
"Dropping assay_names:",
iSignal,
" which has no stats function defined.",
fgText=c("yellow","red"));
}
assay_names <- setdiff(assay_names, iSignal);
} else {
if ("sum" %in% statsFuncDF[iSignal,1]) {
rowStatsFunc[[iSignal]] <- rowSum;
} else if ("mean" %in% statsFuncDF[iSignal,1]) {
rowStatsFunc[[iSignal]] <- rowMean;
} else if ("weighted.mean" %in% statsFuncDF[iSignal,1]) {
rowStatsFunc[[iSignal]] <- rowWeightedMean;
} else if ("geomean" %in% statsFuncDF[iSignal,1]) {
rowStatsFunc[[iSignal]] <- rowGeomeans;
} else {
if (verbose) {
jamba::printDebug("se_collapse_by_row(): ",
"Dropping assay_names:",
iSignal,
" unrecognized group_func_name:",
statsFuncDF[iSignal,1],
fgText=c("yellow","red"));
}
assay_names <- setdiff(assay_names, iSignal);
}
}
}
}
if (length(assay_names) == 0) {
stop("se_collapse_by_row() had no assay_names matching group_func_name or rowStatsFunc.");
}
###############################################################
# Ensure we use only rows which are present in rownames(se)
# and in names(row_groups)
if (length(names(row_groups)) == 0) {
if (length(row_groups) == length(rows)) {
names(row_groups) <- rows;
} else if (length(row_groups) == nrow(se)) {
names(row_groups) <- rownames(se);
}
}
rows <- intersect(rows, rownames(se));
rows <- intersect(rows, names(row_groups));
if (length(rows) == 0) {
stop("se_collapse_by_row() needs 'rows' present in rownames(se) and names(row_groups)");
}
row_groups <- row_groups[rows];
# any row_group with no value will not be collapsed,
# it will retain its original value
if (any(nchar(row_groups) == 0)) {
iRG <- (is.na(row_groups) | nchar(row_groups) == 0);
if (any(iRG)) {
row_groups[iRG] <- rows[iRG];
if (verbose) {
jamba::printDebug("se_collapse_by_row(): ",
"Note: Replaced ",
"row_groups[nchar(row_groups) == 0]",
" with the corresponding value from rows",
fgText=c("orange","red"));
}
#stop("se_collapse_by_row() row_groups contains some empty values, any(nchar(row_groups)=0)");
}
}
###############################################################
## rowDataColnames the colnames to keep in the new rowData
se_rowDataColnames <- colnames(SummarizedExperiment::rowData(se));
rowDataColnames <- intersect(rowDataColnames, se_rowDataColnames);
if (length(rowDataColnames) == 0 && length(se_rowDataColnames) > 0) {
rowDataColnames <- colnames(SummarizedExperiment::rowData(se));
}
#################################################################
## Iterate each signal and perform the collapse
## Any signals not included will be dropped from the resulting object
assaysGroupedL <- lapply(jamba::nameVector(assay_names), function(iSignal){
if (verbose) {
jamba::printDebug("se_collapse_by_row(): ",
"Collapsing assay_names:",
iSignal);
}
useStatsFunc <- rowStatsFunc[[iSignal]];
iMatrix <- SummarizedExperiment::assays(se[rows,])[[iSignal]];
if (data_transform %in% "log2p+sqrt") {
if (verbose) {
jamba::printDebug("se_collapse_by_row():",
"Applying de-transform: ",
"( 2 ^ x - 1 ) ^2");
}
## log2(1+sqrt(x))
iMatrix <- (2 ^ iMatrix - 1) ^ 2;
} else if (data_transform %in% "log2+sqrt") {
if (verbose) {
jamba::printDebug("se_collapse_by_row():",
"Applying de-transform: ",
"( 2 ^ x ) ^ 2");
}
## log2(sqrt(x))
iMatrix <- (2 ^ iMatrix) ^ 2;
} else if (data_transform %in% "log2p") {
## log2( 1 + x )
if (verbose) {
jamba::printDebug("se_collapse_by_row():",
"Applying de-transform: ",
"( 2 ^ x ) - 1");
}
iMatrix <- 2 ^ iMatrix - 1;
} else if (data_transform %in% "log2") {
## log2( x )
if (verbose) {
jamba::printDebug("se_collapse_by_row():",
"Applying de-transform: ",
"( 2 ^ x )");
}
iMatrix <- 2 ^ iMatrix;
} else {
if (verbose) {
jamba::printDebug("se_collapse_by_row():",
"No de-transform was applied.");
}
}
iMatrixGrp <- t(jamba::rowGroupMeans(
x=t(iMatrix),
groups=row_groups,
rowStatsFunc=useStatsFunc,
verbose=FALSE,
keepNULLlevels=keepNULLlevels,
...
));
# re-apply the appropriate transformation
if (data_transform %in% "log2p+sqrt") {
## log2(1+sqrt(x))
if (verbose) {
jamba::printDebug("se_collapse_by_row():",
"Applying re-transform: ",
"log2( 1 + sqrt(x) )");
}
iMatrixGrp <- log2( 1 + sqrt(iMatrixGrp));
} else if (data_transform %in% "log2+sqrt") {
## log2(sqrt(x))
if (verbose) {
jamba::printDebug("se_collapse_by_row():",
"Applying re-transform: ",
"log2( sqrt(x) )");
}
iMatrixGrp <- log2( 1 + sqrt(iMatrixGrp) );
} else if (data_transform %in% "log2p") {
## log2( 1 + x )
if (verbose) {
jamba::printDebug("se_collapse_by_row():",
"Applying re-transform: ",
"log2( 1 + x )");
}
iMatrixGrp <- log2( 1 + iMatrixGrp );
} else if (data_transform %in% "log2") {
## log2( x )
if (verbose) {
jamba::printDebug("se_collapse_by_row():",
"Applying re-transform: ",
"log2( x )");
}
iMatrixGrp <- log2( iMatrixGrp );
} else {
if (verbose) {
jamba::printDebug("se_collapse_by_row():",
"No transform was re-applied.");
}
}
# Revert zero to NA
# - review whether this option should be configurable
iMatrixGrp[iMatrixGrp == 0] <- NA;
iMatrixGrp;
});
# Now try to be clever and create a new rowData() which indicates
# the row groupings
if (verbose) {
jamba::printDebug("se_collapse_by_row(): ",
"Collapsing rowData(se)");
}
rowDataShrunk <- shrinkDataFrame(
x=SummarizedExperiment::rowData(se[rows,])[, rowDataColnames, drop=FALSE],
groupBy=row_groups,
include_num_reps=TRUE,
verbose=(verbose - 1) > 0,
...);
## Optional polish step to re-delimit column values to make them unique
if (length(delim) > 0) {
for (iCol in colnames(rowDataShrunk)) {
if (jamba::igrepHas("character", class(rowDataShrunk[[iCol]])) &&
jamba::igrepHas(delim, rowDataShrunk[[iCol]])) {
if (verbose) {
jamba::printDebug("se_collapse_by_row(): ",
"Re-delimiting unique values in column:",
iCol);
}
## split fields using delim, then call cPasteUnique()
## to paste only unique values back together
rowDataShrunk[[iCol]] <- jamba::cPasteU(
strsplit(rowDataShrunk[[iCol]], delim),
na.rm=TRUE);
}
}
}
if (verbose) {
jamba::printDebug("se_collapse_by_row(): ",
"Re-creating SummarizedExperiment");
}
se_shrunk <- SummarizedExperiment::SummarizedExperiment(
assays=assaysGroupedL,
rowData=rowDataShrunk,
colData=SummarizedExperiment::colData(se),
metadata=list(
genes=rownames(rowDataShrunk),
samples=colnames(se)));
#rownames(SEshrunk) <- rownames(rowDataShrunk);
se_shrunk;
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.