In FelixErnst/mia: Microbiome analysis

agglomerateByRank

R Documentation

Agglomerate data using taxonomic information or other grouping

Description

Agglomeration functions can be used to sum up data based on specific criteria such as taxonomic ranks, variables or prevalence.

agglomerateByRank can be used to sum up data based on associations with certain taxonomic ranks, as defined in rowData. Only ranks listed by taxonomyRanks() can be used.

agglomerateByVariable merges data on rows or columns of a SummarizedExperiment as defined by a factor along the chosen dimension. This function allows agglomeration of data based on variables other than taxonomic ranks. Metadata from the rowData or colData are retained as defined by archetype. Assays are agglomerated, i.e. summed up. If an assay contains values other than counts or absolute values, this can produce meaningless values.

agglomerateByRanks takes a SummarizedExperiment, splits it along the taxonomic ranks, aggregates the data per rank, converts the input to a SingleCellExperiment object and stores the aggregated data as alternative experiments. unsplitByRanks takes these alternative experiments and flattens them again into a single SummarizedExperiment.

Usage

agglomerateByRank(x, ...)

agglomerateByVariable(x, ...)

agglomerateByRanks(x, ...)

unsplitByRanks(x, ...)

## S4 method for signature 'TreeSummarizedExperiment'
agglomerateByRank(
  x,
  rank = taxonomyRanks(x)[1],
  update.tree = agglomerateTree,
  agglomerate.tree = agglomerateTree,
  agglomerateTree = TRUE,
  ...
)

## S4 method for signature 'SingleCellExperiment'
agglomerateByRank(
  x,
  rank = taxonomyRanks(x)[1],
  altexp = NULL,
  altexp.rm = strip_altexp,
  strip_altexp = TRUE,
  ...
)

## S4 method for signature 'SummarizedExperiment'
agglomerateByRank(
  x,
  rank = taxonomyRanks(x)[1],
  empty.rm = TRUE,
  empty.fields = c(NA, "", " ", "\t", "-", "_"),
  ...
)

## S4 method for signature 'TreeSummarizedExperiment'
agglomerateByVariable(
  x,
  by,
  group = f,
  f,
  update.tree = mergeTree,
  mergeTree = TRUE,
  ...
)

## S4 method for signature 'SummarizedExperiment'
agglomerateByVariable(x, by, group = f, f, ...)

## S4 method for signature 'SummarizedExperiment'
agglomerateByRanks(
  x,
  ranks = taxonomyRanks(x),
  na.rm = TRUE,
  as.list = FALSE,
  ...
)

## S4 method for signature 'SingleCellExperiment'
agglomerateByRanks(
  x,
  ranks = taxonomyRanks(x),
  na.rm = TRUE,
  as.list = FALSE,
  ...
)

## S4 method for signature 'TreeSummarizedExperiment'
agglomerateByRanks(
  x,
  ranks = taxonomyRanks(x),
  na.rm = TRUE,
  as.list = FALSE,
  ...
)

splitByRanks(x, ...)

## S4 method for signature 'SingleCellExperiment'
unsplitByRanks(
  x,
  ranks = taxonomyRanks(x),
  keep.dimred = keep_reducedDims,
  keep_reducedDims = FALSE,
  ...
)

## S4 method for signature 'TreeSummarizedExperiment'
unsplitByRanks(
  x,
  ranks = taxonomyRanks(x),
  keep.dimred = keep_reducedDims,
  keep_reducedDims = FALSE,
  ...
)

Arguments

`x`	`TreeSummarizedExperiment`.
`...`	arguments passed to `agglomerateByRank` function for `SummarizedExperiment` objects and other functions. See `agglomerateByRank` for more details.
`rank`	`Character scalar`. Defines a taxonomic rank. Must be a value of `taxonomyRanks()` function.
`update.tree`	`Logical scalar`. Should `rowTree()` also be merged? (Default: `TRUE`)
`agglomerate.tree`	Deprecated. Use `update.tree` instead.
`agglomerateTree`	Deprecated. Use `update.tree` instead.
`altexp`	`Character scalar` or `integer scalar`. Specifies an alternative experiment containing the input data.
`altexp.rm`	`Logical scalar`. Should alternative experiments be removed prior to agglomeration? This prevents too many nested alternative experiments by default. (Default: `TRUE`)
`strip_altexp`	Deprecated. Use `altexp.rm` instead.
`empty.rm`	`Logical scalar`. Defines whether rows including `empty.fields` in specified `rank` will be excluded. (Default: `TRUE`)
`empty.fields`	`Character vector`. Defines which values should be regarded as empty. (Default: `c(NA, "", " ", "\t", "-", "_")`). They will be removed if `empty.rm = TRUE` before agglomeration.
`by`	`Character scalar`. Determines if data is merged row-wise / for features ('rows') or column-wise / for samples ('cols'). Must be `'rows'` or `'cols'`.
`group`	`Character scalar`, `character vector` or `factor vector`. A column name from `rowData(x)` or `colData(x)` or alternatively a vector specifying how the merging is performed. If vector, the value must be the same length as `nrow(x)/ncol(x)`. Rows/Cols corresponding to the same level will be merged. If `length(levels(group)) == nrow(x)/ncol(x)`, `x` will be returned unchanged.
`f`	Deprecated. Use `group` instead.
`mergeTree`	Deprecated. Use `update.tree` instead.
`ranks`	`Character vector`. Defines taxonomic ranks. Must all be values of `taxonomyRanks()` function.
`na.rm`	`Logical scalar`. Should NA values be omitted? (Default: `TRUE`)
`as.list`	`Logical scalar`. Should the list of `SummarizedExperiment` objects be returned by the function `agglomerateByRanks` as a SimpleList or stored in altExps? (Default: `FALSE`)
`keep.dimred`	`Logical scalar`. Should the `reducedDims(x)` be transferred to the result? Note that this breaks the link between the reduced dimensions and the data used to calculate them. (Default: `FALSE`)
`keep_reducedDims`	Deprecated. Use `keep.dimred` instead.
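
How `empty.rm` and `empty.fields` interact can be sketched in base R. This is an illustration only, not the mia implementation: the `family` vector and the `counts` matrix are made-up stand-ins for a rowData column and an assay, and `rowsum()` stands in for the actual summing step.

```r
# Default set of values regarded as empty, copied from the Usage section.
empty_fields <- c(NA, "", " ", "\t", "-", "_")

# Hypothetical Family column of rowData for five features (assumption).
family <- c("Bacteroidaceae", NA, "", "Bacteroidaceae", "Rikenellaceae")
counts <- matrix(1:10, nrow = 5, dimnames = list(NULL, c("S1", "S2")))

# Flag features whose rank value counts as empty; %in% also matches NA.
is_empty <- family %in% empty_fields

# With empty.rm = TRUE, such rows are dropped before the grouped sum.
agg <- rowsum(counts[!is_empty, , drop = FALSE], group = family[!is_empty])
agg
```

With `empty.rm = FALSE`, the flagged rows would instead be kept; how they are labelled in that case is not covered by this sketch.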

Details

Agglomeration sums up the values of assays at the specified taxonomic level. With certain assays, e.g. those that include binary or negative values, this summing can produce meaningless values. In those cases, consider performing agglomeration first and applying the transformation afterwards.
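
Why the order of transformation and agglomeration matters can be seen in a small base R sketch. It assumes nothing about mia internals: `rowsum()` stands in for the summing step, and the `counts` matrix and `genus` labels are invented for illustration.

```r
# Toy count matrix: 4 features (rows) x 2 samples (columns).
counts <- matrix(
    c(5, 0,
      3, 2,
      0, 0,
      7, 1),
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("taxon", 1:4), c("S1", "S2"))
)
# Hypothetical genus label for each feature (assumption).
genus <- c("A", "A", "B", "B")

# Agglomeration amounts to a grouped row sum.
agg <- rowsum(counts, group = genus)

# Transform first, then sum: the values are no longer binary.
pa_then_agg <- rowsum((counts > 0) * 1, group = genus)

# Sum first, then transform: the result is a valid presence/absence table.
agg_then_pa <- (agg > 0) * 1
```

Here `pa_then_agg["A", "S1"]` is 2, which is not a presence/absence value, whereas every entry of `agg_then_pa` is 0 or 1.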

agglomerateByVariable works similarly to sumCountsAcrossFeatures. However, additional support for TreeSummarizedExperiment was added and field-agnostic names were used. In addition, the archetype argument lets the user select how to preserve row or column data.

To merge assay data, functions from scuttle are used.
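
The effect of merging along columns with a grouping vector can be sketched in base R. This is a conceptual stand-in, not the scuttle or mia code path: `assay_mat` and `sample_type` are invented, and `rowsum()` on the transposed matrix plays the role of the column-wise sum.

```r
# Toy assay: 3 features x 4 samples.
assay_mat <- matrix(
    1:12, nrow = 3,
    dimnames = list(paste0("f", 1:3), paste0("S", 1:4))
)
# Hypothetical sample grouping with one entry per column (assumption).
sample_type <- factor(c("soil", "soil", "water", "water"))
stopifnot(length(sample_type) == ncol(assay_mat))

# Column-wise merge: transpose, sum rows per level, transpose back.
merged <- t(rowsum(t(assay_mat), group = sample_type))
merged
```

Columns sharing a level are summed into one column per level. If every column had its own level, there would be nothing to merge, which matches the note on `group` that `x` is then returned unchanged.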

agglomerateByRanks uses all available taxonomic ranks by default, but this can be controlled by setting ranks manually. NA values are removed by default, since they would not make sense if the result is later passed to unsplitByRanks. The input data remains unchanged in the returned SingleCellExperiment objects.

unsplitByRanks removes any NA value on each taxonomic rank so that no ambiguous data is created. In addition, a column taxonomicLevel is created or overwritten in the rowData to specify which alternative experiment each row originates from. This can also be used with splitAltExps to split the result along the same factor again. The input data from the base object is not returned, only the data from the altExp(). Be aware that changes to the rowData of the base object are not returned; only the colData of the base object is kept.
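
The flattening step and the role of taxonomicLevel can be illustrated with plain matrices. This is a sketch under stated assumptions rather than the mia implementation: the list `by_rank` stands in for the alternative experiments, and the data frame `row_data` stands in for rowData.

```r
# Hypothetical per-rank results, standing in for alternative experiments.
by_rank <- list(
    Phylum = matrix(c(10, 4), nrow = 1,
                    dimnames = list("Firmicutes", c("S1", "S2"))),
    Genus = matrix(c(6, 4, 1, 3), nrow = 2,
                   dimnames = list(c("Bacillus", "Clostridium"),
                                   c("S1", "S2")))
)

# Stack the per-rank tables into one assay.
flat <- do.call(rbind, by_rank)

# Record the rank each row came from, as taxonomicLevel does in rowData.
row_data <- data.frame(
    taxonomicLevel = factor(
        rep(names(by_rank), times = vapply(by_rank, nrow, integer(1))),
        levels = names(by_rank)
    ),
    row.names = rownames(flat)
)

# Splitting along the same factor recovers the per-rank tables.
again <- lapply(
    split(seq_len(nrow(flat)), row_data$taxonomicLevel),
    function(i) flat[i, , drop = FALSE]
)
```

Keeping the originating rank as a factor is what makes the flattened table reversible: the same column can drive a later split.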

Value

agglomerateByRank returns a taxonomically-agglomerated, optionally-pruned object of the same class as x. agglomerateByVariable returns an object of the same class as x with the specified entries merged into one entry in all relevant components.

For agglomerateByRanks: if as.list = TRUE, the SummarizedExperiment objects in a SimpleList; if as.list = FALSE, the SummarizedExperiment passed as x, now containing the SummarizedExperiment objects in its altExps.

For unsplitByRanks: x, with rowData and assay data replaced by the unsplit data. colData of x is kept, and any existing rowTree is dropped, since existing rowLinks are no longer valid.

Examples


### Agglomerate data based on taxonomic information

data(GlobalPatterns)
# print the available taxonomic ranks
colnames(rowData(GlobalPatterns))
taxonomyRanks(GlobalPatterns)

# agglomerate at the Family taxonomic rank
x1 <- agglomerateByRank(GlobalPatterns, rank="Family")
## How many taxa before/after agglomeration?
nrow(GlobalPatterns)
nrow(x1)

# Do not agglomerate the tree
x2 <- agglomerateByRank(
    GlobalPatterns, rank="Family", update.tree = FALSE)
nrow(x2) # same number of rows, but
rowTree(x1) # ... different
rowTree(x2) # ... tree

# If the assay contains binary or negative values, summing might lead to
# meaningless values, and you will get a warning. In these cases, you might
# want to apply the transformation again after agglomerating at the chosen
# taxonomic level.
tse <- transformAssay(GlobalPatterns, method = "pa")
tse <- agglomerateByRank(tse, rank = "Genus")
tse <- transformAssay(tse, method = "pa")

# Empty labels are removed with empty.rm = TRUE, which is the default
sum(is.na(rowData(GlobalPatterns)$Family))
x3 <- agglomerateByRank(GlobalPatterns, rank="Family", empty.rm = TRUE)
nrow(x3) # same as x1 and x2, since empty.rm = TRUE is the default

# Because all the rownames are from the same rank, rownames do not include
# prefixes, in this case "Family:".
print(rownames(x3[1:3,]))

# To add them, use getTaxonomyLabels function.
rownames(x3) <- getTaxonomyLabels(x3, with.rank = TRUE)
print(rownames(x3[1:3,]))

# use 'empty.ranks.rm' to remove columns that include only NAs
x4 <- agglomerateByRank(
    GlobalPatterns, rank="Phylum", empty.ranks.rm = TRUE)
head(rowData(x4))

# If the assay contains NAs, you might want to specify na.rm=TRUE,
# since summing up NAs leads to NA
x5 <- GlobalPatterns
# Replace first value with NA
assay(x5)[1,1] <- NA
x6 <- agglomerateByRank(x5, "Kingdom")
head( assay(x6) )
# Use na.rm=TRUE
x6 <- agglomerateByRank(x5, "Kingdom", na.rm = TRUE)
head( assay(x6) )

## Look at enterotype dataset...
data(enterotype)
## Print the available taxonomic ranks. Shows only 1 available rank,
## not useful for agglomerateByRank
taxonomyRanks(enterotype)

### Merge TreeSummarizedExperiments on rows and columns

data(esophagus)
esophagus
plot(rowTree(esophagus))
# Get a factor for merging
f <- factor(regmatches(rownames(esophagus),
    regexpr("^[0-9]*_[0-9]*",rownames(esophagus))))
merged <- agglomerateByVariable(
    esophagus, by = "rows", f, update.tree = TRUE)
plot(rowTree(merged))
# Merge samples (columns) by sample type
data(GlobalPatterns)
GlobalPatterns
merged <- agglomerateByVariable(
    GlobalPatterns, by = "cols", colData(GlobalPatterns)$SampleType)
merged

data(GlobalPatterns)
# print the available taxonomic ranks
taxonomyRanks(GlobalPatterns)

# agglomerateByRanks: store one agglomerated experiment per rank in altExps
tse <- agglomerateByRanks(GlobalPatterns)
altExps(tse)
altExp(tse,"Kingdom")
altExp(tse,"Species")

# unsplitByRanks
tse <- unsplitByRanks(tse)
tse

FelixErnst/mia documentation built on July 16, 2025, 8:08 p.m.