agglomerate-methods: Agglomerate or merge data using taxonomic information

agglomerate-methods    R Documentation

Agglomerate or merge data using taxonomic information

Description

Agglomeration functions sum up data based on specific criteria such as taxonomic ranks, variables or prevalence.

Usage

agglomerateByRank(x, ...)

agglomerateByVariable(x, ...)

## S4 method for signature 'SummarizedExperiment'
agglomerateByRank(
  x,
  rank = taxonomyRanks(x)[1],
  onRankOnly = FALSE,
  na.rm = FALSE,
  empty.fields = c(NA, "", " ", "\t", "-", "_"),
  ...
)

## S4 method for signature 'SummarizedExperiment'
agglomerateByVariable(x, MARGIN, f, archetype = 1L, ...)

## S4 method for signature 'TreeSummarizedExperiment'
agglomerateByVariable(
  x,
  MARGIN,
  f,
  archetype = 1L,
  mergeTree = FALSE,
  mergeRefSeq = FALSE,
  ...
)

## S4 method for signature 'SingleCellExperiment'
agglomerateByRank(x, ..., altexp = NULL, strip_altexp = TRUE)

## S4 method for signature 'TreeSummarizedExperiment'
agglomerateByRank(
  x,
  ...,
  agglomerate.tree = agglomerateTree,
  agglomerateTree = FALSE
)

agglomerateByPrevalence(x, ...)

## S4 method for signature 'SummarizedExperiment'
agglomerateByPrevalence(
  x,
  rank = taxonomyRanks(x)[1L],
  other_label = "Other",
  ...
)

Arguments

x

a SummarizedExperiment or a TreeSummarizedExperiment

...

arguments passed to the agglomerateByRank method for SummarizedExperiment objects, to agglomerateByVariable and sumCountsAcrossFeatures, and to getPrevalence and getPrevalentTaxa, which are used in agglomerateByPrevalence

  • remove_empty_ranks: A single boolean value for selecting whether to remove those columns of rowData that include only NAs after agglomeration. (By default: remove_empty_ranks = FALSE)

  • make_unique: A single boolean value for selecting whether to make rownames unique. (By default: make_unique = TRUE)

  • detection: Detection threshold for absence/presence. Either an absolute value compared directly to the values of x (as_relative = FALSE) or a relative value between 0 and 1 (as_relative = TRUE).

  • prevalence: Prevalence threshold (in 0 to 1). The required prevalence is strictly greater by default. To include the limit, set include_lowest to TRUE.

  • as.relative: Logical scalar: Should the detection threshold be applied on compositional (relative) abundances? (default: FALSE)

rank

a single character value defining a taxonomic rank. Must be one of the values returned by taxonomyRanks(x).

onRankOnly

TRUE or FALSE: Should only information from the specified rank be used, or also information from the ranks above it? See details. (default: onRankOnly = FALSE)

na.rm

TRUE or FALSE: Should taxa with an empty rank be removed? Use it with caution, since entries that are empty on the selected rank will be dropped. What counts as empty can be adjusted with empty.fields. (default: na.rm = FALSE)

empty.fields

a character vector defining which values should be regarded as empty. (Default: c(NA, "", " ", "\t", "-", "_")). They will be removed before agglomeration if na.rm = TRUE.

MARGIN

A character value for selecting if data is merged row-wise / for features ('rows') or column-wise / for samples ('cols'). Must be 'rows' or 'cols'.

f

A factor for merging. Must be the same length as nrow(x)/ncol(x). Rows/Cols corresponding to the same level will be merged. If length(levels(f)) == nrow(x)/ncol(x), x will be returned unchanged.

archetype

For each level of f, which element should be regarded as the archetype, i.e. whose metadata in the columns or rows is kept while merging? This can be a single integer value or an integer vector of the same length as levels(f). (Default: archetype = 1L, which means the first element encountered per factor level will be kept)

mergeTree

TRUE or FALSE: Should rowTree() also be merged? (Default: mergeTree = FALSE)

mergeRefSeq

TRUE or FALSE: Should a consensus sequence be calculated? If set to FALSE, the result from archetype is returned; If set to TRUE the result from DECIPHER::ConsensusSequence is returned. (Default: mergeRefSeq = FALSE)

altexp

String or integer scalar specifying an alternative experiment containing the input data.

strip_altexp

TRUE or FALSE: Should alternative experiments be removed prior to agglomeration? This prevents too many nested alternative experiments. (default: strip_altexp = TRUE)

agglomerate.tree

TRUE or FALSE: should rowTree() also be agglomerated? (Default: agglomerate.tree = FALSE)

agglomerateTree

alias for agglomerate.tree.

other_label

A single character value used as the label for the summary of non-prevalent taxa. (default: other_label = "Other")

Details

agglomerateByRank can be used to sum up data based on associations with certain taxonomic ranks, as defined in rowData. Only available taxonomyRanks can be used.
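The idea of summing by rank can be illustrated with a minimal base R sketch. It does not call mia; the toy matrix, the taxonomy table and the object names are made up for illustration, and rowsum() stands in for the summation that agglomerateByRank performs on the assays.

```r
# Toy count matrix: 4 taxa (rows) x 2 samples (columns); values are made up.
counts <- matrix(c(1, 2, 3, 4,
                   10, 20, 30, 40),
                 nrow = 4,
                 dimnames = list(paste0("taxon", 1:4), c("S1", "S2")))

# Hypothetical taxonomy table standing in for rowData(x).
taxonomy <- data.frame(
  Phylum = c("Firmicutes", "Firmicutes", "Bacteroidetes", "Bacteroidetes"),
  Family = c("Lachnospiraceae", "Lachnospiraceae", "Bacteroidaceae", NA),
  stringsAsFactors = FALSE
)

# Sum the rows that share the same Phylum label (base R, not mia).
by_phylum <- rowsum(counts, group = taxonomy$Phylum)
by_phylum
```

The result has one row per distinct Phylum label, and each cell holds the sum over the taxa assigned to that phylum.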

agglomerateByVariable merges data on rows or columns of a SummarizedExperiment as defined by a factor alongside the chosen dimension. This function allows agglomeration of data based on variables other than taxonomic ranks. Metadata from the rowData or colData are retained as defined by archetype. Assays are agglomerated, i.e. summed up. If an assay contains values other than counts or absolute values, this can produce meaningless values.
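A rough base R sketch of merging by a factor with an archetype follows. It is not the mia implementation; counts, row_info, keep and the other object names are hypothetical, and the archetype handling only mirrors the description given for the archetype argument.

```r
# Toy data: 4 features x 3 samples, plus one metadata column per feature.
counts <- matrix(1:12, nrow = 4,
                 dimnames = list(paste0("feat", 1:4), paste0("S", 1:3)))
row_info <- data.frame(label = c("a1", "a2", "b1", "b2"),
                       row.names = rownames(counts))

# Factor with one entry per row; rows with the same level are merged.
f <- factor(c("A", "A", "B", "B"))

# Assay values are summed per factor level.
merged_counts <- rowsum(counts, group = f)

# Metadata is taken from the archetype, here the first element of each level.
archetype <- 1L
keep <- vapply(split(seq_along(f), f),
               function(idx) idx[archetype], integer(1))
merged_info <- row_info[keep, , drop = FALSE]

merged_counts
merged_info
```

With archetype = 1L, the metadata of "feat1" and "feat3" is kept, while the assay rows of each level are summed.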

Depending on the available taxonomic data and its structure, setting onRankOnly = TRUE affects the interpretability of your results. If no loops exist (loops meaning two higher ranks containing the same lower rank), the results with onRankOnly = TRUE and onRankOnly = FALSE should be comparable. You can check for loops using detectLoop.
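Why loops matter can be sketched in base R. This is an illustration under assumptions rather than what mia does internally: the taxonomy is invented, grouping on the Family column alone stands in for onRankOnly = TRUE, and grouping on the pasted Phylum and Family labels stands in for using the ranks above as well.

```r
# Invented taxonomy in which family "F1" occurs under two phyla (a loop).
taxonomy <- data.frame(
  Phylum = c("P1", "P1", "P2"),
  Family = c("F1", "F2", "F1"),
  stringsAsFactors = FALSE
)
counts <- matrix(c(5, 7, 11), ncol = 1, dimnames = list(NULL, "S1"))

# Grouping on the Family label only: both "F1" rows collapse into one.
on_rank_only <- rowsum(counts, group = taxonomy$Family)

# Grouping on the path Phylum;Family: the two "F1" rows stay separate.
full_path <- rowsum(counts,
                    group = paste(taxonomy$Phylum, taxonomy$Family, sep = ";"))

on_rank_only
full_path
```

Without the loop, both groupings would yield the same number of rows and the same sums.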

Agglomeration sums up the values of assays at the specified taxonomic level. With certain assays, e.g. those that include binary or negative values, this summing can produce meaningless values. In those cases, consider performing agglomeration first, and then applying the transformation afterwards.
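The effect of the order of operations can be seen in a small base R sketch with made-up counts. The presence/absence step is written by hand here and is only a stand-in for a transformation; it is not a mia function.

```r
# Made-up counts for 3 taxa in 2 samples; taxa 1 and 2 share a genus.
counts <- matrix(c(4, 2, 0,
                   0, 5, 3),
                 nrow = 3,
                 dimnames = list(paste0("taxon", 1:3), c("S1", "S2")))
genus <- c("G1", "G1", "G2")

# Transform first, then sum: the result is no longer binary.
pa_first <- rowsum((counts > 0) * 1, group = genus)

# Sum first, then transform: the result is a valid presence/absence table.
summed <- rowsum(counts, group = genus)
pa_after <- (summed > 0) * 1

pa_first
pa_after
```

In pa_first the entry for G1 in S1 is 2, which has no meaning as a presence/absence value, whereas pa_after contains only 0 and 1.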

agglomerateByVariable works similarly to sumCountsAcrossFeatures. However, it adds support for TreeSummarizedExperiment and uses field-agnostic names. In addition, the archetype argument lets the user select how to preserve row or column data.

Assay data are merged using functions from the scuttle package.

agglomerateByPrevalence sums up the values of assays at the taxonomic level specified by rank (by default the highest taxonomic level available) and selects the summed results that exceed the given population prevalence at the given detection level. The other summed values (below the threshold) are agglomerated in an additional row taking the name indicated by other_label (by default "Other").
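The prevalence step can be sketched in base R on data that is already summed per phylum. The counts, the thresholds and the object names are invented for illustration, and both comparisons are assumed to be strict (greater than), in line with the description of the prevalence argument.

```r
# Invented counts, already summed per phylum: 3 phyla x 4 samples.
counts <- matrix(c(10, 0, 1,
                   8,  0, 0,
                   12, 3, 0,
                   9,  0, 0),
                 nrow = 3,
                 dimnames = list(c("PhylumA", "PhylumB", "PhylumC"),
                                 paste0("S", 1:4)))

detection <- 0           # a taxon is present in a sample if its count exceeds this
prevalence_cutoff <- 0.5 # fraction of samples in which a taxon must be present

# Fraction of samples in which each phylum is detected.
prev <- rowMeans(counts > detection)
is_prevalent <- prev > prevalence_cutoff

# Keep the prevalent rows and sum the remaining rows into "Other".
result <- rbind(counts[is_prevalent, , drop = FALSE],
                Other = colSums(counts[!is_prevalent, , drop = FALSE]))
result
```

Only PhylumA is detected in more than half of the samples, so the result has two rows: PhylumA and Other, the latter holding the column sums of PhylumB and PhylumC.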

Value

agglomerateByRank returns a taxonomically-agglomerated, optionally-pruned object of the same class as x while agglomerateByVariable returns an object of the same class as x with the specified entries merged into one entry in all relevant components.

agglomerateByPrevalence returns a taxonomically-agglomerated object of the same class as x that keeps the prevalent taxa and sums the remaining taxa into a single row.

See Also

sumCountsAcrossFeatures

Examples


### Agglomerate data based on taxonomic information

data(GlobalPatterns)
# print the available taxonomic ranks
colnames(rowData(GlobalPatterns))
taxonomyRanks(GlobalPatterns)

# agglomerate at the Family taxonomic rank
x1 <- agglomerateByRank(GlobalPatterns, rank="Family")
## How many taxa before/after agglomeration?
nrow(GlobalPatterns)
nrow(x1)

# agglomerate the tree as well
x2 <- agglomerateByRank(GlobalPatterns, rank="Family",
                       agglomerate.tree = TRUE)
nrow(x2) # same number of rows, but
rowTree(x1) # ... different
rowTree(x2) # ... tree

# If the assay contains binary or negative values, summing might lead to
# meaningless values, and you will get a warning. In these cases, you might
# want to apply the transformation again after agglomerating at the chosen
# taxonomic level.
tse <- transformAssay(GlobalPatterns, method = "pa")
tse <- agglomerateByRank(tse, rank = "Genus")
tse <- transformAssay(tse, method = "pa")

# removing empty labels by setting na.rm = TRUE
sum(is.na(rowData(GlobalPatterns)$Family))
x3 <- agglomerateByRank(GlobalPatterns, rank="Family", na.rm = TRUE)
nrow(x3) # different from x2

# Because all the rownames are from the same rank, rownames do not include 
# prefixes, in this case "Family:". 
print(rownames(x3[1:3,]))

# To add them, use getTaxonomyLabels function.
rownames(x3) <- getTaxonomyLabels(x3, with_rank = TRUE)
print(rownames(x3[1:3,]))

# use 'remove_empty_ranks' to remove columns that include only NAs
x4 <- agglomerateByRank(GlobalPatterns, rank="Phylum", 
                        remove_empty_ranks = TRUE)
head(rowData(x4))

# If the assay contains NAs, you might want to consider replacing them,
# since summing NAs leads to NA
x5 <- GlobalPatterns
# Replace first value with NA
assay(x5)[1,1] <- NA
x6 <- agglomerateByRank(x5, "Kingdom")
head( assay(x6) )
# Replace NAs with 0. This is justified when we are summing counts.
assay(x5)[ is.na(assay(x5)) ] <- 0
x6 <- agglomerateByRank(x5, "Kingdom")
head( assay(x6) )

## Look at enterotype dataset...
data(enterotype)
## Print the available taxonomic ranks. Shows only 1 available rank,
## not useful for agglomerateByRank
taxonomyRanks(enterotype)

### Merge TreeSummarizedExperiments on rows and columns

data(esophagus)
esophagus
plot(rowTree(esophagus))
# get a factor for merging
f <- factor(regmatches(rownames(esophagus),
                       regexpr("^[0-9]*_[0-9]*",rownames(esophagus))))
merged <- agglomerateByVariable(esophagus, MARGIN = "rows", f, 
                                mergeTree = TRUE)
plot(rowTree(merged))
# merge columns (samples) that share the same sample type
data(GlobalPatterns)
GlobalPatterns
merged <- agglomerateByVariable(GlobalPatterns, MARGIN = "cols", 
                                colData(GlobalPatterns)$SampleType)
merged

## Data can be aggregated based on prevalent taxonomic results
tse <- GlobalPatterns
tse <- agglomerateByPrevalence(tse,
                              rank = "Phylum",
                              detection = 1/100,
                              prevalence = 50/100,
                              as_relative = TRUE)

tse

# Here data is aggregated at the taxonomic level "Phylum". The five phyla
# that exceed the population prevalence threshold of 50/100 represent the 
# five first rows of the assay in the aggregated data. The sixth and last row
# named by default "Other" takes the summed up values of all the other phyla 
# that are below the prevalence threshold.

assay(tse)[,1:5]


FelixErnst/mia documentation built on May 15, 2024, 6:31 a.m.