aggregateToPseudoBulk: Aggregation of single-cell to pseudobulk data
In GabrielHoffman/dreamlet: Scalable differential expression analysis of single cell transcriptomics datasets with complex study designs

aggregateToPseudoBulk

R Documentation

Aggregation of single-cell to pseudobulk data

Description

Aggregation of single-cell to pseudobulk data. Adapted from muscat::aggregateData and has same syntax and results. But can be much faster for SingleCellExperiment backed by H5AD files using on-disk storage.

Usage

aggregateToPseudoBulk(
  x,
  assay = NULL,
  sample_id = NULL,
  cluster_id = NULL,
  fun = c("sum", "mean", "median", "prop.detected", "num.detected", "sem", "number"),
  scale = FALSE,
  verbose = TRUE,
  BPPARAM = SerialParam(progressbar = verbose),
  checkValues = TRUE,
  h5adBlockSizes = 1e+09
)

Arguments

`x`	a `SingleCellExperiment`.
`assay`	character string specifying the assay slot to use as input data. Defaults to the 1st available (`assayNames(x)[1]`).
`sample_id`	character string specifying which variable to use as sample id
`cluster_id`	character string specifying which variable to use as cluster id
`fun`	a character string. Specifies the function to use as summary statistic. Passed to `summarizeAssayByGroup2`.
`scale`	logical. Should pseudo-bulks be scaled with the effective library size & multiplied by 1M?
`verbose`	logical. Should information on progress be reported?
`BPPARAM`	a `BiocParallelParam` object specifying how aggregation should be parallelized.
`checkValues`	logical. Should we check that signal values are positive integers?
`h5adBlockSizes`	set the automatic block size block size (in bytes) for DelayedArray to read an H5AD file. Larger values use more memory but are faster.

Details

Adapted from muscat::aggregateData and has similar syntax and same results. This is much faster for SingleCellExperiment backed by H5AD files using DelayedMatrix because this summarizes counts using DelayedMatrixStats. But this function also includes optmizations for sparseMatrix used by Seurat by using sparseMatrixStats.

Keeps variables from colData() that are constant within sample_id. For example, sex will be constant for all cells from the same sample_id, so it is retained as a variable in the pseudobulk result. But number of expressed genes varies across cells within each sample_id, so it is dropped from colData(). Instead the mean value per cell type is stored in metadata(pb)$aggr_means, and these can be included in regression formulas downstream. In that case, the value of the covariates used per sample will depend on the cell type analyzed.

Value

a SingleCellExperiment.

Aggregation parameters (assay, by, fun, scaled) are stored in metadata()$agg_pars, where by = c(cluster_id, sample_id). The number of cells that were aggregated are accessible in int_colData()$n_cells.

Author(s)

Gabriel Hoffman, Helena L Crowell & Mark D Robinson

References

Crowell, HL, Soneson, C, Germain, P-L, Calini, D, Collin, L, Raposo, C, Malhotra, D & Robinson, MD: Muscat detects subpopulation-specific state transitions from multi-sample multi-condition single-cell transcriptomics data. Nature Communications 11(1):6077 (2020). doi: https://doi.org/10.1038/s41467-020-19894-4

Examples

library(muscat)
library(SingleCellExperiment)

data(example_sce)

# create pseudobulk for each sample and cell cluster
pb <- aggregateToPseudoBulk(example_sce,
  assay = "counts",
  cluster_id = "cluster_id",
  sample_id = "sample_id",
  verbose = FALSE
)

# pseudobulk data from each cell type
# is stored as its own assay
pb

# aggregate by cluster only,
# collapsing all samples into the same pseudobulk
pb2 <- aggregateToPseudoBulk(example_sce, 
 cluster_id = "cluster_id", 
 verbose = FALSE)

pb2
#

GabrielHoffman/dreamlet documentation built on April 5, 2025, 5:07 a.m.