aggregateReference: Aggregate reference samples


aggregateReference R Documentation



Aggregate reference samples for a given label by averaging their count profiles. This can be done with varying degrees of resolution to preserve the within-label heterogeneity.


aggregateReference(
  ref,
  labels,
  ncenters = NULL,
  power = 0.5,
  ntop = 1000,
  assay.type = "logcounts",
  rank = 20,
  subset.row = NULL,
  check.missing = TRUE,
  BPPARAM = SerialParam(),
  BSPARAM = bsparam()
)



ref: A numeric matrix of reference expression values, usually containing log-expression values. Alternatively, a SummarizedExperiment object containing such a matrix.


labels: A character vector or factor of known labels for all cells in ref.


ncenters: Integer scalar specifying the maximum number of aggregated profiles to produce for each label.


power: Numeric scalar between 0 and 1 indicating how much aggregation should be performed, see Details.


ntop: Integer scalar specifying the number of highly variable genes to use for the PCA step.


assay.type: An integer scalar or string specifying the assay of ref containing the relevant expression matrix, if ref is a SummarizedExperiment object.


rank: Integer scalar specifying the number of principal components to use during clustering.


subset.row: Integer, character or logical vector indicating the rows of ref to use for k-means clustering.


check.missing: Logical scalar indicating whether rows should be checked for missing values (and if found, removed).


BPPARAM: A BiocParallelParam object indicating how parallelization should be performed.


BSPARAM: A BiocSingularParam object indicating which SVD algorithm should be used in runPCA.


With single-cell reference datasets, it is often useful to aggregate individual cells into pseudo-bulk samples to serve as a reference. This improves speed in downstream assignment with classifySingleR or SingleR. The most obvious aggregation is to simply average all counts for all cells in a label to obtain a single pseudo-bulk profile. However, this discards information about the within-label heterogeneity (e.g., the “shape” and spread of the population in expression space) that may be informative for assignment, especially for closely related labels.

The default approach in this function is to create a series of pseudo-bulk samples to represent each label. This is achieved by performing vector quantization via k-means clustering on all cells in a particular label. Cells in each cluster are subsequently averaged to create one pseudo-bulk sample that serves as a representative for that location in the expression space. This reduces the number of separate observations (for speed) while preserving some level of population heterogeneity (for fidelity).

The number of pseudo-bulk samples per label is controlled by ncenters. By default (ncenters=NULL), we set the number of clusters to X^power, where X is the number of cells for that label. This ensures that labels with more cells have more resolved representatives. If ncenters is greater than the number of cells for a label and/or power=1, no aggregation is performed. Setting power=0 will aggregate all cells of a label into a single pseudo-bulk profile.
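
As a rough illustration of this scaling, the base R sketch below computes X^power for a few made-up label sizes. The label sizes are hypothetical, and rounding the result down with floor() is an assumption made here for illustration, not a documented behaviour of aggregateReference.

```r
# Hypothetical number of cells per label (not from any real dataset).
n.cells <- c(A = 30, B = 110, C = 450)

# Assumption: fractional cluster counts are rounded down.
centers.default <- floor(n.cells^0.5)  # power = 0.5
centers.single  <- floor(n.cells^0)    # power = 0: one profile per label
centers.none    <- floor(n.cells^1)    # power = 1: as many "clusters" as cells

print(rbind(n.cells, centers.default, centers.single, centers.none))
```

Larger labels receive more representatives, but the growth is sublinear for any power below 1, which is what keeps the aggregated reference small.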

In practice, k-means clustering is performed on the first rank principal components as computed using runPCA. The use of PCs compacts the data for more efficient operation of kmeans; it also removes some of the high-dimensional noise to highlight major factors of within-label heterogeneity. Note that the PCs are only used for clustering and the full expression profiles are still used for the final averaging. Users can disable the PCA step by setting rank=Inf.
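
The base R sketch below mimics this procedure for the cells of a single label: PCA, k-means on the leading PCs, then averaging of the full profiles within each cluster. It uses prcomp and kmeans from the stats package in place of runPCA, and the simulated matrix and all variable names are hypothetical, so it should be read as an approximation of the idea and not as the function's actual implementation.

```r
set.seed(100)

# Hypothetical log-expression matrix for one label: 200 genes x 60 cells.
expr <- matrix(rnorm(200 * 60), nrow = 200,
               dimnames = list(paste0("gene", 1:200), NULL))

n.pcs <- 5                            # plays the role of 'rank'
n.centers <- floor(ncol(expr)^0.5)    # assumed rounding of X^power

# PCA with cells as observations; keep only the first few PCs.
pcs <- prcomp(t(expr), rank. = n.pcs)$x

# Cluster the cells in the reduced PC space.
clusters <- kmeans(pcs, centers = n.centers)$cluster

# Average the full expression profiles (all genes) within each cluster.
aggr <- sapply(split(seq_len(ncol(expr)), clusters),
               function(idx) rowMeans(expr[, idx, drop = FALSE]))

dim(aggr)  # genes x pseudo-bulk samples
```

The clustering only decides which cells are grouped together; the averaged profiles still cover every gene in the input matrix.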

By default, we speed things up by only using the top ntop genes with the largest variances in the PCA. Further subsetting of the matrix prior to the PCA can be achieved by setting subset.row to an appropriate indexing vector. This option may be useful for clustering based on known genes of interest while retaining all genes in the aggregated results. (If both options are set, subsetting by subset.row is done first, and then the top ntop genes are selected.) In both cases, though, the aggregation is performed on the full expression profiles.
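
The order of the two gene filters can be sketched in base R as follows. The matrix, the choice of subset.row and the use of a plain per-gene variance are assumptions for illustration; the function may rank genes differently internally.

```r
set.seed(42)

# Hypothetical log-expression matrix: 50 genes x 20 cells.
expr <- matrix(rnorm(50 * 20), nrow = 50,
               dimnames = list(paste0("gene", 1:50), NULL))

subset.row <- 1:30   # hypothetical genes of interest
ntop <- 10

# Step 1: restrict to the rows named by subset.row.
sub <- expr[subset.row, , drop = FALSE]

# Step 2: among those, keep the ntop genes with the largest variances.
vars <- apply(sub, 1, var)
keep <- names(sort(vars, decreasing = TRUE))[seq_len(min(ntop, length(vars)))]

# Only 'keep' would feed the PCA and clustering;
# the averaging step would still use every row of 'expr'.
keep
```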

We use the average rather than the sum in order to be compatible with trainSingleR's internal marker detection. Moreover, unlike counts, the sum of transformed and normalized expression values generally has little meaning. We do not use the median to avoid consistently obtaining zeros for lowly expressed genes.
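
A small base R example, using made-up values for a lowly expressed gene, shows why the median is a poor summary here: when most cells have zero expression, the median is zero regardless of the cells that do express the gene, whereas the mean retains that signal.

```r
# Hypothetical log-expression values of one lowly expressed gene in 10 cells.
x <- c(0, 0, 0, 0, 1.2, 0, 0.8, 0, 0, 0)

gene.mean <- mean(x)      # reflects the two expressing cells
gene.median <- median(x)  # zero, because most cells have zero expression

c(mean = gene.mean, median = gene.median)
```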


A SummarizedExperiment object with a "logcounts" assay containing a matrix of aggregated expression values, and a label column metadata field specifying the label corresponding to each column.


Aaron Lun


library(SingleR)
library(scuttle)  # provides mockSCE() and logNormCounts()
sce <- mockSCE()
sce <- logNormCounts(sce)

# Making up some labels for demonstration purposes:
labels <- sample(LETTERS, ncol(sce), replace=TRUE)

# Aggregation at different resolutions:
(aggr <- aggregateReference(sce, labels, power=0.5))

# A single pseudo-bulk profile per label:
(aggr <- aggregateReference(sce, labels, power=0))

# No aggregation:
(aggr <- aggregateReference(sce, labels, power=1))

LTLA/SingleR documentation built on July 30, 2022, 4:11 a.m.