View source: R/aggregateReference.R
aggregateReference | R Documentation |
Aggregate reference samples for a given label by averaging their count profiles. This can be done with varying degrees of resolution to preserve the within-label heterogeneity.
aggregateReference(
ref,
labels,
ncenters = NULL,
power = 0.5,
ntop = 1000,
assay.type = "logcounts",
rank = 20,
subset.row = NULL,
check.missing = TRUE,
num.threads = bpnworkers(BPPARAM),
BPPARAM = SerialParam(),
BSPARAM = NULL
)
ref |
A numeric matrix of reference expression values, usually containing log-expression values. Alternatively, a SummarizedExperiment object containing such a matrix. |
labels |
A character vector or factor of known labels for all cells in ref. |
ncenters |
Integer scalar specifying the maximum number of aggregated profiles to produce for each label. |
power |
Numeric scalar between 0 and 1 indicating how much aggregation should be performed, see Details. |
ntop |
Integer scalar specifying the number of highly variable genes to use for the PCA step. |
assay.type |
An integer scalar or string specifying the assay of ref containing the expression matrix, if ref is a SummarizedExperiment object. |
rank |
Integer scalar specifying the number of principal components to use during clustering. |
subset.row |
Integer, character or logical vector indicating the rows of ref to use for the PCA and clustering. |
check.missing |
Logical scalar indicating whether rows should be checked for missing values (and if found, removed). |
num.threads |
Integer scalar specifying the number of threads to use. |
BPPARAM |
Deprecated, use num.threads instead. |
BSPARAM |
Deprecated and ignored. |
With single-cell reference datasets, it is often useful to aggregate individual cells into pseudo-bulk samples to serve as a reference.
This improves speed in downstream assignment with classifySingleR or SingleR.
The most obvious aggregation is to simply average all counts for all cells in a label to obtain a single pseudo-bulk profile.
However, this discards information about the within-label heterogeneity (e.g., the “shape” and spread of the population in expression space)
that may be informative for assignment, especially for closely related labels.
The default approach in this function is to create a series of pseudo-bulk samples to represent each label. This is achieved by performing vector quantization via k-means clustering on all cells in a particular label. Cells in each cluster are subsequently averaged to create one pseudo-bulk sample that serves as a representative for that location in the expression space. This reduces the number of separate observations (for speed) while preserving some level of population heterogeneity (for fidelity).
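The vector quantization step described above can be sketched with base R alone. This is a minimal illustration, not the function's actual implementation: aggregate_label is a hypothetical helper, and stats::kmeans stands in for whatever clustering routine the package uses internally.

```r
# Minimal sketch of per-label vector quantization with base R only.
# 'aggregate_label' is a hypothetical helper, not part of SingleR.
aggregate_label <- function(mat, k) {
    # mat: genes-by-cells matrix holding the cells of ONE label.
    # kmeans() clusters rows, so transpose to put cells in rows.
    cl <- stats::kmeans(t(mat), centers = k)$cluster
    # Average the full profiles of the cells in each cluster.
    vapply(
        split(seq_len(ncol(mat)), cl),
        function(idx) rowMeans(mat[, idx, drop = FALSE]),
        numeric(nrow(mat))
    )
}

set.seed(100)
mat <- matrix(rnorm(50 * 40), nrow = 50)  # 50 genes, 40 cells (simulated)
pseudo <- aggregate_label(mat, k = 4)
dim(pseudo)  # 50 genes by 4 pseudo-bulk samples
```

Each column of pseudo is the mean profile of one cluster, so the 40 cells are represented by 4 pseudo-bulk samples.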
The number of pseudo-bulk samples per label is controlled by ncenters.
By default, we set the number of clusters to X^power, where X is the number of cells for that label.
This ensures that labels with more cells have more resolved representatives.
If ncenters is greater than the number of samples for a label and/or power=1, no aggregation is performed.
Setting power=0 will aggregate all cells of a label into a single pseudo-bulk profile.
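As a rough illustration of how the number of clusters follows from ncenters and power, consider the sketch below. choose_ncenters is a hypothetical helper, and both rounding down and capping at the number of cells are assumptions made for this example rather than documented behaviour.

```r
# Hypothetical helper: how many pseudo-bulk samples to make for a label
# with 'n' cells. Rounding down is an assumption made for this sketch.
choose_ncenters <- function(n, ncenters = NULL, power = 0.5) {
    if (is.null(ncenters)) {
        ncenters <- floor(n^power)
    }
    # Never request more clusters than cells; at this limit nothing is aggregated.
    min(n, ncenters)
}

choose_ncenters(1000)               # 31
choose_ncenters(1000, power = 0)    # 1, a single pseudo-bulk profile
choose_ncenters(1000, power = 1)    # 1000, no aggregation
choose_ncenters(5, ncenters = 20)   # 5, no aggregation
```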
In practice, k-means clustering is actually performed on the first rank principal components as computed using runPca.
The use of PCs compacts the data for more efficient operation of clusterKmeans; it also removes some of the high-dimensional noise to highlight major factors of within-label heterogeneity.
Note that the PCs are only used for clustering and the full expression profiles are still used for the final averaging.
Users can disable the PCA step by setting rank=Inf.
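The separation between clustering in PC space and averaging in the original space can be sketched as follows. This example substitutes base R's stats::prcomp and stats::kmeans for runPca and clusterKmeans, so it illustrates the idea only; the data are simulated and the choice of 5 clusters is arbitrary.

```r
set.seed(42)
mat <- matrix(rnorm(200 * 60), nrow = 200)  # 200 genes, 60 cells of one label
rank <- 20

# PCA on cells (rows of the transposed matrix); keep the first 'rank' PCs.
pcs <- stats::prcomp(t(mat), rank. = rank)$x

# Cluster in PC space only...
cl <- stats::kmeans(pcs, centers = 5)$cluster

# ...but average the original, full-dimensional profiles.
pseudo <- vapply(
    split(seq_len(ncol(mat)), cl),
    function(idx) rowMeans(mat[, idx, drop = FALSE]),
    numeric(nrow(mat))
)
dim(pcs)     # 60 cells by 20 PCs
dim(pseudo)  # 200 genes by 5 pseudo-bulk samples
```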
By default, we speed things up by only using the top ntop genes with the largest variances in the PCA, as identified with modelGeneVariances.
Further subsetting of the matrix prior to the PCA can be achieved by setting subset.row to an appropriate indexing vector.
This option may be useful for clustering based on known genes of interest but retaining all genes in the aggregated results.
(If both options are set, subsetting by subset.row is done first, and then the top ntop genes are selected.)
In both cases, though, the aggregation is performed on the full expression profiles.
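The order of the two gene-selection steps can be illustrated with a short sketch. It ranks genes by their raw per-gene variance, which is a simplification assumed here in place of modelGeneVariances; the matrix, gene names and the subset.row value are all made up for the example.

```r
set.seed(1)
mat <- matrix(rnorm(500 * 30), nrow = 500,
              dimnames = list(paste0("gene", 1:500), NULL))
subset.row <- 1:300   # hypothetical genes of interest
ntop <- 100

# Step 1: restrict to subset.row.
candidates <- mat[subset.row, , drop = FALSE]

# Step 2: keep the ntop most variable of the remaining genes.
vars <- apply(candidates, 1, stats::var)
chosen <- head(names(sort(vars, decreasing = TRUE)), ntop)

# 'chosen' feeds the PCA and clustering; 'mat' (all genes) is still averaged.
for_pca <- mat[chosen, , drop = FALSE]
dim(for_pca)  # 100 genes by 30 cells
```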
We use the average rather than the sum in order to be compatible with trainSingleR's internal marker detection.
Moreover, unlike counts, the sum of transformed and normalized expression values generally has little meaning.
We do not use the median to avoid consistently obtaining zeros for lowly expressed genes.
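A toy example shows why the median is a poor summary for lowly expressed genes. The values below are invented log-expression values for a single gene across ten cells of one cluster.

```r
# Log-expression of one lowly expressed gene across 10 cells: mostly zeros.
x <- c(0, 0, 0, 0, 0, 0, 0, 1.2, 0.8, 2.0)

mean(x)    # 0.4: retains the signal from the few expressing cells
median(x)  # 0: the signal is lost
```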
A SummarizedExperiment object with a "logcounts" assay containing a matrix of aggregated expression values, and a label column metadata field specifying the label corresponding to each column.
Aaron Lun
library(SingleR)
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)
# Making up some labels for demonstration purposes:
labels <- sample(LETTERS, ncol(sce), replace=TRUE)
# Aggregation at different resolutions:
(aggr <- aggregateReference(sce, labels, power=0.5))
(aggr <- aggregateReference(sce, labels, power=0))
# No aggregation:
(aggr <- aggregateReference(sce, labels, power=1))