View source: R/aggregateReference.R

aggregateReference | R Documentation |

Aggregate reference samples for a given label by averaging their count profiles. This can be done with varying degrees of resolution to preserve the within-label heterogeneity.

```
aggregateReference(
ref,
labels,
ncenters = NULL,
power = 0.5,
ntop = 1000,
assay.type = "logcounts",
rank = 20,
subset.row = NULL,
check.missing = TRUE,
BPPARAM = SerialParam(),
BSPARAM = bsparam()
)
```

`ref` |
A numeric matrix of reference expression values, usually containing log-expression values. Alternatively, a SummarizedExperiment object containing such a matrix. |

`labels` |
A character vector or factor of known labels for all cells in `ref`. |

`ncenters` |
Integer scalar specifying the maximum number of aggregated profiles to produce for each label. |

`power` |
Numeric scalar between 0 and 1 indicating how much aggregation should be performed, see Details. |

`ntop` |
Integer scalar specifying the number of highly variable genes to use for the PCA step. |

`assay.type` |
An integer scalar or string specifying the assay of `ref` containing the relevant expression matrix, if `ref` is a SummarizedExperiment object. |

`rank` |
Integer scalar specifying the number of principal components to use during clustering. |

`subset.row` |
Integer, character or logical vector indicating the rows of `ref` to use for clustering. |

`check.missing` |
Logical scalar indicating whether rows should be checked for missing values (and if found, removed). |

`BPPARAM` |
A BiocParallelParam object indicating how parallelization should be performed. |

`BSPARAM` |
A BiocSingularParam object indicating which SVD algorithm should be used in `runPCA`. |

With single-cell reference datasets, it is often useful to aggregate individual cells into pseudo-bulk samples to serve as a reference.
This improves speed in downstream assignment with `classifySingleR` or `SingleR`.
The most obvious aggregation is to simply average all counts for all cells in a label to obtain a single pseudo-bulk profile.
However, this discards information about the within-label heterogeneity (e.g., the “shape” and spread of the population in expression space)
that may be informative for assignment, especially for closely related labels.

The default approach in this function is to create a series of pseudo-bulk samples to represent each label. This is achieved by performing vector quantization via k-means clustering on all cells in a particular label. Cells in each cluster are subsequently averaged to create one pseudo-bulk sample that serves as a representative for that location in the expression space. This reduces the number of separate observations (for speed) while preserving some level of population heterogeneity (for fidelity).
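
The idea can be sketched in base R as follows. This is a minimal illustration on simulated data, not the function's actual implementation; the matrix `expr`, the cluster count `k` and the other variable names are made up for the example.

```r
set.seed(100)
# Hypothetical toy data: 50 genes x 200 cells, all belonging to one label.
expr <- matrix(rnorm(50 * 200), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("cell", 1:200)))

# kmeans() clusters rows, so cells (columns) are transposed into rows.
k <- 5
clusters <- kmeans(t(expr), centers = k)$cluster

# Average the profiles of the cells in each cluster to obtain
# one pseudo-bulk profile per cluster.
pseudo <- vapply(
    split(seq_len(ncol(expr)), clusters),
    function(idx) rowMeans(expr[, idx, drop = FALSE]),
    numeric(nrow(expr))
)
dim(pseudo)  # 50 genes by 5 pseudo-bulk samples
```

Each column of `pseudo` stands in for one region of the expression space, so 200 cells are represented by 5 profiles.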

The number of pseudo-bulk samples per label is controlled by `ncenters`.
By default (`ncenters = NULL`), we set the number of clusters to `X^power`, where `X` is the number of cells for that label.
This ensures that labels with more cells have more resolved representatives.
If `ncenters` is greater than the number of samples for a label and/or `power=1`, no aggregation is performed.
Setting `power=0` will aggregate all cells of a label into a single pseudo-bulk profile.
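
To get a feel for how `power` and `ncenters` interact, the rule can be written as a small helper. `n_profiles` is a hypothetical function, not part of SingleR, and the rounding to the nearest integer is an assumption; the function itself may round differently.

```r
# Hypothetical helper: number of pseudo-bulk profiles for a label with X cells.
n_profiles <- function(X, power = 0.5, ncenters = NULL) {
    if (is.null(ncenters)) {
        # Assumption: round to the nearest integer, with at least one cluster.
        ncenters <- pmax(1, round(X^power))
    }
    # There cannot be more profiles than cells; at this limit,
    # every cell is its own profile and nothing is aggregated.
    pmin(ncenters, X)
}

n_profiles(c(10, 100, 10000))                 # 3 10 100
n_profiles(c(10, 100, 10000), power = 0)      # 1 1 1
n_profiles(c(10, 100, 10000), power = 1)      # 10 100 10000
n_profiles(c(10, 100, 10000), ncenters = 50)  # 10 50 50
```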

In practice, k-means clustering is actually performed on the first `rank` principal components as computed using `runPCA`.
The use of PCs compacts the data for more efficient operation of `kmeans`; it also removes some of the high-dimensional noise to highlight major factors of within-label heterogeneity.
Note that the PCs are only used for clustering and the full expression profiles are still used for the final averaging.
Users can disable the PCA step by setting `rank=Inf`.
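
The separation between clustering in PC space and averaging in the full expression space can be sketched as below. This uses base R's `prcomp` as a stand-in for `runPCA`, which is an assumption made to keep the example dependency-free; the data and variable names are invented.

```r
set.seed(42)
expr <- matrix(rnorm(500 * 120), nrow = 500)  # 500 genes x 120 cells, one label
rank <- 20

# prcomp() expects observations in rows; keep only the first 'rank' PCs.
pcs <- prcomp(t(expr), rank. = rank)$x

# Cluster the cells in the low-dimensional PC space ...
clusters <- kmeans(pcs, centers = round(sqrt(ncol(expr))))$cluster

# ... but average the full expression profiles, not the PCs.
pseudo <- vapply(
    split(seq_len(ncol(expr)), clusters),
    function(idx) rowMeans(expr[, idx, drop = FALSE]),
    numeric(nrow(expr))
)
dim(pseudo)  # all 500 genes are retained in the aggregated profiles
```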

By default, we speed things up by only using the top `ntop` genes with the largest variances in the PCA.
Further subsetting of the matrix prior to the PCA can be achieved by setting `subset.row` to an appropriate indexing vector.
This option may be useful for clustering based on known genes of interest while retaining all genes in the aggregated results.
(If both options are set, subsetting by `subset.row` is done first, and then the top `ntop` genes are selected.)
In both cases, though, the aggregation is performed on the full expression profiles.
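
The order of the two feature selection steps can be illustrated in base R. This is only a sketch of the logic described above; the toy matrix and the choice of "genes of interest" are hypothetical.

```r
set.seed(1)
expr <- matrix(rnorm(200 * 30), nrow = 200,
               dimnames = list(paste0("gene", 1:200), NULL))
subset.row <- paste0("gene", 1:100)  # hypothetical genes of interest
ntop <- 20

# Step 1: restrict to the user-specified rows.
candidates <- expr[subset.row, , drop = FALSE]

# Step 2: among those, keep the 'ntop' genes with the largest variances.
vars <- apply(candidates, 1, var)
chosen <- head(names(sort(vars, decreasing = TRUE)), ntop)

# 'chosen' would only drive the PCA and clustering;
# averaging would still use every row of 'expr'.
length(chosen)
```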

We use the average rather than the sum in order to be compatible with `trainSingleR`'s internal marker detection.
Moreover, unlike counts, the sum of transformed and normalized expression values generally has little meaning.
We do not use the median to avoid consistently obtaining zeros for lowly expressed genes.
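
A small numeric example shows the difference between the three summaries. The values are invented log-expression values for a lowly expressed gene that is detected in only 3 of the 10 cells in a cluster.

```r
# Hypothetical log-expression values for one gene across 10 cells.
x <- c(0, 0, 0, 0, 0, 0, 0, 1.2, 0.8, 1.5)

mean(x)    # 0.35: small but non-zero, so the signal is retained
median(x)  # 0: the signal is lost entirely
sum(x)     # 3.5: scales with the number of cells in the cluster
```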

A SummarizedExperiment object with a `"logcounts"` assay containing a matrix of aggregated expression values, and a `label` column metadata field specifying the label corresponding to each column.

Aaron Lun

```
library(SingleR)
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)
# Making up some labels for demonstration purposes:
labels <- sample(LETTERS, ncol(sce), replace=TRUE)
# Aggregation at different resolutions:
(aggr <- aggregateReference(sce, labels, power=0.5))
(aggr <- aggregateReference(sce, labels, power=0))
# No aggregation:
(aggr <- aggregateReference(sce, labels, power=1))
```
