trainSingleR: Train the SingleR classifier
In LTLA/SingleR: Reference-Based Single-Cell RNA-Seq Annotation

trainSingleR

R Documentation

Train the SingleR classifier

Description

Train the SingleR classifier on one or more reference datasets with known labels.

Usage

trainSingleR(
  ref,
  labels,
  test.genes = NULL,
  genes = "de",
  sd.thresh = NULL,
  de.method = c("classic", "wilcox", "t"),
  de.n = NULL,
  de.args = list(),
  aggr.ref = FALSE,
  aggr.args = list(),
  recompute = TRUE,
  restrict = NULL,
  assay.type = "logcounts",
  check.missing = TRUE,
  approximate = FALSE,
  num.threads = bpnworkers(BPPARAM),
  BNPARAM = NULL,
  BPPARAM = SerialParam()
)

Arguments

`ref`	A numeric matrix of expression values where rows are genes and columns are reference samples (individual cells or bulk samples). Each row should be named with the gene name. In general, the expression values are expected to be normalized and log-transformed, see Details. Alternatively, a SummarizedExperiment object containing such a matrix. Alternatively, a list or List of SummarizedExperiment objects or numeric matrices containing multiple references.
`labels`	A character vector or factor of known labels for all samples in `ref`. Alternatively, if `ref` is a list, `labels` should be a list of the same length. Each element should contain a character vector or factor specifying the labels for the columns of the corresponding element of `ref`.
`test.genes`	Character vector of the names of the genes in the test dataset, i.e., the row names of `test` in `classifySingleR`. If `NULL`, it is assumed that the test dataset and `ref` have the same genes in the same row order.
`genes`	A string containing `"de"`, indicating that markers should be calculated from `ref`. For back compatibility, other string values are allowed but will be ignored with a deprecation warning. Alternatively, if `ref` is not a list, `genes` can be either: A list of lists of character vectors containing DE genes between pairs of labels. A list of character vectors containing marker genes for each label. If `ref` is a list, `genes` can be a list of length equal to `ref`. Each element of the list should be one of the two above choices described for non-list `ref`, containing markers for labels in the corresponding entry of `ref`.
`sd.thresh`	Deprecated and ignored.
`de.method`	String specifying how DE genes should be detected between pairs of labels. Defaults to `"classic"`, which sorts genes by the log-fold changes and takes the top `de.n`. Other options are `"wilcox"` and `"t"`, see Details. Ignored if `genes` is a list of markers/DE genes.
`de.n`	An integer scalar specifying the number of DE genes to use when `genes="de"`. If `de.method="classic"`, defaults to `500 * (2/3) ^ log2(N)` where `N` is the number of unique labels. Otherwise, defaults to 10. Ignored if `genes` is a list of markers/DE genes.
`de.args`	Named list of additional arguments to pass to `scoreMarkers` when `de.method="wilcox"` or `"t"`. Ignored if `genes` is a list of markers/DE genes.
`aggr.ref`	Logical scalar indicating whether references should be aggregated to pseudo-bulk samples for speed, see `aggregateReference`.
`aggr.args`	Further arguments to pass to `aggregateReference` when `aggr.ref=TRUE`.
`recompute`	Deprecated and ignored.
`restrict`	A character vector of gene names to use for marker selection. By default, all genes in `ref` are used.
`assay.type`	An integer scalar or string specifying the assay of `ref` containing the relevant expression matrix, if `ref` is a SummarizedExperiment object (or is a list that contains one or more such objects).
`check.missing`	Logical scalar indicating whether rows should be checked for missing values. If true and any missing values are found, the rows containing these values are silently removed.
`approximate`	Deprecated, use `BNPARAM` instead.
`num.threads`	Integer scalar specifying the number of threads to use for index building.
`BNPARAM`	A BiocNeighborParam object specifying how the neighbor search index should be constructed.
`BPPARAM`	A BiocParallelParam object specifying how parallelization should be performed when `check.missing = TRUE`.

Details

This function uses a training data set to select interesting features and construct nearest neighbor indices in rank space. The resulting objects can be re-used across multiple classification steps with different test data sets via classifySingleR. This improves efficiency by avoiding unnecessary repetition of steps during the downstream analysis.

The automatic marker detection (genes="de") identifies genes that are differentially expressed between pairs of labels in the reference dataset. The expression values are expected to be log-transformed and normalized. For each pair of labels, the top de.n genes with strongest upregulation in one label are chosen as markers to distinguish it from the other label. The exact ranking depends on the de.method= argument:

The default de.method="classic" will use getClassicMarkers to compute the median expression for each label and each gene. Then, for each pair of labels, the top de.n genes with the largest positive differences are chosen as markers to distinguish the first label from the second. This is intended for reference datasets derived from bulk transcriptomic data (e.g., microarrays) with a high density of non-zero values. It is less effective for single-cell data, where it is not uncommon to have more than 50% zero counts for a given gene such that the median is also zero for each group.
de.method="wilcox" will rank genes based on the area under the curve (AUC) in each pairwise comparison between labels. The top de.n genes with the largest AUCs above 0.5 are chosen as markers for the first label compared to the second. This is analogous to ranking on significance in the Wilcoxon ranked sum test and is intended for use with single-cell data. The exact calculaton is performed using the scoreMarkers function.
de.method="t" will rank genes on the Cohen's d in each pairiwse comparison. The top de.n genes with the largest positive Cohen's d are chosen as markers for the first label compared to the second. This is roughly analogous to ranking on significance in the t-test and is faster than the AUC. The exact calculaton is performed using the scoreMarkers function.

Alternatively, users can detect markers externally and pass a list of markers to genes (see “Custom gene specification”).

Classification with classifySingleR assumes that the test dataset contains all marker genes that were detected from the reference. If the test and reference datasets do not have the same genes in the same order, we can set test.genes to the row names of the test dataset. This will instruct trainSingleR to only consider markers that are present in the test dataset. Any subsequent call to classifySingleR will also check that test.genes is consistent with rownames(test).

On a similar note, if restrict is specified, marker selection will only be performed using the specified subset of genes. This can be convenient for ignoring inappropriate genes like pseudogenes or predicted genes. It has the same effect as filtering out undesirable rows from ref prior to calling trainSingleR. Unlike test.genes, setting restrict does not introduce further checks on rownames(test) in classifySingleR.

Value

For a single reference, a List is returned containing:

built:: An external pointer to various indices in C++ space. Note that this cannot be serialized and should be removed prior to any saveRDS step.
ref:: The reference expression matrix. This may have fewer columns than the input ref if aggr.ref = TRUE.
markers:: A list containing unique, a character vector of all marker genes used in training; and full, a list of list of character vectors containing the markers for each pairwise comparison between labels.
labels:: A list containing unique, a character vector of all unique reference labels; and full, a character vector containing the assigned label for each column in ref.

For multiple references, a List of Lists is returned where each internal List corresponds to a reference in ref and has the same structure as described above.

Custom gene specification

Rather than relying on the in-built feature selection, users can pass in their own features of interest to genes. The function expects a named list of named lists of character vectors, with each vector containing the DE genes between a pair of labels. For example:

genes <- list(
   A = list(A = character(0), B = "GENE_1", C = c("GENE_2", "GENE_3")),
   B = list(A = "GENE_100", B = character(0), C = "GENE_200"),
   C = list(A = c("GENE_4", "GENE_5"), B = "GENE_5", C = character(0))
)

If we consider the entry genes$A$B, this contains marker genes for label "A" against label "B". That is, these genes are upregulated in "A" compared to "B". The outer list should have one list per label, and each inner list should have one character vector per label. (Obviously, a label cannot have markers against itself, so this is just set to character(0).)

Alternatively, genes can be a named list of character vectors containing per-label markers. For example:

genes <- list(
     A = c("GENE_1", "GENE_2", "GENE_3"),
     B = c("GENE_100", "GENE_200"),
     C = c("GENE_4", "GENE_5")
)

The entry genes$A represent the genes that are upregulated in A compared to some or all other labels. This allows the function to handle pre-defined marker lists for specific cell populations. However, it obviously captures less information than marker sets for the pairwise comparisons.

If genes is manually passed, ref can contain the raw counts or any monotonic transformation thereof. There is no need to supply (log-)normalized expression values for the benefit of the automatic marker detection. Similarly, for manual genes, the values of de.method, de.n and sd.thresh have no effect.

Check out the Examples to see how manual genes can be passed to trainSingleR.

Dealing with multiple references

The default SingleR policy for dealing with multiple references is to perform the classification for each reference separately and combine the results (see ?combineRecomputedResults for an explanation). To this end, if ref is a list with multiple references, marker genes are identified separately within each reference if genes = NULL. Rank calculation and index construction is then performed within each reference separately. The result is identical to lapplying over a list of references and runing trainSingleR on each reference.

Alternatively, genes can still be used to explicitly specify marker genes for each label in each of multiple references. This is achieved by passing a list of lists to genes, where each inner list corresponds to a reference in ref and can be of any format described in “Custom feature specification”. Thus, it is possible for genes to be - wait for it - a list (per reference) of lists (per label) of lists (per label) of character vectors.

Aggregating single-cell references

It is generally unnecessary to have single-cell resolution on the reference profiles. We can instead set aggr.ref=TRUE to aggregate per-cell references into a set of pseudo-bulk profiles using aggregateReference. This improves classification speed while using vector quantization to preserve within-label heterogeneity and mitigate the loss of information. Note that any aggregation is done after marker gene detection; this ensures that the relevant tests can appropriately penalize within-label variation. Users should also be sure to set the seed as the aggregation involves randomization.

Author(s)

Aaron Lun, based on the original SingleR code by Dvir Aran.

Examples

# Making up some data for a quick demonstration.
ref <- .mockRefData()

# Normalizing and log-transforming for automated marker detection.
ref <- scuttle::logNormCounts(ref)

trained <- trainSingleR(ref, ref$label)
trained
length(trained$markers$unique)

# Alternatively, supplying a custom set of markers from pairwise comparisons.
all.labels <- unique(ref$label)
custom.markers <- list()
for (x in all.labels) {
    current.markers <- lapply(all.labels, function(x) sample(rownames(ref), 20))
    names(current.markers) <- all.labels
    current.markers[[x]] <- character(0)
    custom.markers[[x]] <- current.markers
}
custom.trained <- trainSingleR(ref, ref$label, genes=custom.markers)

# Alternatively, supplying a custom set of markers for each label.
custom.markers <- list()
for (x in all.labels) {
    custom.markers[[x]] <- sample(rownames(ref), 20)
}
custom.trained <- trainSingleR(ref, ref$label, genes=custom.markers)

LTLA/SingleR documentation built on June 15, 2025, 4:13 a.m.