trainSingleR: Train the SingleR classifier


View source: R/trainSingleR.R

Description

Train the SingleR classifier on one or more reference datasets with known labels.

Usage

trainSingleR(
  ref,
  labels,
  genes = "de",
  sd.thresh = 1,
  de.method = c("classic", "wilcox", "t"),
  de.n = NULL,
  de.args = list(),
  aggr.ref = FALSE,
  aggr.args = list(),
  recompute = TRUE,
  restrict = NULL,
  assay.type = "logcounts",
  check.missing = TRUE,
  BNPARAM = KmknnParam()
)

Arguments

ref

A numeric matrix of expression values where rows are genes and columns are reference samples (individual cells or bulk samples). Each row should be named with the gene name. In general, the expression values are expected to be log-transformed, see Details.

Alternatively, a SummarizedExperiment object containing such a matrix.

Alternatively, a list or List of SummarizedExperiment objects or numeric matrices containing multiple references, in which case the row names are expected to be the same across all objects.

labels

A character vector or factor of known labels for all samples in ref.

Alternatively, if ref is a list, labels should be a list of the same length. Each element should contain a character vector or factor specifying the label for the corresponding entry of ref.

genes

A string specifying the feature selection method to be used, see Details.

Alternatively, if ref is not a list, genes can be either:

  • A list of lists of character vectors containing DE genes between pairs of labels.

  • A list of character vectors containing marker genes for each label.

If ref is a list, genes can be a list of length equal to ref. Each element of the list should be one of the two above choices described for non-list ref, containing markers for labels in the corresponding entry of ref.

sd.thresh

A numeric scalar specifying the minimum threshold on the standard deviation per gene. Only used when genes="sd".

de.method

String specifying how DE genes should be detected between pairs of labels. Defaults to "classic", which sorts genes by the log-fold changes and takes the top de.n. Setting to "wilcox" or "t" will use the Wilcoxon rank sum test or Welch t-test between labels, respectively, and take the top de.n upregulated genes per comparison.

de.n

An integer scalar specifying the number of DE genes to use when genes="de". If de.method="classic", defaults to 500 * (2/3) ^ log2(N) where N is the number of unique labels. Otherwise, defaults to 10.

de.args

Named list of additional arguments to pass to pairwiseWilcox or pairwiseTTests when de.method="wilcox" or "t", respectively.

aggr.ref

Logical scalar indicating whether references should be aggregated to pseudo-bulk samples for speed, see aggregateReference.

aggr.args

Further arguments to pass to aggregateReference when aggr.ref=TRUE.

recompute

Logical scalar indicating whether to set up indices for later recomputation of scores, when ref contains multiple references from which the individual results are to be combined. (See the difference between combineCommonResults and combineRecomputedResults.)

restrict

A character vector of gene names to use for marker selection. By default, all genes in ref are used.

assay.type

An integer scalar or string specifying the assay of ref containing the relevant expression matrix, if ref is a SummarizedExperiment object (or is a list that contains one or more such objects).

check.missing

Logical scalar indicating whether rows should be checked for missing values (and if found, removed).

BNPARAM

A BiocNeighborParam object specifying the algorithm to use for building nearest neighbor indices.

Details

This function uses a training data set to select interesting features and construct nearest neighbor indices in rank space. The resulting objects can be re-used across multiple classification steps with different test data sets via classifySingleR. This improves efficiency by avoiding unnecessary repetition of steps during the downstream analysis.

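Several options are available for feature selection through genes. Setting genes="de" selects genes that are differentially expressed between pairs of labels, as controlled by de.method and de.n. Setting genes="sd" selects genes whose standard deviation passes the sd.thresh threshold. In both cases, the expression values are expected to be log-transformed and normalized.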
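As a rough illustration of genes="de" with de.method="classic", the sketch below works through the default de.n formula and ranks genes by the difference in per-label medians. It is a simplified base R sketch, not the package's internal code: the toy matrix, the helper names default_de_n and top_markers, the rounding of de.n, and the restriction to positive differences are all assumptions made for the example.

```r
# Default de.n for the "classic" method: 500 * (2/3)^log2(N).
# Rounding to a whole number is an assumption of this sketch.
default_de_n <- function(n_labels) round(500 * (2/3)^log2(n_labels))
sapply(c(2, 4, 8, 16), default_de_n)

# Toy log-expression matrix: 6 genes x 6 samples, two samples per label.
mat <- matrix(
    c(5, 5, 1, 1, 1, 1,   # g1: high in A
      1, 1, 6, 6, 1, 1,   # g2: high in B
      1, 1, 1, 1, 7, 7,   # g3: high in C
      2, 2, 2, 2, 2, 2,   # g4: flat
      4, 4, 3, 3, 1, 1,   # g5: A > B > C
      0, 0, 0, 0, 0, 0),  # g6: unexpressed
    nrow = 6, byrow = TRUE,
    dimnames = list(paste0("g", 1:6), NULL)
)
labels <- rep(c("A", "B", "C"), each = 2)

# Per-label medians, giving a genes x labels matrix.
med <- sapply(split(seq_along(labels), labels), function(idx) {
    apply(mat[, idx, drop = FALSE], 1, median)
})

# Top 'n' genes upregulated in 'first' relative to 'second',
# ranked by the difference in medians of the log-expression values.
top_markers <- function(first, second, n) {
    lfc <- med[, first] - med[, second]
    lfc <- sort(lfc[lfc > 0], decreasing = TRUE)
    names(head(lfc, n))
}

top_markers("A", "B", n = 2)
```

With more labels, the default de.n shrinks, so fewer genes are taken from each pairwise comparison.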

If restrict is specified, ref is subsetted to only include the rows with names that are in restrict. Marker selection and all subsequent classification will be performed using this restrictive subset of genes. This can be convenient for ensuring that only appropriate genes are used (e.g., not pseudogenes or predicted genes).
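The effect of restrict can be pictured as a row subset applied before any marker selection. The following base R sketch uses a made-up matrix and made-up gene names; it only mirrors the behaviour described above and is not the function's actual implementation.

```r
# Hypothetical reference matrix with one unwanted row (a pseudogene).
ref_mat <- matrix(
    seq_len(12), nrow = 4,
    dimnames = list(c("GENE_1", "PSEUDO_1", "GENE_2", "GENE_3"),
                    paste0("sample", 1:3))
)

# Genes to allow; names absent from the reference are simply ignored here.
restrict <- c("GENE_1", "GENE_2", "GENE_3", "GENE_9")

# Keep only rows whose names appear in 'restrict'.
ref_sub <- ref_mat[rownames(ref_mat) %in% restrict, , drop = FALSE]
rownames(ref_sub)
```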

Value

For a single reference, a List is returned containing:

common.genes:

A character vector of all genes that were chosen by the designated feature selection method.

nn.indices:

A List of BiocNeighborIndex objects containing pre-constructed neighbor search indices.

original.exprs:

A List of numeric matrices where each matrix contains all cells for a particular label.

search:

A List of additional information on the feature selection, for use by classifySingleR. This includes mode, a string containing the selection method; args, method-specific arguments that can be re-used during classification; and extras, method-specific structures that can be re-used during classification.

For multiple references, a List of Lists is returned where each internal List corresponds to a reference in ref and has the same structure as described above.

Custom feature specification

Rather than relying on the in-built feature selection, users can pass in their own features of interest to genes. The function expects a named list of named lists of character vectors, with each vector containing the DE genes between a pair of labels. For example:

genes <- list(
   A = list(A = character(0), B = "GENE_1", C = c("GENE_2", "GENE_3")),
   B = list(A = "GENE_100", B = character(0), C = "GENE_200"),
   C = list(A = c("GENE_4", "GENE_5"), B = "GENE_5", C = character(0))
)

If we consider the entry genes$A$B, this contains marker genes for label "A" against label "B". That is, these genes are upregulated in "A" compared to "B". The outer list should have one list per label, and each inner list should have one character vector per label. (Obviously, a label cannot have markers against itself, so this is just set to character(0).)

Alternatively, genes can be a named list of character vectors containing per-label markers. For example:

genes <- list(
     A = c("GENE_1", "GENE_2", "GENE_3"),
     B = c("GENE_100", "GENE_200"),
     C = c("GENE_4", "GENE_5")
)

The entry genes$A represents the genes that are upregulated in A compared to some or all other labels. This allows the function to handle pre-defined marker lists for specific cell populations. However, it obviously captures less information than marker sets for the pairwise comparisons.
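To see how the two formats relate, the sketch below collapses a pairwise list into per-label markers by taking the union over each label's comparisons. This is only an illustration in base R with the hypothetical variable names pairwise and per_label; the collapse discards which comparison each gene came from, which is the information lost by the per-label format.

```r
# Pairwise format: pairwise[[x]][[y]] holds genes upregulated in x versus y.
pairwise <- list(
    A = list(A = character(0), B = "GENE_1", C = c("GENE_2", "GENE_3")),
    B = list(A = "GENE_100", B = character(0), C = "GENE_200"),
    C = list(A = c("GENE_4", "GENE_5"), B = "GENE_5", C = character(0))
)

# Sanity check: every inner list has one entry per label, in the same order.
stopifnot(all(vapply(pairwise, function(x) {
    identical(names(x), names(pairwise))
}, logical(1))))

# Per-label format: union of all markers for each label.
per_label <- lapply(pairwise, function(x) unique(unlist(x, use.names = FALSE)))
per_label
```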

If genes explicitly contains gene identities (as character vectors), ref can be the raw counts or any monotonic transformation thereof.

Dealing with multiple references

The default SingleR policy for dealing with multiple references is to perform the classification for each reference separately and combine the results (see ?combineRecomputedResults for an explanation). To this end, if ref is a list with multiple references, marker genes are identified separately within each reference when genes="de" or "sd". Rank calculation and index construction are then performed within each reference separately.

Alternatively, genes can still be used to explicitly specify marker genes for each label in each of multiple references. This is achieved by passing a list of lists to genes, where each inner list corresponds to a reference in ref and can be of any format described in “Custom feature specification”. Thus, it is possible for genes to be - wait for it - a list (per reference) of lists (per label) of lists (per label) of character vectors.
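A nested structure of this kind might look like the sketch below, written in base R with invented labels and gene names. The first element uses the pairwise format and the second uses the per-label format; the assumption is that the elements are matched to the entries of ref by position.

```r
genes_multi <- list(
    # Markers for the first reference, in pairwise format.
    list(
        A = list(A = character(0), B = "GENE_1"),
        B = list(A = "GENE_100", B = character(0))
    ),
    # Markers for the second reference, in per-label format.
    list(
        X = c("GENE_4", "GENE_5"),
        Y = c("GENE_6")
    )
)

# Genes upregulated in "A" versus "B" in the first reference.
genes_multi[[1]]$A$B

# Markers for label "X" in the second reference.
genes_multi[[2]]$X
```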

If recompute=TRUE, the output is exactly equivalent to running trainSingleR on each reference separately. If recompute=FALSE, trainSingleR is also run on each reference, but the final common set of genes consists of the union of common genes across all references. This is necessary to ensure that correlations are computed from the same set of genes across references and are thus reasonably comparable in combineCommonResults.
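The union step for recompute=FALSE can be sketched in base R as follows. The per-reference gene sets here are invented for the example, and the variable names are hypothetical; the snippet shows only the set operation, not how the function stores the result.

```r
# Hypothetical marker sets chosen independently within two references.
common_by_ref <- list(
    c("GENE_1", "GENE_2", "GENE_3"),
    c("GENE_2", "GENE_3", "GENE_4")
)

# With recompute=FALSE, every reference uses the union of these sets,
# so correlations are computed over the same genes in each reference.
union_genes <- Reduce(union, common_by_ref)
union_genes
```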

Note on single-cell references

The default marker selection is based on log-fold changes between the per-label medians and is very much designed with bulk references in mind. It may not be effective for single-cell reference data where it is not uncommon to have more than 50% zero counts for a given gene such that the median is also zero for each group. Users are recommended to either set de.method to another DE ranking method, or detect markers externally and pass a list of markers to genes (see Examples).
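The problem with medians on sparse data can be seen in a small base R example. The values below are invented log-expression values for one gene in two labels; the gene is detected in 4 of 10 cells for one label and in none for the other. The use of wilcox.test here is just a stand-in to show that a rank-based test can still separate the groups, and is not a claim about how de.method="wilcox" is implemented.

```r
# One gene, ten cells per label; more than half of the values are zero.
label_a <- c(0, 0, 0, 0, 0, 0, 2, 3, 4, 5)
label_b <- rep(0, 10)

# The difference in medians is zero, so a median-based ranking
# would not see this gene as a marker for label A.
median_diff <- median(label_a) - median(label_b)
median_diff

# The difference in means is positive.
mean_diff <- mean(label_a) - mean(label_b)
mean_diff

# A one-sided rank sum test still detects upregulation in label A.
# exact = FALSE avoids the warning about ties.
p_up <- wilcox.test(label_a, label_b, alternative = "greater",
                    exact = FALSE)$p.value
p_up
```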

In addition, it is generally unnecessary to have single-cell resolution on the reference profiles. We can instead set aggr.ref=TRUE to aggregate per-cell references into a set of pseudo-bulk profiles using aggregateReference. This improves classification speed while using vector quantization to preserve within-label heterogeneity and mitigate the loss of information. Note that any aggregation is done after marker gene detection; this ensures that the relevant tests can appropriately penalize within-label variation. Users should also be sure to set the seed as the aggregation involves randomization.
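A minimal sketch of training with aggregation is shown below, reusing the mock data helper from the Examples. The seed value is arbitrary, and the requireNamespace guard is only there so that the snippet does nothing when the packages are not installed.

```r
trained_aggr <- NULL

if (requireNamespace("SingleR", quietly = TRUE) &&
    requireNamespace("scuttle", quietly = TRUE)) {
    # Mock reference data, normalized and log-transformed.
    ref <- SingleR::.mockRefData()
    ref <- scuttle::logNormCounts(ref)

    # Aggregation involves randomization, so fix the seed first.
    set.seed(1000)
    trained_aggr <- SingleR::trainSingleR(ref, ref$label, aggr.ref = TRUE)
}
```

Running the same code with the same seed should then give the same pseudo-bulk profiles.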

Author(s)

Aaron Lun, based on the original SingleR code by Dvir Aran.

See Also

classifySingleR, where the output of this function gets used.

combineCommonResults and combineRecomputedResults, to combine results from multiple references.

Examples

# Making up some data for a quick demonstration.
ref <- .mockRefData()

# Normalizing and log-transforming for automated marker detection.
ref <- scuttle::logNormCounts(ref)

trained <- trainSingleR(ref, ref$label)
trained
trained$nn.indices
length(trained$common.genes)

# Alternatively, computing and supplying a set of label-specific markers.
by.t <- scran::pairwiseTTests(assay(ref, "logcounts"), ref$label, direction="up")
markers <- scran::getTopMarkers(by.t[[1]], by.t[[2]], n=10)
trained <- trainSingleR(ref, ref$label, genes=markers)
length(trained$common.genes)

SingleR documentation built on Feb. 4, 2021, 2:01 a.m.