combineResults: Combine SingleR results

View source: R/combineResults.R

Description

Combine results from multiple runs of classifySingleR (usually against different references) into a single DataFrame. The label from the results with the highest score for each cell is retained.

Usage

combineResults(results)

Arguments

results

A list of DataFrame prediction results as returned by classifySingleR when run on each reference separately.

Details

Labels are combined across results based on the highest score in each reference. Each result should be generated from training sets that use a common set of genes during classification, i.e., common.genes should be the same in the trained argument to each classifySingleR call. This is because the scores are not comparable across results if they were generated from different sets of genes.

It is unlikely that this method will be called directly by the end-user. Users are advised to use the multi-reference mode of SingleR, trainSingleR and/or classifySingleR, which will take care of the use of a common set of genes before calling this function to combine results across references.

If this function must be called manually, users should ensure that common.genes is the same for all calls used to generate results. This is most easily achieved by calling trainSingleR on each reference; replacing each common.genes with the union of all common.genes; and then calling classifySingleR on the test with the modified training objects. The resulting DataFrames can then be passed as results above.
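The gene-harmonization step described above can be sketched in base R. This is a minimal illustration only: trained1 and trained2 are hypothetical stand-ins for the objects returned by trainSingleR, mocked here as plain lists that hold nothing but a common.genes element.

```r
# Hypothetical stand-ins for trainSingleR() output; only common.genes is mocked.
trained1 <- list(common.genes = c("GENE_1", "GENE_2", "GENE_3"))
trained2 <- list(common.genes = c("GENE_2", "GENE_3", "GENE_4"))
trained <- list(ref1 = trained1, ref2 = trained2)

# Take the union of the gene sets across all references.
all.genes <- Reduce(union, lapply(trained, function(x) x$common.genes))

# Overwrite common.genes in every training object with the shared set.
trained <- lapply(trained, function(x) {
    x$common.genes <- all.genes
    x
})

# With real training objects, each element would then be passed to
# classifySingleR() along with the test data, and the resulting
# DataFrames collected into the list given to combineResults().
```

After this step every training object refers to the same genes, so scores computed from them should be comparable across references.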

Value

A DataFrame is returned containing the annotation statistics for each cell or cluster (row). This has the same fields as the output of classifySingleR, where the scores are combined across all results. The labels reported for each cell are those from the DataFrame with the largest maximum score for that cell. The original results are available in the orig.results field.

Method rationale

There are three obvious options for combining reference datasets or classification results stemming from disparate references:

Option 1 would be to combine the reference datasets into a single matrix and treat each label as though it is specific to the reference from which it originated (e.g., Ref1-Bcell vs Ref2-Bcell), which is easily accomplished by using paste to prepend the reference name to the corresponding set of labels. This option avoids the need for time-consuming label harmonization between references, and may be the best approach if the differences between the reference sets are important (e.g., different experimental conditions).

That said, the fact that we are comparing across references means that the marker set is likely to contain genes responsible for uninteresting batch effects. This will increase noise during the calculation of the score in each reference, possibly leading to a loss of precision and a greater risk of technical variation dominating the classification results.
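As a rough illustration of Option 1, the label-prefixing and matrix-combining steps might look like the sketch below. The matrices, labels and reference names are all hypothetical toy values, and restricting to the intersection of genes is an assumption made here to keep the combined matrix rectangular.

```r
# Hypothetical expression matrices (genes in rows, samples in columns).
mat1 <- matrix(1:9, nrow = 3,
    dimnames = list(c("GENE_1", "GENE_2", "GENE_3"), NULL))
mat2 <- matrix(1:6, nrow = 3,
    dimnames = list(c("GENE_2", "GENE_3", "GENE_4"), NULL))

# Hypothetical labels, one per sample in each reference.
labels1 <- c("Bcell", "Tcell", "Bcell")
labels2 <- c("Bcell", "NK")

# Keep only the genes present in both references, then combine by column.
common <- intersect(rownames(mat1), rownames(mat2))
combined <- cbind(mat1[common, , drop = FALSE], mat2[common, , drop = FALSE])

# Prefix each label with the name of its reference of origin.
combined.labels <- c(
    paste("Ref1", labels1, sep = "-"),
    paste("Ref2", labels2, sep = "-")
)
```

Here "Ref1-Bcell" and "Ref2-Bcell" remain distinct labels, so no harmonization is needed, but any marker detection run on the combined matrix would also compare them against each other.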

Option 2 would also involve combining the reference datasets into a single matrix but would harmonize the labels so that the same cell type is given the same label across references. This would allow feature selection methods to identify robust sets of label-specific markers that are more likely to generalize to other datasets. It would also simplify interpretation, as there is no need to worry about the reference from which the labels came.

The main obstacle to this method is the difficulty of harmonization. Putting aside trivial differences in naming schemes (e.g., "B cell" vs "B"), we must resolve additional challenges like differences in label resolution across references (e.g., how to harmonize "B cell" to another reference that splits to "naive B cell" and "mature B cell"), different sorting strategies for obtaining pure cell types, or other subtle biological differences that require domain expertise.
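To make the harmonization problem concrete, the sketch below maps two hypothetical label sets onto a shared vocabulary with a hand-written lookup table. Collapsing the finer labels to "B cell" is only one possible choice; it discards the naive/mature distinction, which is exactly the kind of decision that needs domain expertise.

```r
# Hypothetical labels from two references with different naming and resolution.
labels1 <- c("B", "T cell", "B")
labels2 <- c("naive B cell", "mature B cell", "T cell")

# Hypothetical, manually curated lookup table: original label -> shared label.
harmonize <- c(
    "B" = "B cell",
    "naive B cell" = "B cell",
    "mature B cell" = "B cell",
    "T cell" = "T cell"
)

# Translate each label set through the lookup table.
harmonized1 <- unname(harmonize[labels1])
harmonized2 <- unname(harmonize[labels2])
```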

Option 3 is the method that is implemented in this function. It involves performing classification separately within each reference, then collating the results to choose the label with the highest score across references. This is a relatively expedient approach that avoids the need for explicit harmonization while also reducing the potential for reference-specific markers.

It leaves a mixture of labels in the final results that is up to the user to resolve, though perhaps this may be considered a feature as it smoothly handles differences in resolution between references, e.g., a cell that cannot be resolved as a CD4+ or CD8+ T cell may simply fall back to "T cell". It will also be somewhat suboptimal if there are many reference-specific labels, as markers are not identified with the aim of distinguishing a label in one reference from another label in another reference.
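The selection rule behind Option 3 can be illustrated with a small base R sketch. This is a simplified illustration under stated assumptions, not the package's actual implementation: the score matrices are hypothetical toy values, and ties are broken in favour of the first reference.

```r
# Hypothetical score matrices (cells in rows, labels in columns), one per reference.
scores1 <- matrix(c(0.8, 0.2,
                    0.3, 0.4,
                    0.1, 0.6),
    ncol = 2, byrow = TRUE, dimnames = list(NULL, c("A", "B")))
scores2 <- matrix(c(0.5, 0.1,
                    0.7, 0.2,
                    0.2, 0.3),
    ncol = 2, byrow = TRUE, dimnames = list(NULL, c("F", "G")))
all.scores <- list(ref1 = scores1, ref2 = scores2)

# Best label and best score per cell within each reference.
best.label <- sapply(all.scores, function(s) {
    colnames(s)[max.col(s, ties.method = "first")]
})
best.score <- sapply(all.scores, function(s) apply(s, 1, max))

# For each cell, choose the reference whose best score is highest.
chosen.ref <- max.col(best.score, ties.method = "first")
idx <- cbind(seq_along(chosen.ref), chosen.ref)

combined <- data.frame(
    label = best.label[idx],
    reference = colnames(best.score)[chosen.ref],
    score = best.score[idx]
)
```

In this toy example the second cell takes its label from the second reference, while the other two cells keep labels from the first, giving the mixture of labels discussed above.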

Author(s)

Jared Andrews

See Also

matchReferences, to harmonize labels between reference datasets.

SingleR and classifySingleR, for generating predictions to use in results.

Examples

# Attaching SingleR also makes SummarizedExperiment() and DataFrame() available.
library(SingleR)

##############################
## Mocking up training data ##
##############################

Ngroups <- 5
Ngenes <- 1000
means <- matrix(rnorm(Ngenes*Ngroups), nrow=Ngenes)
means[1:900,] <- 0
colnames(means) <- LETTERS[1:5]

g <- rep(LETTERS[1:5], each=4)
g2 <- rep(LETTERS[6:10], each=4)
ref1 <- SummarizedExperiment(
    list(counts=matrix(rpois(1000*length(g), 
    lambda=10*2^means[,g]), ncol=length(g))),
    colData=DataFrame(label=g)
)
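# ref2 reuses the per-group means of ref1 (columns indexed via g),
# but its samples carry a different set of labels (F to J).
ref2 <- SummarizedExperiment(
    list(counts=matrix(rpois(1000*length(g2), 
    lambda=10*2^means[,g]), ncol=length(g2))),
    colData=DataFrame(label=g2)
)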
rownames(ref1) <- sprintf("GENE_%s", seq_len(nrow(ref1)))
rownames(ref2) <- sprintf("GENE_%s", seq_len(nrow(ref2)))

ref1 <- scater::logNormCounts(ref1)
ref2 <- scater::logNormCounts(ref2)

###############################
## Mocking up some test data ##
###############################

N <- 100
g <- sample(LETTERS[1:5], N, replace=TRUE)
means <- matrix(rnorm(Ngenes*Ngroups), nrow=Ngenes)
means[1:900,] <- 0
colnames(means) <- LETTERS[1:5]
test <- SummarizedExperiment(
    list(counts=matrix(rpois(1000*N, lambda=2^means[,g]), ncol=N)),
    colData=DataFrame(label=g)
)

rownames(test) <- sprintf("GENE_%s", seq_len(nrow(test)))
test <- scater::logNormCounts(test)

###############################
## Performing classification ##
###############################

pred1 <- SingleR(test, ref1, labels=ref1$label)
pred2 <- SingleR(test, ref2, labels=ref2$label)

pred3 <- SingleR(test, ref1, labels=ref1$label, 
    method="cluster", clusters=test$label) 
pred4 <- SingleR(test, ref2, labels=ref2$label, 
    method="cluster", clusters=test$label) 

###############################
##     Combining results     ##
###############################

pred.single <- combineResults(list("pred1" = pred1, "pred2" = pred2))
pred.clust <- combineResults(list("pred3" = pred3, "pred4" = pred4))

SingleR documentation built on Jan. 9, 2020, 2:01 a.m.