make_ref_similarity_names: make_ref_similarity_names
In celaref: Single-cell RNAseq cell cluster labelling by reference

View source: R/group_labelling_functions.r

Construct some sensible labels for the groups/clusters in the query dataset, based on similarity to the reference dataset.

make_ref_similarity_names_using_marked is a lower-level, more customisable version of make_ref_similarity_names (which you would usually use instead). It suits the rare cases where you want to reuse an existing de_table.ref.marked object, use a de_table.ref.marked table with more than one dataset present (discouraged), or skip the reciprocal comparison step.

make_ref_similarity_names(de_table.test, de_table.ref, pval = 0.01,
  num_steps = 5, rankmetric = "TOP100_LOWER_CI_GTE1", n = 100)

make_ref_similarity_names_using_marked(de_table.ref.marked,
  de_table.recip.marked = NA, the_test_dataset = NA,
  the_ref_dataset = NA, pval = 0.01, num_steps = 5)

`de_table.test`	A differential expression table of the query experiment, as generated from `contrast_each_group_to_the_rest`
`de_table.ref`	A differential expression table of the reference dataset, as generated from `contrast_each_group_to_the_rest`
`pval`	Differences between the rescaled ranking distributions of 'top' genes on different reference groups are tested with a Mann-Whitney U test. If they are significantly different, only the top group(s) are reported. This isn't a simple cutoff threshold, as it can change the number of similar groups reported, i.e. a more stringent pval is more likely to decide that groups are similar, which results in either multiple groups being reported or no similarity at all. This parameter is unlikely to ever need changing. Default = 0.01.
`num_steps`	After ranking reference groups by median 'top' gene ranking, how many adjacent pairs to test for differences. Set to 1 to compare each group only to the next, or NA to perform an all-vs-all comparison. Setting it too low may miss groups with some similarity to the reported matches (see the similar_non_match column). Setting it too high (or NA) with a large number of reference groups could be slow. Default = 5.
`rankmetric`	Specify the ranking method used to pick the 'top' genes. The default 'TOP100_LOWER_CI_GTE1' picks genes from the top 100 overrepresented genes (ranked by inner 95% confidence interval), and appears to work best for distinct cell types (e.g. tissue samples). 'TOP100_SIG' again picks from the top 100 ranked genes, but requires only statistical significance (95% confidence), and may suit more similar cell clusters (e.g. PBMCs).
`n`	The maximum number of genes returned by the ranking method. Default = 100.
`de_table.ref.marked`	The output of `get_the_up_genes_for_all_possible_groups` for the contrast of interest.
`de_table.recip.marked`	Optional. The (reciprocal) output of `get_the_up_genes_for_all_possible_groups` with the test and reference datasets swapped. If omitted a reciprocal test will not be done. Default = NA.
`the_test_dataset`	Optional. A short, meaningful name for the query experiment. (Should match the test_dataset column in de_table.marked.) Only needed for a table with more than one dataset. Default = NA.
`the_ref_dataset`	Optional. A short, meaningful name for the reference experiment. (Should match the dataset column in de_table.marked.) Only needed for a table with more than one dataset. Default = NA.

This function aims to report a) the single most similar reference group, if there's a clear frontrunner, b) a list of multiple similar groups, if several are comparably similar, or c) 'No similarity', if there is none.

Each group is named according to the following rules, which test for significantly smaller rescaled ranks using a one-directional Mann-Whitney U test:

The first reference group (as ranked by median rescaled rank) is significantly more similar than the next: Report first only.
No group is significantly different from its neighbour when the groups, ranked by median rescaled rank, are compared stepwise: Report no similarity.
There are no significant differences in the stepwise comparisons of the first N reference groups, but there is a significant difference later on: Report multiple group similarity.
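
The three rules above can be sketched in base R. This is only an illustration under stated assumptions, not the celaref implementation: `label_from_ranks` is a hypothetical helper, the input is assumed to be a named list of rescaled-rank vectors (one per reference group), and it only ever compares each group with its immediate neighbour (roughly the num_steps = 1 case).

```r
# Hypothetical helper (not part of celaref): apply the three naming rules
# to a named list of rescaled-rank vectors, one vector per reference group.
label_from_ranks <- function(ranks_by_group, pval = 0.01) {
  # Order reference groups from most to least similar (ascending median rank).
  ordered <- ranks_by_group[order(sapply(ranks_by_group, median))]
  n_groups <- length(ordered)
  if (n_groups < 2) return("no_similarity")

  # One-directional Mann-Whitney U test of each group against its neighbour.
  stepped_pvals <- sapply(seq_len(n_groups - 1), function(i) {
    wilcox.test(ordered[[i]], ordered[[i + 1]], alternative = "less")$p.value
  })

  # The first significant step marks the end of the matched groups.
  first_sig <- which(stepped_pvals < pval)[1]
  if (is.na(first_sig)) return("no_similarity")
  paste(names(ordered)[seq_len(first_sig)], collapse = "|")
}

# Made-up ranks: two near-identical groups and one clearly worse group.
base_ranks <- seq(0.05, 0.25, length.out = 30)
demo_ranks <- list(
  mesodermal  = base_ranks + 0.002,
  fibroblast  = base_ranks + 0.5,
  endothelial = base_ranks
)
print(label_from_ranks(demo_ranks))
```

With these made-up ranks the first two groups cannot be separated but both differ from the third, so the sketch reports multiple group similarity.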

There are some further heuristic caveats:

The distribution of top genes in the last (or only) matched group is tested against a theoretical random distribution around 0.5 (reported in the pval_to_random column). If the distribution is not significantly above random (possible in edge cases where the dataset is skewed and there are few or no matches), no similarity is reported. The significant pval column is left intact.
The comparison is repeated reciprocally, with the reference groups tested against the query groups. This helps sensitivity for heterogeneous query groups, and investigating the reciprocal matches can be informative in these cases. If a query group doesn't 'match' a reference group, but the reference group does match that query group, it is reported in the group label in brackets, e.g. c1:th_lymphocytes(tc_lymphocytes). This is possible even when there was no match (and pval = NA), e.g. c2:(tc_lymphocytes)
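
The test against random can be illustrated with a one-sided binomial test in base R. This is a minimal sketch, assuming the test counts how many rescaled ranks fall below 0.5; `pval_vs_random_sketch` is a hypothetical name and the rank vectors are made up for illustration, so the details may differ from what celaref does internally.

```r
# Hypothetical helper (not part of celaref): one-sided binomial test of
# whether more rescaled ranks fall below 0.5 than expected by chance.
pval_vs_random_sketch <- function(rescaled_ranks) {
  n_below <- sum(rescaled_ranks < 0.5)
  binom.test(n_below, length(rescaled_ranks), p = 0.5,
             alternative = "greater")$p.value
}

# Made-up example: 18 of 20 top genes rank near the top of the reference group.
enriched_ranks <- c(rep(0.1, 18), rep(0.9, 2))
# Made-up example: ranks spread evenly, as expected for a random match.
random_ranks <- seq(0.025, 0.975, length.out = 20)

print(pval_vs_random_sketch(enriched_ranks))
print(pval_vs_random_sketch(random_ranks))
```

In this sketch only the first vector would pass a 0.01 threshold, so only it would keep its label rather than being downgraded to no similarity.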

The similarity is formatted into a group label. Where there are multiple similar groups, they're listed from most to least similar by their median ranks.

For instance, a query dataset of clusters c1, c2, c3 and c4 compared against a cell-type labelled reference dataset might get names like:

c1:macrophage
c2:endothelial|mesodermal
c3:no_similarity
c4:mesodermal(endothelial)
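
The label format can be sketched as a small string-building function. `format_shortlab_sketch` is a hypothetical helper, not a celaref function; in particular, joining several reciprocal-only matches with "|" inside the brackets is an assumption made for this illustration.

```r
# Hypothetical helper (not part of celaref): build a group label from the
# matched reference groups and any reciprocal-only matches.
format_shortlab_sketch <- function(test_group,
                                   matches = character(0),
                                   reciprocal_matches = character(0)) {
  # Reciprocal matches already listed as matches are not repeated in brackets.
  recip_only <- setdiff(reciprocal_matches, matches)
  main <- paste(matches, collapse = "|")
  extra <- if (length(recip_only) > 0) {
    paste0("(", paste(recip_only, collapse = "|"), ")")
  } else {
    ""
  }
  # Nothing matched in either direction.
  if (main == "" && extra == "") main <- "no_similarity"
  paste0(test_group, ":", main, extra)
}

format_shortlab_sketch("c1", "macrophage")
# "c1:macrophage"
format_shortlab_sketch("c2", c("endothelial", "mesodermal"))
# "c2:endothelial|mesodermal"
format_shortlab_sketch("c3")
# "c3:no_similarity"
format_shortlab_sketch("c4", "mesodermal", c("mesodermal", "endothelial"))
# "c4:mesodermal(endothelial)"
```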

Function make_ref_similarity_names is a convenience wrapper for make_ref_similarity_names_using_marked. It accepts two 'de_table' outputs of function contrast_each_group_to_the_rest to compare, and handles running get_the_up_genes_for_all_possible_groups. Sister function make_ref_similarity_names_using_marked may (rarely) be of use if the de_table.marked object has already been created, or if reciprocal tests are not wanted.

A table of automagically-generated labels for each query group, given their similarity to reference groups.

The columns in this table:

test_group : Query group e.g. "c1"
shortlab : The cluster label described above e.g. "c1:macrophage"
pval : If a similarity is flagged, this is the P-value from a Mann-Whitney U test of the last 'matched' group against the adjacent 'non-matched' group, i.e. if there is only one label in shortlab this will be the first of the stepped_pvals, and if there are two it will be the second. If there is 'no_similarity' this will be NA (because there is no way to decide which of the non-significant stepped p-values is the most appropriate).
stepped_pvals : P-values from Mann-Whitney U tests across adjacent pairs of reference groups, ordered from most to least similar (ascending median rank), i.e. 1st-2nd most similar first, then 2nd-3rd, 3rd-4th, etc. The last value will always be NA (no more reference groups). e.g. refA:8.44e-10,refB:2.37e-06,refC:0.000818,refD:0.435,refE:0.245,refF:NA
pval_to_random : P-value of a test that the median rank (of the last matched reference group) is below random, from a binomial test on top gene ranks (being < 0.5).
matches : List of all reference groups that 'match', as described, except it also includes (rare) examples where pval_to_random is not significant. "|" delimited.
reciprocal_matches : List of all reference groups that flagged the test group as a match when the direction of comparison is reversed (significant pval and pval_to_random). "|" delimited.
similar_non_match : This column lists any reference groups outside of shortlab that are not significantly different to a reported match group. Limited by num_steps, and will never find anything if num_steps==1. "|" delimited. Usually NA.
similar_non_match_detail : P-values for the similar_non_match results. These p-values will always be non-significant. E.g. "A > C (p=0.0214,n.s)". "|" delimited. Usually NA.
differences_within : This field lists any pairs of reference groups in shortlab that are significantly different. "|" delimited. Usually NA.
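
Because stepped_pvals is stored as a single delimited string, it can be handy to split it back into numbers when inspecting a result. The sketch below uses only base R and assumes the "group:pvalue" comma-separated layout of the example string shown for stepped_pvals; `parse_stepped_pvals` is a hypothetical helper, not part of celaref.

```r
# Hypothetical helper (not part of celaref): turn a stepped_pvals string
# into a named numeric vector, one p-value per reference group.
parse_stepped_pvals <- function(stepped) {
  parts <- strsplit(stepped, ",", fixed = TRUE)[[1]]
  group <- sub(":[^:]*$", "", parts)   # text before the last colon
  value <- sub("^.*:", "", parts)      # text after the last colon
  # The trailing "NA" entry becomes a numeric NA.
  setNames(suppressWarnings(as.numeric(value)), group)
}

stepped <- "refA:8.44e-10,refB:2.37e-06,refC:0.000818,refD:0.435,refE:0.245,refF:NA"
stepped_p <- parse_stepped_pvals(stepped)
print(stepped_p)

# Steps that are significant at the default pval of 0.01.
print(names(which(stepped_p < 0.01)))
```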

make_ref_similarity_names_using_marked: Construct some sensible cluster labels, but using a premade marked table.

contrast_each_group_to_the_rest For preparing de_table input

get_the_up_genes_for_all_possible_groups To prepare the de_table.ref.marked input.

# Make input
# de_table.demo_query <- contrast_each_group_to_the_rest(demo_query_se, "demo_query")
# de_table.demo_ref   <- contrast_each_group_to_the_rest(demo_ref_se,   "demo_ref")

make_ref_similarity_names(de_table.demo_query, de_table.demo_ref)
make_ref_similarity_names(de_table.demo_query, de_table.demo_ref, num_steps=3)
make_ref_similarity_names(de_table.demo_query, de_table.demo_ref, num_steps=NA)


# Make input
# de_table.demo_query <- contrast_each_group_to_the_rest(demo_query_se, "demo_query")
# de_table.demo_ref   <- contrast_each_group_to_the_rest(demo_ref_se,   "demo_ref")

# Mark the 'top' genes for each direction of comparison
de_table.marked.query_vs_ref <- get_the_up_genes_for_all_possible_groups(
     de_table.demo_query, de_table.demo_ref)
de_table.marked.reciprocal <- get_the_up_genes_for_all_possible_groups(
     de_table.demo_ref, de_table.demo_query)

# With the reciprocal comparison
make_ref_similarity_names_using_marked(de_table.marked.query_vs_ref,
                                       de_table.marked.reciprocal)

# Without the reciprocal comparison
make_ref_similarity_names_using_marked(de_table.marked.query_vs_ref)