make_ref_similarity_names: make_ref_similarity_names

Description

Construct sensible labels for the groups/clusters in the query dataset, based on their similarity to the reference dataset.

make_ref_similarity_names_using_marked is a lower-level, more customisable version of make_ref_similarity_names (which would usually be used instead). It is suitable for the rare cases where an existing de_table.ref.marked object is to be reused, where a de_table.ref.marked table contains more than one dataset (discouraged), or where the reciprocal comparison step should be skipped.

Usage

make_ref_similarity_names(de_table.test, de_table.ref, pval = 0.01,
  num_steps = 5, rankmetric = "TOP100_LOWER_CI_GTE1", n = 100)

make_ref_similarity_names_using_marked(de_table.ref.marked,
  de_table.recip.marked = NA, the_test_dataset = NA,
  the_ref_dataset = NA, pval = 0.01, num_steps = 5)

Arguments

de_table.test

A differential expression table of the query experiment, as generated from contrast_each_group_to_the_rest

de_table.ref

A differential expression table of the reference dataset, as generated from contrast_each_group_to_the_rest

pval

Differences between the rescaled ranking distributions of 'top' genes on different reference groups are tested with a Mann-Whitney U test. If they are significantly different, only the top group(s) are reported. It isn't a simple cutoff threshold, as it can change the number of similar groups reported; i.e. a more stringent pval is more likely to decide that groups are similar, which would result in multiple groups being reported, or no similarity at all. It is unlikely that this parameter will ever need to change. Default = 0.01.
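As a rough illustration of the kind of test pval applies to, the sketch below runs a one-directional Mann-Whitney U test with base R's wilcox.test. The rank vectors are made-up values, and the convention that a lower rescaled rank means "nearer the top" is an assumption made for this sketch; it is not taken from the package internals.

```r
# Made-up rescaled ranks of one query group's 'top' genes within two
# reference groups. Assumption for this sketch: 0 = top of the ranking.
set.seed(42)
ranks_in_ref_a <- runif(50, min = 0.0, max = 0.3)  # genes cluster near the top
ranks_in_ref_b <- runif(50, min = 0.2, max = 1.0)  # genes are spread out

# One-directional test: are the ranks in A shifted lower than those in B?
res <- wilcox.test(ranks_in_ref_a, ranks_in_ref_b, alternative = "less")

pval <- 0.01
is_different <- res$p.value < pval  # TRUE for these well-separated values
```

With a significant difference like this one, only the better-ranked group would be a candidate for reporting; with a non-significant result the two groups would be treated as similarly good (or similarly poor) matches.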

num_steps

After ranking reference groups according to median 'top' gene ranking, how many adjacent pairs to test for differences. Set to 1 to only compare each group to the next, or NA to perform an all-vs-all comparison. Setting it too low means it is possible to miss groups with some similarity to the reported matches (similar_non_match column). Setting it too high (or NA) with a large number of reference groups could be slow. Default = 5.
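To give a feel for how num_steps affects the amount of work, here is a minimal sketch that counts the pairs of ranked groups that would be compared. The helper pairs_to_test is hypothetical and not part of celaref; it encodes one reading of the description above (compare each group with up to num_steps following groups), which may differ from the package's actual behaviour.

```r
# Hypothetical helper, not part of celaref. Groups are assumed to be already
# ordered by median rescaled rank and are referred to by position (1..n_groups).
pairs_to_test <- function(n_groups, num_steps = 5) {
  pairs <- t(combn(n_groups, 2))             # every i < j pair
  colnames(pairs) <- c("i", "j")
  if (is.na(num_steps)) return(pairs)        # NA: all-vs-all
  keep <- pairs[, "j"] - pairs[, "i"] <= num_steps
  pairs[keep, , drop = FALSE]
}

nrow(pairs_to_test(8, num_steps = 1))   # 7 comparisons
nrow(pairs_to_test(8, num_steps = 5))   # 25 comparisons
nrow(pairs_to_test(8, num_steps = NA))  # 28 comparisons
```

Under this reading the all-vs-all count grows quadratically with the number of reference groups, which is consistent with the note that NA can be slow for large references.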

rankmetric

Specify the ranking method used to pick the 'top' genes. The default 'TOP100_LOWER_CI_GTE1' picks genes from the top 100 overrepresented genes (ranked by inner 95% [...] work best for distinct cell types (e.g. tissue samples). 'TOP100_SIG' again picks from the top 100 ranked genes, but requires only statistical significance, 95% [...] clusters (e.g. PBMCs).

n

For tweaking maximum returned genes from different ranking methods.

de_table.ref.marked

The output of get_the_up_genes_for_all_possible_groups for the contrast of interest.

de_table.recip.marked

Optional. The (reciprocal) output of get_the_up_genes_for_all_possible_groups with the test and reference datasets swapped. If omitted a reciprocal test will not be done. Default = NA.

the_test_dataset

Optional. A short meaningful name for the experiment. (Should match test_dataset column in de_table.marked). Only needed in a table of more than one dataset. Default = NA.

the_ref_dataset

Optional. A short meaningful name for the experiment. (Should match dataset column in de_table.marked). Only needed in a table of more than one dataset. Default = NA.

Details

This function aims to report a) the single most similar reference group, if there's a clear frontrunner, b) a list of multiple similar groups, if they are about equally similar, or c) 'No similarity', if there is none.

Each group is named according to the following rules, which test for significant (smaller) differences with a one-directional Mann-Whitney U test on the rescaled ranks:

  1. The first (as ranked by median rescaled rank) reference group is significantly more similar than the next: Report first only.

  2. When comparing differences between groups stepwise, ranked by median rescaled rank, no group is significantly different to its neighbour: Report no similarity.

  3. There are no significant differences in the stepwise comparisons of the first N reference groups, but there is a significant difference later on: Report multiple group similarity.
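The three rules above can be sketched as a small decision function. This is an illustration only: pick_similar_groups is a hypothetical helper, not the celaref implementation, and it assumes the p-values for each group versus its next-ranked neighbour have already been computed.

```r
# Hypothetical helper, not part of celaref.
# groups:         reference group names, ordered from most to least similar
#                 (by median rescaled rank)
# adjacent_pvals: p-value for group k versus group k + 1, so its length is
#                 length(groups) - 1
pick_similar_groups <- function(groups, adjacent_pvals, pval = 0.01) {
  first_break <- which(adjacent_pvals < pval)[1]   # first significant step
  if (is.na(first_break)) return(character(0))     # rule 2: no similarity
  groups[seq_len(first_break)]                     # rule 1 (one group) or rule 3 (several)
}

ref_groups <- c("a", "b", "c", "d")
pick_similar_groups(ref_groups, c(0.001, 0.5, 0.5))   # "a"
pick_similar_groups(ref_groups, c(0.5, 0.5, 0.5))     # character(0)
pick_similar_groups(ref_groups, c(0.5, 0.2, 0.001))   # "a" "b" "c"
```

The sketch deliberately ignores the further heuristics (the comparison against random and the reciprocal test), which adjust the result afterwards.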

There are some further heuristic caveats:

  1. The distribution of top genes in the last (or only) match group is tested versus a theoretical random distribution around 0.5 (as reported in the pval_vs_random column). If the distribution is not significantly above random (possible in edge cases where there is a skewed dataset and no/few matches), no similarity is reported. The significant pval column is left intact.

  2. The comparison is repeated reciprocally: reference groups vs the query groups. This helps sensitivity for heterogeneous query groups, and investigating the reciprocal matches can be informative in these cases. If a query group doesn't 'match' a reference group, but the reference group does match that query group, it is reported in the group label in brackets, e.g. c1:th_lymphocytes(tc_lymphocytes). This is even possible if there was no match (and pval = NA), e.g. c2:(tc_lymphocytes).

The similarity is formatted into a group label. Where there are multiple similar groups, they're listed from most to least similar by their median ranks.
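As a sketch of how labels of this shape could be assembled, the hypothetical helper below (not part of celaref) joins the matched reference groups and appends any reciprocal-only matches in brackets. The "|" separator between multiple groups is an assumption made for illustration; the text above does not state which separator the package uses.

```r
# Hypothetical helper, not part of celaref.
# matches:    reference groups matched by the query group, most similar first
# recip_only: reference groups that matched only in the reciprocal comparison
# sep:        separator between multiple groups (assumed, not confirmed)
format_group_label <- function(query_group, matches = character(0),
                               recip_only = character(0), sep = "|") {
  label <- paste(matches, collapse = sep)     # "" when there are no matches
  if (length(recip_only) > 0) {
    label <- paste0(label, "(", paste(recip_only, collapse = sep), ")")
  }
  paste0(query_group, ":", label)
}

format_group_label("c1", "th_lymphocytes", "tc_lymphocytes")
# "c1:th_lymphocytes(tc_lymphocytes)"
format_group_label("c2", recip_only = "tc_lymphocytes")
# "c2:(tc_lymphocytes)"
```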

For instance, a query dataset of clusters c1, c2, c3 and c4 against a cell-type labelled reference dataset might get names like:

Function make_ref_similarity_names is a convenience wrapper function for make_ref_similarity_names_using_marked. It accepts two 'de_table' outputs of function contrast_each_group_to_the_rest to compare and handles running get_the_up_genes_for_all_possible_groups. Sister function make_ref_similarity_names_using_marked may (rarely) be of use if the de_table.marked object has already been created, or if reciprocal tests are not wanted.

Value

A table of automagically-generated labels for each query group, given their similarity to reference groups.

The columns in this table:

Functions

See Also

contrast_each_group_to_the_rest For preparing de_table input

get_the_up_genes_for_all_possible_groups To prepare the de_table.ref.marked input.

Examples

# Make input
# de_table.demo_query <- contrast_each_group_to_the_rest(demo_query_se, "demo_query")
# de_table.demo_ref   <- contrast_each_group_to_the_rest(demo_ref_se,   "demo_ref")

make_ref_similarity_names(de_table.demo_query, de_table.demo_ref)
make_ref_similarity_names(de_table.demo_query, de_table.demo_ref, num_steps=3)
make_ref_similarity_names(de_table.demo_query, de_table.demo_ref, num_steps=NA)


# Make input
# de_table.demo_query <- contrast_each_group_to_the_rest(demo_query_se, "demo_query")
# de_table.demo_ref   <- contrast_each_group_to_the_rest(demo_ref_se,   "demo_ref")

de_table.marked.query_vs_ref <- get_the_up_genes_for_all_possible_groups(
     de_table.demo_query, de_table.demo_ref)
de_table.marked.reciprocal <- get_the_up_genes_for_all_possible_groups(
     de_table.demo_ref, de_table.demo_query)

# With the reciprocal comparison
make_ref_similarity_names_using_marked(de_table.marked.query_vs_ref,
                                       de_table.marked.reciprocal)

# Without the reciprocal comparison
make_ref_similarity_names_using_marked(de_table.marked.query_vs_ref)

MonashBioinformaticsPlatform/celaref documentation built on June 5, 2019, 11:35 a.m.