CHETAHclassifier: Identification of cell types aided by hierarchical clustering

Description

CHETAH classifies an input dataset by comparing it to a reference dataset in a stepwise, top-to-bottom fashion. See 'Details' for a full explanation. NOTE: We recommend using the default values for all parameters.

Usage

CHETAHclassifier(input, ref_cells = NULL, ref_profiles = NULL,
  ref_ct = "celltypes", input_c = NA, ref_c = NA, thresh = 0.1,
  gs_method = c("fc", "wilcox"), cor_method = c("spearman", "kendall",
  "pearson", "cosine"), clust_method = c("average", "single", "complete",
  "ward.D2", "ward.D", "mcquitty", "median", "centroid"),
  clust_dist = bioDist::spearman.dist, n_genes = 200,
  pc_thresh = 0.2, p_thresh = 0.05, fc_thresh = 1.5,
  subsample = FALSE, fix_ngenes = TRUE, plot.tree = FALSE,
  only_pos = FALSE, print_steps = FALSE)

Arguments

input

required: an input SingleCellExperiment (see Bioconductor and the vignette: browseVignettes("CHETAH")).

ref_cells

required: a reference SingleCellExperiment, with the cell types in the "celltypes" colData (or otherwise defined in ref_ct).

ref_profiles

optional: in case of bulk RNA-seq or microarrays, an expression matrix with one (average) reference expression profile per cell type in the columns ('ref_cells' must be left empty).

ref_ct

the name of the colData column of ref_cells where the cell types are stored.

input_c

the name of the assay of the input to use. NA (default) will use the first one.

ref_c

same as input_c, but for the reference.

thresh

the initial confidence threshold, which can be changed after running with Classify.

gs_method

method for gene selection, applied in every node of the tree.
"fc": quick method. Either a fixed number (n_genes) of genes with the highest fold-change is selected (default), or all genes with a fold-change higher than fc_thresh are selected (the latter is used when fix_ngenes = FALSE).
"wilcox": genes are selected based on fold-change (fc_thresh), percentage of expression (pc_thresh) and p-values (p_thresh). P-values are calculated with the Wilcoxon test.

cor_method

the correlation measure: one of: "spearman" (default), "kendall", "pearson", "cosine"
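
To make the four options concrete, here is a minimal base-R sketch of how a similarity between one input cell and one reference profile could be computed. The helper name profile_similarity and the toy vectors are hypothetical and this is not CHETAH's internal code. Base R's cor() covers "spearman", "kendall" and "pearson"; the cosine similarity is computed by hand.

```r
# Hypothetical helper: similarity between one input cell and one
# reference profile, both given as numeric vectors over the same genes.
profile_similarity <- function(cell, profile,
                               method = c("spearman", "kendall",
                                          "pearson", "cosine")) {
  method <- match.arg(method)
  if (method == "cosine") {
    # cor() has no cosine option, so compute it directly
    sum(cell * profile) / sqrt(sum(cell^2) * sum(profile^2))
  } else {
    cor(cell, profile, method = method)
  }
}

# Toy data: the profile is an exact multiple of the cell
cell    <- c(1, 2, 3, 4)
profile <- c(2, 4, 6, 8)
sims <- sapply(c("spearman", "pearson", "cosine"),
               function(m) profile_similarity(cell, profile, m))
```

Because the toy profile is a scaled copy of the cell, all three measures agree here; on real expression data they generally differ.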

clust_method

the method used for clustering the reference profiles. One of the methods from hclust

clust_dist

a distance measure, default: spearman.dist
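
As an illustration of how clust_method and clust_dist work together, the sketch below clusters a toy matrix of reference profiles with hclust. It assumes that 1 minus the Spearman correlation is an acceptable stand-in for the default distance; it does not call bioDist::spearman.dist, and the profile values are invented.

```r
# Toy reference: 6 genes (rows) x 4 cell-type profiles (columns)
ref_profiles <- cbind(
  Tcell  = c(9, 8, 1, 1, 2, 3),
  NKcell = c(8, 9, 2, 1, 3, 2),
  Fibro  = c(1, 2, 9, 8, 3, 1),
  Endo   = c(2, 1, 8, 9, 1, 3)
)
rownames(ref_profiles) <- paste0("gene", 1:6)

# Assumed stand-in for the default distance:
# 1 - Spearman correlation between the cell-type profiles
d <- as.dist(1 - cor(ref_profiles, method = "spearman"))

# Average linkage, matching clust_method = "average"
tree <- hclust(d, method = "average")

# Cutting the tree into two groups shows the first (top) split
first_split <- cutree(tree, k = 2)
```

In this toy reference the top split separates the two lymphoid profiles from the two stromal ones, which is the kind of hierarchy the classification tree is built from.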

n_genes

The number of genes used in every step. Only used if fix_ngenes = TRUE

pc_thresh

when gs_method = "wilcox": only genes are selected for which more than a pc_thresh fraction of a reference group of cells express that gene

p_thresh

when gs_method = "wilcox": only genes are selected that have a p-value < p_thresh

fc_thresh

when gs_method = "wilcox", or gs_method = "fc" and fix_ngenes = FALSE: only genes are selected that have a log2 fold-change > fc_thresh between two reference groups.
If this mode is selected, the reference must be in log2 space.
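
The three "wilcox" filters (fc_thresh, pc_thresh and p_thresh) can be sketched in base R as follows. The helper select_genes_wilcox and the toy matrices are hypothetical. The sketch assumes log2-scale expression, that a value above 0 counts as "expressed", and that the expression fraction is taken in the first group; CHETAH's own implementation may differ on each of these points.

```r
# Hypothetical helper: genes higher in group a than in group b that pass
# all three filters. expr_a and expr_b are genes x cells matrices (log2).
select_genes_wilcox <- function(expr_a, expr_b, fc_thresh = 1.5,
                                pc_thresh = 0.2, p_thresh = 0.05) {
  # Difference of log2 means, used as the log2 fold-change
  lfc <- rowMeans(expr_a) - rowMeans(expr_b)
  # Fraction of group-a cells with expression above 0 (assumption)
  pct <- rowMeans(expr_a > 0)
  # Two-sided Wilcoxon rank-sum test per gene
  pval <- vapply(seq_len(nrow(expr_a)), function(i) {
    wilcox.test(expr_a[i, ], expr_b[i, ], exact = FALSE)$p.value
  }, numeric(1))
  rownames(expr_a)[lfc > fc_thresh & pct > pc_thresh & pval < p_thresh]
}

# Toy data: 3 genes, 8 cells per group
expr_a <- rbind(geneA = c(5, 6, 5, 7, 6, 5, 6, 7),
                geneB = c(1, 0, 2, 1, 0, 1, 2, 1),
                geneC = c(0, 0, 0, 0, 0, 0, 0, 16))
expr_b <- rbind(geneA = c(1, 0, 1, 2, 0, 1, 1, 0),
                geneB = c(1, 1, 0, 2, 1, 0, 1, 2),
                geneC = rep(0, 8))

selected <- select_genes_wilcox(expr_a, expr_b)
```

Here geneB fails the fold-change filter, and geneC has a large mean difference but is expressed in only one of eight cells, so it fails the pc_thresh filter.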

subsample

to prevent reference types with a lot of cells from dominating the gene selection, types with more than subsample cells are subsampled
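
A minimal sketch of this kind of per-type capping in base R is given below. The helper subsample_types and the toy labels are hypothetical and only illustrate the idea; they are not taken from the package.

```r
# Hypothetical helper: cap the number of cells per reference type.
# `celltypes` holds one label per reference cell; the function returns
# the indices of the cells to keep.
subsample_types <- function(celltypes, max_cells) {
  idx_by_type <- split(seq_along(celltypes), celltypes)
  kept <- lapply(idx_by_type, function(idx) {
    # Only types with more than max_cells cells are subsampled
    if (length(idx) > max_cells) sample(idx, max_cells) else idx
  })
  sort(unlist(kept, use.names = FALSE))
}

set.seed(42)  # reproducible toy example
celltypes <- rep(c("Tcell", "Bcell", "Fibro"), times = c(500, 40, 120))
keep <- subsample_types(celltypes, max_cells = 100)
counts <- table(celltypes[keep])
```

After capping at 100 cells, the large types contribute 100 cells each while the small type keeps all 40 of its cells.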

fix_ngenes

when gs_method = "fc": use a fixed number of genes for all correlations. When gs_method = "wilcox": use a maximum number of genes per step. When fix_ngenes = FALSE and gs_method = "fc", fc_thresh is used to define the fold-change cut-off for gene selection.
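
The two "fc" modes described here can be sketched in base R. The helper select_genes_fc and the toy profiles are hypothetical. The sketch assumes log2-scale mean profiles and ranks genes by absolute fold-change, so that both higher and lower expressed genes can be picked; it is not CHETAH's internal code.

```r
# Hypothetical helper: select genes that separate two reference groups.
# mean_a and mean_b are named vectors of mean log2 expression per gene.
select_genes_fc <- function(mean_a, mean_b, n_genes = 200,
                            fc_thresh = 1.5, fix_ngenes = TRUE) {
  # Difference of log2 means, used as the log2 fold-change
  lfc <- mean_a - mean_b
  if (fix_ngenes) {
    # Fixed number of genes with the largest absolute fold-change
    ord <- order(abs(lfc), decreasing = TRUE)
    names(lfc)[head(ord, n_genes)]
  } else {
    # All genes whose absolute fold-change exceeds the cut-off
    names(lfc)[abs(lfc) > fc_thresh]
  }
}

# Toy data: 5 genes, two group profiles (log2 scale)
group_a <- c(g1 = 5.0, g2 = 1.0, g3 = 3.0, g4 = 0.5, g5 = 2.0)
group_b <- c(g1 = 1.0, g2 = 1.2, g3 = 0.5, g4 = 2.5, g5 = 2.1)

top2  <- select_genes_fc(group_a, group_b, n_genes = 2)
above <- select_genes_fc(group_a, group_b, fix_ngenes = FALSE)
```

With fix_ngenes = TRUE the number of genes is the same in every step, whereas the threshold mode lets the number of genes vary with how different the two groups are.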

plot.tree

Plot the classification tree.

only_pos

not recommended: for each reference type, only use genes that are more highly expressed in that type than in the others in that node.

print_steps

whether the number of genes (positive and negative) per step per ref_cell_type should be printed

Details

CHETAH will hierarchically cluster reference data to produce a classification tree (ct). In each node of the ct, CHETAH will assign each input cell to one of the two branches, based on gene selections, correlations and calculation of profile and confidence scores. The assignment will only be performed if the confidence score for such an assignment is higher than the Confidence Threshold. If this is not the case, classification for the cell will stop in the current node. Some input cells will reach the leaf nodes of the ct (the pre-defined cell types); these classifications are called final types. For other cells, assignment will stop in a node. These classifications are called intermediate types.
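
The stepwise descent can be illustrated with a deliberately simplified base-R sketch. Everything in it is hypothetical: the toy tree, the helper names, and in particular the "confidence", which is just the difference between the Spearman correlations to the two branch profiles. CHETAH's actual profile and confidence scores are computed differently, and gene selection per node is left out.

```r
# Toy tree: leaves carry a mean `profile`, inner nodes carry two branches
leaf <- function(name, profile) list(name = name, profile = profile)
node <- function(name, left, right) {
  list(name = name, left = left, right = right)
}

# Profile of a branch: a leaf's own profile, or the mean of its two children
branch_profile <- function(tr) {
  if (!is.null(tr$profile)) return(tr$profile)
  (branch_profile(tr$left) + branch_profile(tr$right)) / 2
}

classify_cell <- function(cell, tr, thresh = 0.1) {
  while (is.null(tr$profile)) {  # descend until a leaf is reached
    s_left  <- cor(cell, branch_profile(tr$left),  method = "spearman")
    s_right <- cor(cell, branch_profile(tr$right), method = "spearman")
    # Simplified stand-in for the confidence score (assumption)
    if (abs(s_left - s_right) <= thresh) return(tr$name)  # intermediate type
    tr <- if (s_left > s_right) tr$left else tr$right
  }
  tr$name  # final type
}

tree <- node("Node1",
             node("Node2",
                  leaf("T cell",  c(9, 7, 1, 2, 6, 3)),
                  leaf("NK cell", c(8, 6, 2, 1, 3, 7))),
             leaf("Fibroblast", c(1, 2, 9, 8, 4, 5)))

clear_cell     <- classify_cell(c(10, 8, 0, 1, 7, 2), tree)
ambiguous_cell <- classify_cell(c(9, 7, 2, 1, 4.5, 5), tree)
forced_cell    <- classify_cell(c(9, 7, 2, 1, 4.5, 5), tree, thresh = 0)
```

With these toy profiles the first cell reaches the leaf "T cell" (a final type), while the second stops at "Node2" (an intermediate type) because the two branches score too similarly. Setting thresh = 0 pushes the second cell on to "NK cell", which mirrors how a lower confidence threshold yields more final types.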

Value

A SingleCellExperiment with added:
- input$celltype_CHETAH: a named character vector that can directly be used in any other workflow/method.
- "hidden" 'int_colData' and 'int_metadata', not meant for direct interaction, but which can be viewed and interacted with using 'PlotCHETAH' and 'CHETAHshiny'.

A list of objects is added to input$int_metadata$CHETAH.

A nested DataFrame is added to input$int_colData$CHETAH. It holds 3 top-level DataFrames.

Examples

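## Load the package, which provides the example datasets used below
library(CHETAH)
## Melanoma data from Tirosh et al. (2016) Science
input_mel
## Head-Neck data from Puram et al. (2017) Cancer Cell
headneck_ref
## Classify the melanoma cells against the head-neck reference
input_mel <- CHETAHclassifier(input = input_mel, ref_cells = headneck_ref)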

CHETAH documentation built on Nov. 8, 2020, 8:02 p.m.