CHETAHclassifier: Identification of cell types aided by hierarchical clustering
In CHETAH: Fast and accurate scRNA-seq cell type identification

CHETAH classifies an input dataset by comparing it to a reference dataset in a stepwise, top-to-bottom fashion. See 'details' for a full explanation. NOTE: We recommend using the default values for all parameters.

CHETAHclassifier(input, ref_cells = NULL, ref_profiles = NULL,
  ref_ct = "celltypes", input_c = NA, ref_c = NA, thresh = 0.1,
  gs_method = c("fc", "wilcox"), cor_method = c("spearman", "kendall",
  "pearson", "cosine"), clust_method = c("average", "single", "complete",
  "ward.D2", "ward.D", "mcquitty", "median", "centroid"),
  clust_dist = bioDist::spearman.dist, n_genes = 200,
  pc_thresh = 0.2, p_thresh = 0.05, fc_thresh = 1.5,
  subsample = FALSE, fix_ngenes = TRUE, plot.tree = FALSE,
  only_pos = FALSE, print_steps = FALSE)

`input`	required: an input SingleCellExperiment (see Bioconductor and the vignette: `browseVignettes("CHETAH")`).
`ref_cells`	required: a reference SingleCellExperiment, with the cell types in the "celltypes" colData (or in the column defined by `ref_ct`).
`ref_profiles`	optional: in the case of bulk RNA-seq or microarrays, an expression matrix with one (average) reference expression profile per cell type in the columns (`ref_cells` must then be left empty).
`ref_ct`	the colData of `ref_cells` where the cell types are stored.
`input_c`	the name of the assay of the input to use. `NA` (default) will use the first one.
`ref_c`	same as `input_c`, but for the reference.
`thresh`	the initial confidence threshold, which can be changed after running with `Classify`.
`gs_method`	method for gene selection, applied in every node of the tree. "fc" (default, quick method): either a fixed number (`n_genes`) of genes with the highest fold-change is selected, or, when `fix_ngenes = FALSE`, all genes with a fold-change higher than `fc_thresh` are selected. "wilcox": genes are selected based on fold-change (`fc_thresh`), percentage of expression (`pc_thresh`) and p-value (`p_thresh`); the p-values come from the Wilcoxon test.
`cor_method`	the correlation measure: one of: "spearman" (default), "kendall", "pearson", "cosine"
`clust_method`	the method used for clustering the reference profiles. One of the methods from `hclust`
`clust_dist`	a distance measure, default: `spearman.dist`
`n_genes`	The number of genes used in every step. Only used if `fix_ngenes = TRUE`
`pc_thresh`	when gs_method = "wilcox": only genes are selected that are expressed in more than a `pc_thresh` fraction of the cells of a reference group.
`p_thresh`	when gs_method = "wilcox": only genes are selected that have a p-value < `p_thresh`.
`fc_thresh`	when gs_method = "wilcox", or gs_method = "fc" and fix_ngenes = FALSE: only genes are selected that have a log2 fold-change > `fc_thresh` between two reference groups. If this mode is selected, the reference must be in log2 space.
`subsample`	to prevent reference types with many cells from dominating the gene selection, subsample types with more than `subsample` cells.
`fix_ngenes`	when gs_method = "fc": use a fixed number of genes for all correlations. When gs_method = "wilcox": use a maximum number of genes per step. When `fix_ngenes = FALSE` and `gs_method = "fc"`, `fc_thresh` defines the fold-change cut-off for gene selection.
`plot.tree`	Plot the classification tree.
`only_pos`	not recommended: for each reference type, only use genes that are more highly expressed in that type than in the other types in that node.
`print_steps`	whether the number of genes (positive and negative) per step per reference cell type should be printed.
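
The two "fc" selection modes described under `gs_method`, `n_genes`, `fc_thresh` and `fix_ngenes` can be illustrated with a minimal base-R sketch. This is not CHETAH's internal code: the helper `select_genes_fc`, the toy expression values, and the choice to rank genes by absolute log2 fold-change (so that both positive and negative genes can be picked) are assumptions made for illustration only.

```r
# Toy log2 mean expression for two reference groups (hypothetical values)
set.seed(1)
genes <- paste0("gene", 1:50)
group_a <- setNames(rnorm(50, mean = 5), genes)
group_b <- setNames(rnorm(50, mean = 5), genes)

# In log2 space, the log2 fold-change is a simple difference
log2_fc <- group_a - group_b

# Hypothetical helper mimicking the two "fc" modes (assumption, not CHETAH code)
select_genes_fc <- function(log2_fc, n_genes = 200, fc_thresh = 1.5,
                            fix_ngenes = TRUE) {
  # Rank genes by absolute fold-change, largest first
  ranked <- sort(abs(log2_fc), decreasing = TRUE)
  if (fix_ngenes) {
    # Fixed number of genes: take the top n_genes
    names(head(ranked, n_genes))
  } else {
    # Threshold mode: keep every gene above the cut-off
    names(ranked[ranked > fc_thresh])
  }
}

fixed     <- select_genes_fc(log2_fc, n_genes = 10)
by_thresh <- select_genes_fc(log2_fc, fc_thresh = 1.5, fix_ngenes = FALSE)
```

With `fix_ngenes = TRUE` every node uses the same number of genes, whereas the threshold mode can return very different numbers of genes per node, which is one reason the reference has to be in log2 space for the cut-off to be meaningful.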

CHETAH hierarchically clusters the reference data to produce a classification tree (ct). In each node of the ct, CHETAH assigns each input cell to one of the two branches, based on gene selections, correlations, and the calculation of profile and confidence scores. The assignment is only performed if its confidence score is higher than the Confidence Threshold. If this is not the case, classification of the cell stops in the current node. Some input cells reach the leaf nodes of the ct (the pre-defined cell types); these classifications are called final types. For other cells, assignment stops in a node; these classifications are called intermediate types.
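
The procedure described above can be sketched in base R on toy data. This is a simplified stand-in, not CHETAH's implementation: the reference profiles are simulated, `1 - cor(..., method = "spearman")` is used in place of `bioDist::spearman.dist`, and the hypothetical helper `assign_at_node` uses the difference between two branch correlations as a crude substitute for the profile and confidence scores.

```r
set.seed(42)
n_genes <- 100

# Hypothetical reference profiles: genes in rows, one column per cell type.
# Three lymphoid types share a base profile; the fibroblast type does not.
base_lymph <- rnorm(n_genes)
base_strom <- rnorm(n_genes)
profiles <- cbind(
  Bcell      = base_lymph + rnorm(n_genes, sd = 0.3),
  Tcell      = base_lymph + rnorm(n_genes, sd = 0.3),
  NKcell     = base_lymph + rnorm(n_genes, sd = 0.3),
  Fibroblast = base_strom + rnorm(n_genes, sd = 0.3)
)

# Cluster the reference profiles into a classification tree
d    <- as.dist(1 - cor(profiles, method = "spearman"))
tree <- hclust(d, method = "average")

# The two branches under the top node of the tree
branches <- cutree(tree, k = 2)

# Hypothetical decision rule for one node (assumption, not CHETAH's scoring):
# correlate the cell with the mean profile of each branch and only assign
# the cell when the difference exceeds the confidence threshold.
assign_at_node <- function(cell, profiles, branches, thresh = 0.1) {
  branch_means <- sapply(sort(unique(branches)), function(b) {
    rowMeans(profiles[, branches == b, drop = FALSE])
  })
  cors <- cor(cell, branch_means, method = "spearman")[1, ]
  confidence <- abs(cors[1] - cors[2])
  if (confidence > thresh) unname(which.max(cors)) else NA_integer_
}

# An input cell that resembles the fibroblast reference
cell <- profiles[, "Fibroblast"] + rnorm(n_genes, sd = 0.3)

assigned <- assign_at_node(cell, profiles, branches, thresh = 0.1)
stopped  <- assign_at_node(cell, profiles, branches, thresh = 2)
```

Here `assigned` is the branch that contains the fibroblast reference, while `stopped` is `NA`: a difference between two correlations can never exceed 2, so with that threshold the cell stays in the current node, which corresponds to an intermediate type.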

A SingleCellExperiment with added:
- input$celltype_CHETAH: a named character vector that can directly be used in any other workflow/method.
- "hidden" 'int_colData' and 'int_metadata', not meant for direct interaction, but which can all be viewed and interacted with using 'PlotCHETAH' and 'CHETAHshiny'.

A list containing the following objects is added to input$int_metadata$CHETAH:

classification a named vector: the classified types with the corresponding names of the input cells
tree the hclust object of the classification tree
nodetypes A list with the cell types under each node
nodecoor the coordinates of the nodes of the classification tree
genes A list per node, containing a list per reference type with the genes used for the profile scores of that type
parameters The parameters used

A nested DataFrame is added to input$int_colData$CHETAH. It holds three top-level DataFrames:

prof_scores A list with the profile scores
conf_scores A list with the confidence scores
correlations A list with the correlations of the input cells to the reference profiles
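
Because the confidence scores are stored, a different confidence threshold can be applied afterwards without recomputing correlations. The base-R sketch below illustrates that idea only. The layout of the scores (one confidence value per node along a cell's path from the root) and the helper `depth_reached` are hypothetical and do not reflect the actual structure of `conf_scores` or the internals of `Classify`.

```r
# Hypothetical confidence scores of one cell at successive nodes on its path
# from the root of the classification tree to a leaf (toy values)
conf_path <- c(0.8, 0.4, 0.05)

# Number of assignment steps the cell passes before classification stops.
# An assignment is only made when the score is higher than the threshold.
depth_reached <- function(conf_path, thresh) {
  failed <- which(conf_path <= thresh)
  if (length(failed) > 0) failed[1] - 1 else length(conf_path)
}

depth_default <- depth_reached(conf_path, thresh = 0.1)  # stops before the leaf
depth_lenient <- depth_reached(conf_path, thresh = 0)    # reaches the leaf
depth_strict  <- depth_reached(conf_path, thresh = 0.5)  # stops after one step
```

In this toy case the default threshold yields an intermediate type, a threshold of 0 yields a final type, and a stricter threshold stops the cell higher up in the tree.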

## Melanoma data from Tirosh et al. (2016) Science
input_mel
## Head-Neck data from Puram et al. (2017) Cancer Cell
headneck_ref
input_mel <- CHETAHclassifier(input = input_mel, ref_cells = headneck_ref)