train_cell_classifier: Train a cell type classifier
In cole-trapnell-lab/garnett: Automated cell type classification

Description

This function takes single-cell expression data in the form of a CDS object and a cell type definition file (marker file) and trains a multinomial classifier to assign cell types. The resulting garnett_classifier object can be used to classify the cells in the same dataset, or future datasets from similar tissues/samples.

Usage

train_cell_classifier(cds, marker_file, db, cds_gene_id_type = "ENSEMBL",
  marker_file_gene_id_type = "SYMBOL", min_observations = 8,
  max_training_samples = 500, num_unknown = 500,
  propogate_markers = TRUE, cores = 1, lambdas = NULL,
  classifier_gene_id_type = "ENSEMBL", return_initial_assign = FALSE)

Arguments

`cds`	Input CDS object.
`marker_file`	A character path to the marker file that defines cell types. See Details and the documentation for `Parser` (run `?Parser`) for more information.
`db`	Bioconductor AnnotationDb-class package for converting gene IDs. For example, for humans use org.Hs.eg.db. See available packages at Bioconductor. If your organism does not have an AnnotationDb-class database available, you can specify "none". In that case Garnett will not check or convert gene IDs, so your CDS and marker file must use the same gene ID type.
`cds_gene_id_type`	The type of gene ID used in the CDS. Should be one of the values in `columns(db)`. Default is "ENSEMBL". Ignored if db = "none".
`marker_file_gene_id_type`	The type of gene ID used in the marker file. Should be one of the values in `columns(db)`. Default is "SYMBOL". Ignored if db = "none".
`min_observations`	An integer. The minimum number of representative cells per cell type required to include the cell type in the predictive model. Default is 8.
`max_training_samples`	An integer. The maximum number of representative cells per cell type to be included in the model training. Decreasing this number increases speed, but may hurt performance of the model. Default is 500.
`num_unknown`	An integer. The number of unknown type cells to use as an outgroup during classification. Default is 500.
`propogate_markers`	Logical. Should markers from child nodes of a cell type be used in finding representatives of the parent type? Should generally be `TRUE`.
`cores`	An integer. The number of cores to use for computation.
`lambdas`	`NULL` or a numeric vector. Allows the user to pass their own lambda values to `cv.glmnet`. If `NULL`, preset lambda values are used.
`classifier_gene_id_type`	The type of gene ID that will be used in the classifier. If possible for your organism, this should be "ENSEMBL", which is the default. Ignored if db = "none".
`return_initial_assign`	Logical indicating whether an initial assignment data frame for the root level should be returned instead of a classifier. This can be useful while choosing/debugging markers. Note that no classifier is built in this case, so you cannot move on to the next steps of the workflow until you rerun the function with `return_initial_assign = FALSE`. Default is `FALSE`.
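
As a minimal sketch of the `lambdas` argument, a custom grid can be built in base R before calling the function. The grid bounds (1 down to 1e-4) and its length are illustrative assumptions, not Garnett's preset values, and `my_cds` and `my_marker_file` in the commented call are hypothetical placeholders. glmnet generally expects a user-supplied lambda sequence in decreasing order.

```r
# Hypothetical lambda grid: 50 log-spaced values from 1 down to 1e-4.
# These bounds are illustrative and are not Garnett's preset values.
custom_lambdas <- exp(seq(log(1), log(1e-4), length.out = 50))

# glmnet works through lambda values from largest to smallest,
# so confirm the grid is strictly decreasing before passing it on.
stopifnot(all(diff(custom_lambdas) < 0))

# The grid would then be supplied through the `lambdas` argument, e.g.:
# classifier <- train_cell_classifier(cds = my_cds,
#                                     marker_file = my_marker_file,
#                                     db = org.Hs.eg.db,
#                                     lambdas = custom_lambdas)
```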

Details

This function has three major parts: 1) parsing the marker file, 2) choosing cell representatives, and 3) training the classifier. Details on each of these steps are below:

Parsing the marker file: the first step of this function is to parse the provided marker file. The marker file is a representation of the cell types expected in the data and known characteristics about them. Information about marker file syntax is available in the documentation for the Parser function, and on the Garnett website.
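
To make the marker file concrete, the sketch below writes a small one to a temporary path from R. It assumes the basic syntax described on the Garnett website (a `>` line naming each cell type, followed by lines such as `expressed:` and `subtype of:`). The cell types and genes are illustrative choices, and `example_marker_path` is a hypothetical name.

```r
# Illustrative marker definitions; the cell types and genes are examples only.
marker_lines <- c(
  ">B cells",
  "expressed: CD79A, MS4A1",
  "",
  ">T cells",
  "expressed: CD3D, CD3E",
  "",
  ">CD4 T cells",
  "expressed: CD4",
  "subtype of: T cells"
)

# Write the definitions to a temporary file and keep its path.
example_marker_path <- tempfile(fileext = ".txt")
writeLines(marker_lines, example_marker_path)

# Each cell type entry starts with a line beginning with ">".
cell_types <- sub("^>", "", grep("^>", marker_lines, value = TRUE))
print(cell_types)
```

The resulting path could then be passed as `marker_file`. Consult `?Parser` for the full set of supported fields.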

Choosing cell representatives: after parsing the marker file, this function identifies cells that fit the parameters specified in the file for each cell type. Depending on how marker genes and other cell type definition information are specified, expression data is normalized and expression cutoffs are defined automatically. In addition to the cell types in the marker file, an outgroup of diverse cells is also chosen.
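
The role of `min_observations` and `max_training_samples` in this step can be illustrated with a base R sketch. The toy assignments and the helper `select_representatives` are hypothetical. This is not Garnett's internal code, only an illustration of the thresholds described in the Arguments section.

```r
# Toy initial assignments: a named vector mapping cell IDs to putative types.
set.seed(1)
assignments <- c(rep("B cells", 12), rep("T cells", 30), rep("NK cells", 3))
names(assignments) <- paste0("cell_", seq_along(assignments))

# Hypothetical helper: drop types with too few representatives and
# downsample types with more than `max_training_samples` cells.
select_representatives <- function(assignments, min_observations = 8,
                                   max_training_samples = 500) {
  by_type <- split(names(assignments), assignments)
  by_type <- by_type[lengths(by_type) >= min_observations]
  lapply(by_type, function(cells) {
    if (length(cells) > max_training_samples) {
      sample(cells, max_training_samples)
    } else {
      cells
    }
  })
}

# With a cap of 20, "NK cells" (3 cells) is dropped and "T cells" is downsampled.
reps <- select_representatives(assignments, min_observations = 8,
                               max_training_samples = 20)
print(lengths(reps))
```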

Training the classifier: lastly, this function trains a multinomial GLMnet classifier on the chosen representative cells.

Because cell types can be defined hierarchically (i.e. cell types can be subtypes of other cell types), steps 2 and 3 above are performed iteratively over all internal nodes in the tree representation of cell types.
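
The iteration over internal nodes can be pictured with a small base R sketch. The parent-child table and the `train_node` placeholder are hypothetical and only show which nodes are visited. Garnett's own tree representation may differ.

```r
# Hypothetical hierarchy: each cell type and its parent ("root" is the top).
hierarchy <- data.frame(
  type   = c("T cells", "B cells", "CD4 T cells", "CD8 T cells"),
  parent = c("root", "root", "T cells", "T cells"),
  stringsAsFactors = FALSE
)

# Internal nodes are those that appear as a parent of at least one type.
internal_nodes <- unique(hierarchy$parent)

# Placeholder for steps 2 and 3: here it only records which child types
# a classifier at this node would need to distinguish.
train_node <- function(node, hierarchy) {
  hierarchy$type[hierarchy$parent == node]
}

# Nodes are visited in order of first appearance, so the root comes first here.
classifiers <- lapply(internal_nodes, train_node, hierarchy = hierarchy)
names(classifiers) <- internal_nodes
print(classifiers)
```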

See the Garnett website and the accompanying paper for further details.

Examples

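library(garnett)
library(org.Hs.eg.db)

# Load the test CDS and fix the random seed for reproducibility
data(test_cds)
set.seed(260)

# Path to an example marker file bundled with garnett
marker_file_path <- system.file("extdata", "pbmc_bad_markers.txt",
                                package = "garnett")

# Train the classifier; both the CDS and the marker file use gene symbols
test_classifier <- train_cell_classifier(cds = test_cds,
                                         marker_file = marker_file_path,
                                         db = org.Hs.eg.db,
                                         min_observations = 10,
                                         cds_gene_id_type = "SYMBOL",
                                         num_unknown = 50,
                                         marker_file_gene_id_type = "SYMBOL")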

cole-trapnell-lab/garnett documentation built on Jan. 6, 2025, 2:18 p.m.