Parser: Parsing the Garnett marker file

Parser    R Documentation

Description

Garnett uses a marker file to allow users to specify cell type definitions. While the marker file is designed to be easy to construct and human-readable, it is parsed by Garnett automatically, and so it needs to follow certain formatting constraints.

Usage

Parser

Format

An object of class R6ClassGenerator of length 24.

Details

The following describes the constraints necessary in the input to the marker_file argument of train_cell_classifier and check_markers.

Elements of a cell type description

The basic structure of the Garnett marker file is a series of entries, each describing elements of a cell type. After the cell name, each additional line will be a descriptor, which begins with a keyword, followed by a colon (':'). After the colon, a series of specifications can be added, separated by commas (','). Descriptors may spill onto following lines so long as you do not split a specification across multiple lines (i.e. if breaking up a long descriptor across multiple lines, all but the last line should end with a comma). Each new descriptor should begin on a new line. A generic cell type entry looks like this:

```
> cell type name
descriptor: spec1, spec2, spec3, spec4
descriptor2: spec1
```
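As a concrete illustration of the generic layout, the following sketch writes a small marker file from R. The cell type names are taken from the examples on this page; the gene symbols and the temporary file path are illustrative assumptions, not recommended markers. The first entry also shows a descriptor continued onto a second line.

```r
# Illustrative marker file content. Gene symbols are placeholders chosen for
# this example only.
marker_lines <- c(
  "> T cell",
  "expressed: CD3E,",   # a descriptor may continue on the next line ...
  "CD3D",               # ... if every line but the last ends with a comma
  "> CD4 T cell",
  "expressed: CD4",
  "subtype of: T cell"
)

# Hypothetical location; any writable path works
marker_file_path <- tempfile(fileext = ".txt")
writeLines(marker_lines, marker_file_path)

# Show the file content that would be passed as the marker_file argument
cat(readLines(marker_file_path), sep = "\n")
```

The resulting path can then be passed as the marker_file argument of train_cell_classifier or check_markers.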

The following are the potential descriptors:

cell name

Required. Each cell type must have a unique name, and the name should head the cell type description. To indicate a new cell type, use the > symbol, followed by the cell name, followed by a new line. For example, > T cell.

expressed:

Required. After the cell name, the minimal requirement for each cell type is the name of a single marker gene. The line in the marker file begins with expressed:, followed by one or more gene names separated by commas. The last gene name of the descriptor is not followed by a comma. Gene IDs can be of any type (ENSEMBL, SYMBOL, etc.) that is present in the Bioconductor AnnotationDb-class package for your species (see the available packages on the Bioconductor website). For example, for human, use org.Hs.eg.db. To see the available gene ID types, run columns(db). You specify which gene ID type you used when calling train_cell_classifier.

If your species does not have an annotation dataset of class AnnotationDb-class, you can set db = 'none'. Garnett will then not convert gene ID types, so the CDS and the marker file must use the same gene ID type.
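A minimal sketch of listing the available gene ID types for human follows. It assumes the org.Hs.eg.db annotation package is installed and does nothing otherwise; columns() is the AnnotationDbi function referred to above.

```r
# Runs only if the human annotation package is installed
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  db <- org.Hs.eg.db::org.Hs.eg.db
  # List the gene ID types that can be used in the marker file
  print(AnnotationDbi::columns(db))
}
```

Any ID type in the printed list, such as SYMBOL or ENSEMBL, can be used for the gene names in the marker file, provided you report the same type when calling train_cell_classifier.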

not expressed:

In addition to specifying genes that the cell type should express, you can also specify genes that your cell type should not express. Details on specifying genes are the same as for expressed:.

subtype of:

When present, this descriptor specifies that a cell type is a subtype of another cell type that is also described in the marker file. A biological example would be a CD4 T cell being a subtype of a T cell. This descriptor causes the cell type to be classified on a separate sub-level of the classification hierarchy, after the classification of its parent type is done (i.e. first T cells are discriminated from other cell types, then the T cells are subclassified using any cell types with the descriptor subtype of: T cell). subtype of: can only include a single specification, and the specification must be the exact name of another cell type specified in this marker file.
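Because the parent name must match exactly, it can help to verify the names before training. The following base R sketch is a hypothetical pre-check, not Garnett's own parser. It assumes that surrounding whitespace is not significant, and the gene symbols are placeholders.

```r
marker_lines <- c(
  "> T cell",
  "expressed: CD3E, CD3D",
  "> CD4 T cell",
  "expressed: CD4",
  "subtype of: T cell"
)

# Cell type names: lines starting with ">", with the symbol and padding removed
is_name <- grepl("^>", marker_lines)
cell_names <- trimws(sub("^>", "", marker_lines[is_name]))

# Parent names: the single specification after "subtype of:"
is_subtype <- grepl("^subtype of:", marker_lines)
parents <- trimws(sub("^subtype of:", "", marker_lines[is_subtype]))

# Parents that do not exactly match a declared cell type name
missing_parents <- setdiff(parents, cell_names)
print(missing_parents)
```

An empty result means every subtype of: specification names a cell type that is defined in the same file.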

references:

This descriptor is not required, but is highly recommended. The specifications for this descriptor should be links/DOIs documenting how you chose your marker genes. While these specifications will not influence cell type classification, they will be packaged with the built classifier so that future users of the classifier can trace the origins of the markers.

*meta data:

This wildcard descriptor allows you to specify any other property of a cell type that you wish to specify. The keyword will be the name of the column in your pData (meta data) table that you wish to specify, and the specifications will be a list of acceptable values for that meta data. An example use of this would be tissue: liver, kidney, which would specify that training cells for this cell type must have "liver" or "kidney" as their entry in the "tissue" column of the pData table.
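The effect of the tissue: liver, kidney example can be pictured with a small base R sketch. The data frame below is a hypothetical stand-in for a pData table, and the filtering shown only illustrates the rule described above, not Garnett's internal code.

```r
# Hypothetical stand-in for the pData table of a CDS
pdata <- data.frame(
  cell   = c("c1", "c2", "c3", "c4"),
  tissue = c("liver", "lung", "kidney", "liver"),
  stringsAsFactors = FALSE
)

# Marker file line:  tissue: liver, kidney
acceptable <- c("liver", "kidney")

# Cells that remain eligible as training cells under this descriptor
eligible <- pdata$tissue %in% acceptable
print(pdata$cell[eligible])
```

Here the cell from "lung" is excluded, and the three cells from "liver" or "kidney" remain candidates.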

expressed below:

While we recommend that you use expressed: and not expressed: to specify the cell type's marker genes, because these terms utilize the entirety of Garnett's built-in normalization and standardization, you can also specify expression using the following logical descriptors: expressed below:, expressed above:, and expressed between:. Note that no normalization occurs with these descriptors; they are used as logical gates only. To specify expressed below:, use the gene name, followed by a space, followed by a number. This will only allow training cells that have this gene expressed below the given value, in the units of the expression matrix provided. For example, expressed below: MYOD1 7, MYH3 2.

expressed above:

Similar to expressed below:, but will only allow training cells expressing the given gene above the value provided.

expressed between:

Similar to expressed below:, but provide two values separated by a space. For example, expressed between: ACT5 2 5.5, ACT2 1 2.7. This descriptor will only allow training cells expressing the given gene between the two values provided.
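The three logical descriptors can be pictured as simple comparisons on the raw expression values. The sketch below uses hypothetical values for a single gene. Whether Garnett treats the thresholds as strict or inclusive is not stated here, so the strict comparisons are an assumption of this example.

```r
# Hypothetical expression values for one gene across five cells,
# in the units of the expression matrix (no normalization applied)
expr <- c(0, 1.5, 3, 5.5, 8)

# expressed below: GENE 2
below <- expr < 2

# expressed above: GENE 2
above <- expr > 2

# expressed between: GENE 2 5.5
# (strict bounds are an assumption of this sketch)
between <- expr > 2 & expr < 5.5

print(data.frame(expr, below, above, between))
```

Each logical column marks the cells that would pass the corresponding gate and stay available as training cells.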

Checking your marker file

Because only specific expressed markers are useful for Garnett classification, we recommend that you always check your marker file for ambiguity before proceeding with classification. To do this, use the functions check_markers and plot_markers. See the manual pages for those functions for details.
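A typical check might look like the sketch below. It is wrapped in a hypothetical helper, check_marker_file, so that nothing runs until you supply a real CDS, marker file path, and annotation database. The argument names and the "SYMBOL" ID types are assumptions to confirm against the check_markers manual page.

```r
# Hypothetical wrapper: `cds`, `marker_file_path` and `db` come from the caller.
# Argument names are assumed to match the check_markers manual page.
check_marker_file <- function(cds, marker_file_path, db) {
  marker_check <- garnett::check_markers(
    cds, marker_file_path,
    db = db,
    cds_gene_id_type = "SYMBOL",          # assumed ID type of the CDS
    marker_file_gene_id_type = "SYMBOL"   # assumed ID type of the marker file
  )
  # Plot the result of the check for visual inspection
  print(garnett::plot_markers(marker_check))
  invisible(marker_check)
}
```

For a human data set, a call could take the form check_marker_file(my_cds, "markers.txt", org.Hs.eg.db), where my_cds and the file name are placeholders.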

See Also

train_cell_classifier


cole-trapnell-lab/garnett documentation built on Jan. 6, 2025, 2:18 p.m.