Parser | R Documentation |
Garnett uses a marker file to allow users to specify cell type definitions. While the marker file is designed to be easy to construct and human-readable, it is parsed by Garnett automatically, and so it needs to follow certain formatting constraints.
Parser
An object of class R6ClassGenerator of length 24.
The following describes the constraints necessary in the input to the marker_file argument of train_cell_classifier and check_markers.
The basic structure of the Garnett marker file is a series of entries, each describing elements of a cell type. After the cell name, each additional line will be a descriptor, which begins with a keyword, followed by a colon (':'). After the colon, a series of specifications can be added, separated by commas (','). Descriptors may spill onto following lines so long as you do not split a specification across multiple lines (i.e. if breaking up a long descriptor across multiple lines, all but the last line should end with a comma). Each new descriptor should begin on a new line. A generic cell type entry looks like this:
```
> cell type name
descriptor: spec1, spec2,
spec3, spec4
descriptor2: spec1
```
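The structure rules above can be sketched in a few lines of base R. This is only an illustration of the format, not Garnett's actual parser: the variable names are hypothetical, and it assumes that a line ending in a comma always continues on the next line.

```r
# Generic entry from the text, with the first descriptor spilling onto a second line.
marker_lines <- c(
  "> cell type name",
  "descriptor: spec1, spec2,",
  "spec3, spec4",
  "descriptor2: spec1"
)

# Join continuation lines: a line ending in a comma continues on the next line.
joined <- character(0)
for (ln in trimws(marker_lines)) {
  n <- length(joined)
  if (n > 0 && grepl(",$", joined[n])) {
    joined[n] <- paste(joined[n], ln)
  } else {
    joined <- c(joined, ln)
  }
}

# Split each descriptor line into its keyword and its comma-separated specifications.
desc <- joined[!startsWith(joined, ">")]
keys <- trimws(sub(":.*$", "", desc))
specs <- lapply(strsplit(sub("^[^:]*:", "", desc), ","), trimws)
names(specs) <- keys

print(joined)
print(specs)
```

Splitting a specification across lines (for example, breaking a line in the middle of spec2) would defeat the comma rule this sketch relies on, which is why the format forbids it.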
The following are the potential descriptors:
cell name: Required. Each cell type must have a unique name, and the name should head the cell type description. To indicate a new cell type, use the > symbol, followed by the cell name, followed by a new line. For example, > T cell.
expressed: Required. After the cell name, the minimal requirement for each cell type is the name of a single marker gene. The line in the marker file will begin with expressed:, followed by one or more gene names separated by commas. The last gene name of the descriptor is not followed by a comma. Gene IDs can be of any type (ENSEMBL, SYMBOL, etc.) that is present in the Bioconductor AnnotationDb-class package for your species. (See available packages on the Bioconductor website.) For example, for human, use org.Hs.eg.db. To see available gene ID types, you can run columns(db). You will specify which gene ID type you used when calling train_cell_classifier.
If your species does not have an annotation dataset of type AnnotationDb-class, you can set db = 'none'. In that case Garnett will not convert gene ID types, so the CDS and the marker file must use the same gene ID type.
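As a minimal sketch of the columns(db) step for human data, the following lists the gene ID types offered by org.Hs.eg.db. It assumes the Bioconductor packages org.Hs.eg.db and AnnotationDbi are installed; if they are not, the guard leaves the (hypothetical) variable id_types empty rather than failing.

```r
# Gene ID types available for human; stays empty if the packages are not installed.
id_types <- character(0)
if (requireNamespace("org.Hs.eg.db", quietly = TRUE) &&
    requireNamespace("AnnotationDbi", quietly = TRUE)) {
  id_types <- AnnotationDbi::columns(org.Hs.eg.db::org.Hs.eg.db)
}
print(id_types)
```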
not expressed: In addition to specifying genes that the cell type should express, you can also specify genes that your cell type should not express. Details on specifying genes are the same as for expressed:.
subtype of: When present, this descriptor specifies that a cell type is a subtype of another cell type that is also described in the marker file. A biological example would be a CD4 T cell being a subtype of a T cell. This descriptor causes the cell type to be classified on a separate sub-level of the classification hierarchy, after the classification of its parent type is done (i.e. first T cells are discriminated from other cell types, then the T cells are subclassified using any cell types with the descriptor subtype of: T cell). subtype of: can only include a single specification, and the specification must be the exact name of another cell type specified in this marker file.
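The following base R sketch writes a small marker file that uses the descriptors covered so far and then checks the two naming rules stated above: cell names are unique, and every subtype of: specification exactly matches a cell name in the same file. The gene names are illustrative placeholders rather than curated marker recommendations, and the check is a hypothetical helper, not part of Garnett.

```r
# Illustrative entries: a parent type and one subtype.
marker_lines <- c(
  "> T cell",
  "expressed: CD3D, CD3E",
  "not expressed: CD19",
  "> CD4 T cell",
  "expressed: CD4",
  "subtype of: T cell"
)

# Write the entries to a temporary file, one line per element.
marker_path <- tempfile(fileext = ".txt")
writeLines(marker_lines, marker_path)

# Collect declared cell names and the parents named by 'subtype of:' lines.
txt <- trimws(readLines(marker_path))
cell_names <- trimws(sub("^>", "", txt[startsWith(txt, ">")]))
parents <- trimws(sub("^subtype of:", "", txt[startsWith(txt, "subtype of:")]))

# Parents that do not exactly match any declared cell name (should be empty).
missing_parents <- setdiff(parents, cell_names)
print(missing_parents)

# Cell names must be unique.
stopifnot(!anyDuplicated(cell_names))
```

The path in marker_path is the kind of value that would be passed as the marker_file argument of train_cell_classifier or check_markers.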
references: This descriptor is not required, but is highly recommended. The specifications for this descriptor should be links/DOIs documenting how you chose your marker genes. While these specifications will not influence cell type classification, they will be packaged with the built classifier so that future users of the classifier can trace the origins of the markers.
custom meta data: This wildcard descriptor allows you to specify any other property of a cell type. The keyword will be the name of the column in your pData (meta data) table that you wish to constrain, and the specifications will be a list of acceptable values for that meta data. An example use of this would be tissue: liver, kidney, which would specify that training cells for this cell type must have "liver" or "kidney" as their entry in the "tissue" column of the pData table.
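The effect of tissue: liver, kidney can be pictured as a simple membership test on the meta data table. The sketch below uses a toy data frame in place of a real pData table; the cell names and values are made up for illustration, and this is not Garnett's internal code.

```r
# Toy stand-in for a pData table with a "tissue" column.
pdata <- data.frame(
  tissue = c("liver", "lung", "kidney", "liver"),
  row.names = c("cell1", "cell2", "cell3", "cell4"),
  stringsAsFactors = FALSE
)

# Specifications from the descriptor 'tissue: liver, kidney'.
allowed <- c("liver", "kidney")

# Cells whose tissue entry is one of the acceptable values.
eligible <- rownames(pdata)[pdata$tissue %in% allowed]
print(eligible)
```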
While we recommend that you use expressed: and not expressed: to specify the cell type's marker genes, because these terms utilize the entirety of Garnett's built-in normalization and standardization, you can also specify expression using the following logical descriptors: expressed below:, expressed above:, and expressed between:. Note that no normalization occurs with these descriptors; they are used as logical gates only.
expressed below: To specify expressed below:, use the gene name, followed by a space, followed by a number. This will only allow training cells that have this gene expressed below the given value in the units of the expression matrix provided. For example, expressed below: MYOD1 7, MYH3 2.
expressed above: Similar to expressed below:, but will only allow training cells expressing the given gene above the value provided.
expressed between: Similar to expressed below:, but provide two values separated by a space. For example, expressed between: ACT5 2 5.5, ACT2 1 2.7. This descriptor will only allow training cells expressing the given gene between the two values provided.
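The three logical descriptors behave like threshold filters on the raw expression matrix. The base R sketch below applies them to a toy matrix (genes in rows, cells in columns). It rests on two assumptions not stated in the text: the bounds are treated as strict, and multiple specifications in one descriptor must all hold. The expressed above: threshold is hypothetical, since the text gives no example for it.

```r
# Toy expression matrix in the units the thresholds refer to.
expr <- matrix(
  c(3, 9, 1,    # MYOD1
    1, 0, 4,    # MYH3
    4, 3, 6),   # ACT5
  nrow = 3, byrow = TRUE,
  dimnames = list(c("MYOD1", "MYH3", "ACT5"), c("cell1", "cell2", "cell3"))
)

# expressed below: MYOD1 7, MYH3 2 (assumed: both conditions must hold)
below <- expr["MYOD1", ] < 7 & expr["MYH3", ] < 2

# expressed above: MYH3 2 (hypothetical threshold for illustration)
above <- expr["MYH3", ] > 2

# expressed between: ACT5 2 5.5
between <- expr["ACT5", ] > 2 & expr["ACT5", ] < 5.5

# Names of the cells that pass each gate.
print(names(which(below)))
print(names(which(above)))
print(names(which(between)))
```

Because no normalization is applied, the same thresholds select different cells if the matrix is supplied in different units (for example, raw counts versus log-transformed values).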
Because only specific expressed markers are useful for Garnett classification, we recommend that you always check your marker file for ambiguity before proceeding with classification. To facilitate this, we provide the functions check_markers and plot_markers. See the manual pages for those functions for details.
train_cell_classifier