check_markers: Check marker file

View source: R/utils.R

check_markers    R Documentation

Check marker file

Description

Check the markers chosen for the marker file and generate a table of useful statistics. The output of this function can be fed into plot_markers to generate a diagnostic plot.

Usage

check_markers(cds, marker_file, db, cds_gene_id_type = "SYMBOL",
  marker_file_gene_id_type = "SYMBOL", propogate_markers = TRUE,
  use_tf_idf = TRUE, classifier_gene_id_type = "ENSEMBL")

Arguments

cds

Input CDS object.

marker_file

A character path to the marker file that defines the cell types. See Details, or run ?Parser, for more information on the marker file format.

db

Bioconductor AnnotationDb-class package for converting gene IDs. For example, for humans use org.Hs.eg.db. See available packages at Bioconductor. If your organism does not have an AnnotationDb-class database available, you can specify "none". In that case Garnett will not check or convert gene IDs, so your CDS and marker file must use the same gene ID type.

cds_gene_id_type

The type of gene ID used in the CDS. Should be one of the values in columns(db). Default is "SYMBOL", as shown in Usage. Ignored if db = "none".

marker_file_gene_id_type

The type of gene ID used in the marker file. Should be one of the values in columns(db). Default is "SYMBOL". Ignored if db = "none".

propogate_markers

Logical. Should markers from child nodes of a cell type be used in finding representatives of the parent type? Should generally be TRUE.

use_tf_idf

Logical. Should a TF-IDF matrix be calculated during estimation? If TRUE, estimates will be more accurate, but calculation is slower with very large datasets.

classifier_gene_id_type

The type of gene ID that will be used in the classifier. If possible for your organism, this should be "ENSEMBL", which is the default. Ignored if db = "none".

Details

This function checks the cell type markers in the provided marker file to ensure they are good candidates for use in classification. It works by estimating which cells each marker gene would select and returning statistics for each marker. Note that this function does not take metadata into account when calculating statistics.

The output data.frame has several columns:

marker_gene

Gene name as provided in the marker file

ENSEMBL

The corresponding Ensembl ID derived from db conversion

parent

The parent cell type in the cell type hierarchy - 'root' if top level

cell_type

The cell type the marker belongs to

in_cds

Whether the marker is present in the CDS

nominates

The number of cells the marker is estimated to nominate to the cell type

total_nominated

The total number of cells nominated by all the markers for that cell type

exclusion_dismisses

The number of cells no longer nominated to the cell type if this marker is excluded (i.e. not captured by other markers for the cell type)

inclusion_ambiguates

How many cells become ambiguous (i.e. are nominated to multiple cell types) if this marker is included

most_overlap

The cell type that most often shares this marker (i.e. is the other side of the ambiguity). If inclusion_ambiguates is 0, most_overlap is NA

ambiguity

inclusion_ambiguates/nominates - if high, consider excluding this marker

marker_score

(1/(ambiguity + .01)) * nominates/total_nominated - a general measure of the quality of a marker. Higher is better

summary

A summary column that identifies potential problems with the provided markers
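
The ambiguity and marker_score definitions above can be checked by hand. The following is a minimal sketch in base R using made-up counts for a single marker; the numbers are illustrative assumptions, not output from Garnett, and the variable names simply mirror the column names.

```r
# Hypothetical counts for one marker (illustrative values only)
nominates            <- 200  # cells this marker nominates to its cell type
total_nominated      <- 500  # cells nominated by all markers for the cell type
inclusion_ambiguates <- 20   # cells that become ambiguous if the marker is included

# ambiguity = inclusion_ambiguates / nominates
ambiguity <- inclusion_ambiguates / nominates

# marker_score = (1 / (ambiguity + .01)) * nominates / total_nominated
marker_score <- (1 / (ambiguity + 0.01)) * nominates / total_nominated

print(ambiguity)
print(marker_score)
```

Because of the .01 term in the denominator, the score stays finite when ambiguity is 0: the first factor is then 100, so marker_score is 100 times the marker's share of total_nominated.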

Value

A data.frame of marker check results, with one row per marker and the columns described in Details.

Examples

library(org.Hs.eg.db)
data(test_cds)

# generate size factors for normalization later
test_cds <- estimateSizeFactors(test_cds)
marker_file_path <- system.file("extdata", "pbmc_bad_markers.txt",
                                package = "garnett")
marker_check <- check_markers(test_cds, marker_file_path,
                              db = org.Hs.eg.db,
                              cds_gene_id_type = "SYMBOL",
                              marker_file_gene_id_type = "SYMBOL")
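
# The documented columns of the result can be used to flag problematic
# markers. The sketch below is an illustration, not part of the package
# examples: it uses a small hand-made data.frame containing a subset of the
# documented columns as a stand-in for marker_check, so it runs without
# Garnett installed. Gene names, values, and the cutoff are invented.

```r
# Stand-in for the output of check_markers(); all values are invented
mock_check <- data.frame(
  marker_gene = c("GENE_A", "GENE_B", "GENE_C"),
  cell_type   = c("T cells", "T cells", "B cells"),
  in_cds      = c(TRUE, TRUE, FALSE),
  nominates   = c(200, 50, 0),
  ambiguity   = c(0.10, 0.60, NA),
  stringsAsFactors = FALSE
)

# Markers reported as absent from the CDS
missing_markers <- mock_check$marker_gene[!mock_check$in_cds]

# Markers whose ambiguity exceeds a user-chosen cutoff
ambiguity_cutoff <- 0.5  # assumption: an arbitrary threshold, not a Garnett default
too_ambiguous <- with(
  mock_check,
  marker_gene[!is.na(ambiguity) & ambiguity > ambiguity_cutoff]
)

print(missing_markers)
print(too_ambiguous)
```

# With Garnett installed, the same subsetting can be applied to the real
# marker_check, which can also be passed to plot_markers as noted in the
# Description.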


cole-trapnell-lab/garnett documentation built on Jan. 6, 2025, 2:18 p.m.