sortGenes: sortGenes
In mahmoudibrahim/genesorteR: Feature Ranking in Clustered Single Cell Data


View source: R/sortGenes.R

sortGenes is the main function of the genesorteR package. It takes a gene expression matrix and cell cluster assignments. It binarizes the expression matrix and calculates empirical statistics on gene expression in each cluster including a specificity score that can be used to rank genes in cell clusters.

sortGenes(
  x,
  classLabels,
  binarizeMethod = "median",
  TF_IDF = FALSE,
  returnInput = TRUE,
  cores = 1
)

`x`	A numeric (sparse) matrix. It will be coerced to a dgCMatrix sparse matrix. Rows represent genes, columns represent cells.
`classLabels`	A numeric or character vector or a factor of the same length as ncol(x) that represents cell cluster assignments. It will be coerced to a factor whose levels are the cell cluster names.
`binarizeMethod`	Either "median" (default) or "naive" or "adaptiveMedian" or a numeric cutoff. See Details.
`TF_IDF`	Return the TF-IDF weights on the cluster level? `FALSE` by default. See Details.
`returnInput`	Return the input matrix and cell classes? `TRUE` by default. See Details.
`cores`	An integer greater than zero (1 by default) that indicates how many cores to use for parallelization using mclapply.
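
The coercions described for `x` and `classLabels` can be reproduced by hand before calling sortGenes, which is a convenient way to check that the inputs line up. This is a minimal sketch on hypothetical toy data; it uses the Matrix package directly and is not taken from the sortGenes source.

```r
library(Matrix)

# Hypothetical toy input: 2 genes (rows) x 3 cells (columns).
x <- matrix(
  c(0, 1.5, 0,
    2, 0,   0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), c("cell1", "cell2", "cell3"))
)

# Coerce the dense numeric matrix to a column-compressed sparse matrix (dgCMatrix).
x_sparse <- as(x, "CsparseMatrix")

# Cluster assignments: one label per cell, coerced to a factor.
classLabels <- factor(c("T cell", "B cell", "T cell"))

# The labels must have one entry per column (cell) of the matrix.
stopifnot(length(classLabels) == ncol(x_sparse))
```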

When binarizeMethod is "median", the expression matrix is binarized with a single cutoff estimated from all values in the matrix. The cutoff is the median of all non-zero entries in the matrix and is returned in cutoff. When binarizeMethod is "adaptiveMedian", genes are first clustered into groups by expression level, then the "median" method is applied to each group separately. This method assumes that the matrix supplied in x is log scaled. When binarizeMethod is "naive", all non-zero entries are kept and the minimum of the non-zero entries is returned in cutoff. If the input matrix x has already been binarized, set binarizeMethod to "naive". To use a specific cutoff for binarization, set binarizeMethod to a numeric value >= 0.
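
The "median" and "naive" cutoffs described above can be sketched in base R on a small dense matrix. The toy matrix and the helper `binarize` are hypothetical and only illustrate the rule as stated here (entries lower than the cutoff become zero); this is not the sortGenes implementation, which works on sparse matrices.

```r
# Hypothetical toy expression matrix: 4 genes (rows) x 6 cells (columns).
x <- matrix(
  c(0, 2, 5, 0, 1, 3,
    4, 0, 0, 6, 0, 0,
    0, 0, 1, 0, 0, 2,
    7, 8, 0, 9, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:6))
)

# "median": one cutoff for the whole matrix, the median of all non-zero entries.
median_cutoff <- median(x[x > 0])

# "naive": every non-zero entry is kept; the cutoff is the smallest non-zero value.
naive_cutoff <- min(x[x > 0])

# Hypothetical helper: entries lower than the cutoff become 0, all others become 1.
binarize <- function(mat, cutoff) (mat >= cutoff) * 1

binary_median <- binarize(x, median_cutoff)
binary_naive  <- binarize(x, naive_cutoff)
```

With the "median" cutoff only the upper half of the non-zero entries survive, while "naive" keeps every detected value, so the two methods can rank genes quite differently.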

The specificity scores balance the posterior probability of observing a cell cluster given the gene (gene-cluster specificity) with the conditional probability of observing the gene given the cluster (a measure of gene expression). This ensures that highly specific genes are also highly expressed. The specScore matrix is the main output of this function, and many of the other functions in the genesorteR package operate on it. The values in this matrix can be used to rank features (genes in scRNA-Seq) in clusters.
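
The probabilities involved can be worked through on a toy binarized matrix. In this sketch the posterior is obtained with Bayes' rule, and the two probabilities are combined by simple multiplication; that product is an assumption made for illustration, since the exact combination used by sortGenes is not spelled out here. The variable names mirror the components of the returned list, but the data are hypothetical.

```r
# Hypothetical binarized matrix: 3 genes x 6 cells, two clusters of 3 cells each.
binary <- matrix(
  c(1, 1, 1, 0, 0, 0,
    1, 1, 0, 1, 1, 0,
    0, 0, 0, 0, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), NULL)
)
classLabels <- factor(c("c1", "c1", "c1", "c2", "c2", "c2"))

# P(cluster): fraction of cells belonging to each cluster.
classProb <- as.vector(table(classLabels)) / length(classLabels)
names(classProb) <- levels(classLabels)

# P(gene): fraction of all cells in which each gene is detected.
geneProb <- rowMeans(binary)

# P(gene | cluster): fraction of cells within a cluster in which the gene is detected.
condGeneProb <- sapply(levels(classLabels), function(cl) {
  rowMeans(binary[, classLabels == cl, drop = FALSE])
})

# P(cluster | gene) via Bayes' rule: P(gene | cluster) * P(cluster) / P(gene).
postClustProb <- sweep(condGeneProb, 2, classProb, "*") / geneProb

# Assumed combination for illustration: the product of the two probabilities.
specScore <- postClustProb * condGeneProb
```

In this toy example geneC is detected only in cluster c2, so its posterior probability for c2 is 1, but it is found in just one of three c2 cells, which pulls its score down to 1/3. geneA is both specific to c1 and detected in every c1 cell, so it scores 1.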

When TF_IDF is `TRUE`, Term Frequency-Inverse Document Frequency weights for each gene in each cell cluster will also be returned. The analogy here is that each cell cluster represents a "document", and each gene a "term". The inverse document frequency weighting behind TF-IDF was proposed in the well-known paper by K Spärck Jones, 1972 (doi:10.1108/eb026526).
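
To make the document/term analogy concrete, the sketch below computes textbook TF-IDF weights on a hypothetical gene-by-cluster table of detection counts. It uses the common tf * log(N / df) form; the exact weighting and normalization used by sortGenes are not described here, so treat this as an illustration of the idea rather than the package's formula.

```r
# Hypothetical detection counts: number of cells in each cluster ("document")
# in which each gene ("term") is detected.
counts <- matrix(
  c(3, 0, 0,
    2, 2, 2,
    0, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("c1", "c2", "c3"))
)

# Term frequency: each gene's share of all detections within a cluster.
tf <- sweep(counts, 2, colSums(counts), "/")

# Inverse document frequency: log(number of clusters / clusters containing the gene).
idf <- log(ncol(counts) / rowSums(counts > 0))

# TF-IDF weight for each gene in each cluster (idf is applied row-wise).
tfidf <- tf * idf
```

geneB is detected in every cluster, so its inverse document frequency is log(1) = 0 and all of its weights vanish, whereas geneA and geneC each appear in a single cluster and keep non-zero weights there.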

Note that if returnInput is set to FALSE (the input expression matrix will not be returned in the output), many of the other functions that accept the output of sortGenes will break.

sortGenes can in principle be applied to either a raw count matrix or a normalized log-count expression matrix.

sortGenes returns a list with the following components:

`binary`	The binarized gene expression matrix. A sparse matrix of class dgCMatrix.
`cutoff`	The cutoff value used to binarize the gene expression matrix. Anything lower than this value was set to zero.
`removed`	A numeric vector containing the row indices of genes that were removed because they were not expressed in any cells after binarization. If none were removed, this will be `NULL`.
`geneProb`	A numeric vector whose length is equal to nrow(binary) that lists the fraction of cells in which a gene was detected.
`condGeneProb`	A sparse matrix of class dgCMatrix with as many rows as nrow(binary) and as many columns as the number of cell clusters. It includes the conditional probability of observing a gene in a cluster.
`postClustProb`	A sparse matrix of class dgCMatrix with the same size as `condGeneProb`, containing the posterior probability that a cell belongs to a certain cluster or type given that the gene was observed.
`specScore`	A sparse matrix of class dgCMatrix with the same size as `condGeneProb`, containing a specificity score for each gene in each cell cluster. See Details.
`classProb`	A numeric vector whose length is equal to the number of cell clusters, containing the fraction of cells belonging to each cluster.
`inputMat`	The input `x` matrix, after being coerced to a sparse matrix of class dgCMatrix.
`inputClass`	The input `classLabels` after being coerced to a factor.

Mahmoud M Ibrahim <mmibrahim@pm.me>

data(kidneyTabulaMuris)
#basic functionality
gs = sortGenes(kidneyTabulaMuris$exp, kidneyTabulaMuris$cellType)

#the top 10 genes for each cluster by specificity scores
top_genes = apply(gs$specScore, 2, function(x) names(head(sort(x, decreasing = TRUE), n = 10)))

#the same top 10 genes but using the plotTopMarkerHeat function
plotTopMarkerHeat(gs, top_n = 10, outs = TRUE, plotheat = FALSE)

#naive binarization keeps any non-zero input in the input matrix
gs_naive = sortGenes(kidneyTabulaMuris$exp, kidneyTabulaMuris$cellType, binarizeMethod = "naive")

#different genes?
plotTopMarkerHeat(gs_naive, top_n = 10, outs = TRUE, plotheat = FALSE)