getTopHVGs: Identify HVGs

getTopHVGs    R Documentation

Description

Define a set of highly variable genes, based on variance modelling statistics from modelGeneVar or related functions.

Usage

getTopHVGs(
  stats,
  var.field = "bio",
  n = NULL,
  prop = NULL,
  var.threshold = 0,
  fdr.field = "FDR",
  fdr.threshold = NULL,
  row.names = !is.null(rownames(stats))
)

Arguments

stats

A DataFrame of variance modelling statistics with one row per gene. Alternatively, a SummarizedExperiment object, in which case it is supplied to modelGeneVar to generate the required DataFrame.

var.field

String specifying the column of stats containing the relevant metric of variation.

n

Integer scalar specifying the number of top HVGs to report.

prop

Numeric scalar specifying the proportion of genes to report as HVGs.

var.threshold

Numeric scalar specifying the minimum threshold on the metric of variation.

fdr.field

String specifying the column of stats containing the adjusted p-values. If NULL, no filtering is performed on the FDR.

fdr.threshold

Numeric scalar specifying the FDR threshold. If NULL (the default), no FDR filtering is performed.

row.names

Logical scalar indicating whether row names should be reported.

Details

This function will identify all genes where the relevant metric of variation is greater than var.threshold. By default, this means that we retain all genes with positive values in the var.field column of stats. If var.threshold=NULL, the minimum threshold on the value of the metric is not applied.

If fdr.threshold is specified, we further subset to genes that have an FDR less than or equal to fdr.threshold. By default, FDR thresholding is turned off because modelGeneVar and related functions assess the significance of large variances relative to other genes, which can be overly conservative if many genes are highly variable.

If n=NULL and prop=NULL, the resulting subset of genes is returned directly. Otherwise, only the top genes with the largest values of the variance metric are returned, where the number of genes is the larger of n and prop*nrow(stats).
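The filtering steps above can be sketched in base R on a toy table. This is only an illustration: sketch_top_hvgs and the toy data frame are hypothetical and are not scran's implementation. The sketch also makes several assumptions that the text does not state: genes with missing values are dropped by any filter that is applied, prop*nrow(stats) is rounded to an integer, and the result is ordered by decreasing metric.

```r
# Toy table standing in for the output of modelGeneVar(); values are made up.
toy <- data.frame(
    bio = c(0.5, -0.1, 2.0, 0.0, 1.2, NA),
    FDR = c(0.20, 0.90, 0.01, 0.50, 0.04, NA),
    row.names = paste0("gene", 1:6)
)

# Hypothetical helper, NOT scran's implementation: it only mirrors the
# steps described in the Details section.
sketch_top_hvgs <- function(stats, var.field = "bio", n = NULL, prop = NULL,
                            var.threshold = 0, fdr.field = "FDR",
                            fdr.threshold = NULL) {
    keep <- rep(TRUE, nrow(stats))

    # Step 1: keep genes whose metric is strictly greater than var.threshold.
    # Assumption: genes with a missing metric are dropped.
    if (!is.null(var.threshold)) {
        v <- stats[[var.field]]
        keep <- keep & !is.na(v) & v > var.threshold
    }

    # Step 2: optionally keep genes with FDR <= fdr.threshold.
    # Assumption: genes with a missing FDR are dropped.
    if (!is.null(fdr.threshold)) {
        f <- stats[[fdr.field]]
        keep <- keep & !is.na(f) & f <= fdr.threshold
    }

    # Assumption: surviving genes are ordered by decreasing metric.
    idx <- which(keep)
    idx <- idx[order(stats[[var.field]][idx], decreasing = TRUE)]

    # Step 3: if n or prop is given, keep at most the larger of
    # n and prop * nrow(stats) genes (rounding is an assumption).
    if (!is.null(n) || !is.null(prop)) {
        top <- max(n, round(prop * nrow(stats)))
        idx <- head(idx, top)
    }

    rownames(stats)[idx]
}

sketch_top_hvgs(toy)                        # "gene3" "gene5" "gene1"
sketch_top_hvgs(toy, fdr.threshold = 0.05)  # "gene3" "gene5"
sketch_top_hvgs(toy, n = 1)                 # "gene3"
sketch_top_hvgs(toy, n = 1, prop = 0.5)     # "gene3" "gene5" "gene1"
```

In the last call, prop*nrow(stats) is 3, which is larger than n=1, so up to three genes are kept.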

Value

A character vector containing the names of the most variable genes, if row.names=TRUE.

Otherwise, an integer vector specifying the indices of stats containing the most variable genes.
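The two return types can be used interchangeably for subsetting. The base-R sketch below uses a made-up table (toy_stats is a hypothetical name) and builds both kinds of vector by hand rather than calling getTopHVGs. It assumes the genes are listed from most to least variable.

```r
# Toy statistics table; values are made up for illustration.
toy_stats <- data.frame(
    bio = c(0.5, -0.1, 2.0),
    row.names = c("geneA", "geneB", "geneC")
)

# Positions of genes with a positive metric, most variable first
# (the kind of vector described for row.names=FALSE).
ord <- order(toy_stats$bio, decreasing = TRUE)
idx <- ord[toy_stats$bio[ord] > 0]
idx
# 3 1

# The matching row names (the kind of vector described for row.names=TRUE).
nm <- rownames(toy_stats)[idx]
nm
# "geneC" "geneA"

# Either vector selects the same rows of the table.
by_index <- toy_stats[idx, , drop = FALSE]
by_name <- toy_stats[nm, , drop = FALSE]
```

Integer indices are the only option when stats has no row names, which is why the default for row.names depends on rownames(stats).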

Author(s)

Aaron Lun

See Also

modelGeneVar and friends, to generate stats.

modelGeneCV2 and friends, to also generate stats.

Examples

library(scran)
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

stats <- modelGeneVar(sce)
str(getTopHVGs(stats))
str(getTopHVGs(stats, fdr.threshold=0.05)) # more stringent

# Or directly pass in the SingleCellExperiment:
str(getTopHVGs(sce))

# Alternatively, use with the squared coefficient of variation:
stats2 <- modelGeneCV2(sce)
str(getTopHVGs(stats2, var.field="ratio"))


MarioniLab/scran documentation built on Sept. 7, 2024, 6:25 a.m.