selectGenes: Select a subset of informative genes
In rliger: Linked Inference of Genomic Experimental Relationships

Description

This function identifies highly variable genes in each dataset and combines the per-dataset gene sets, by union or intersection, for use in downstream analysis. Assuming that gene expression approximately follows a Poisson distribution, it selects genes whose expression variance exceeds a given threshold relative to mean gene expression. Alternatively, a desired number of genes can be selected for each dataset by ranking the relative variance; the per-dataset selections are then combined in the same way.
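
The thresholding idea described above can be illustrated with a minimal base R sketch. It is not rliger's implementation: the helper `selectByRelativeVariance`, the simulated counts, and the constant `poissonConst` are assumptions made for illustration. Under Poisson sampling, a gene's variance after library-size normalization is roughly its mean times the average of 1 / library size, so genes whose log10 variance exceeds that expectation by more than `thresh` are flagged as variable.

```r
set.seed(1)
nGene <- 200
nCell <- 500

# Simulated counts (genes x cells): each row is Poisson with its own rate.
lambda <- runif(nGene, 5, 20)
counts <- matrix(rpois(nGene * nCell, lambda), nrow = nGene)
# Make the first 20 genes overdispersed (negative binomial) so they stand out.
counts[1:20, ] <- matrix(rnbinom(20 * nCell, size = 0.5, mu = 10), nrow = 20)
rownames(counts) <- paste0("gene", seq_len(nGene))

# Hypothetical helper, not part of rliger.
selectByRelativeVariance <- function(counts, thresh = 0.1) {
  libSize <- colSums(counts)
  norm <- sweep(counts, 2, libSize, "/")   # normalize, no scale factor, no log
  geneMean <- rowMeans(norm)
  geneVar <- apply(norm, 1, var)
  # Poisson expectation: var(norm) is about geneMean * mean(1 / libSize)
  poissonConst <- mean(1 / libSize)
  keep <- geneMean > 0 &
    log10(geneVar) > log10(geneMean) + log10(poissonConst) + thresh
  rownames(counts)[keep]
}

selected <- selectByRelativeVariance(counts, thresh = 0.1)
head(selected)
```

In this sketch, raising `thresh` can only shrink the selection, since a gene that clears a higher bar also clears a lower one.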

Usage

selectGenes(object, thresh = 0.1, nGenes = NULL, alpha = 0.99, ...)

## S3 method for class 'liger'
selectGenes(
  object,
  thresh = 0.1,
  nGenes = NULL,
  alpha = 0.99,
  useDatasets = NULL,
  useUnsharedDatasets = NULL,
  unsharedThresh = 0.1,
  combine = c("union", "intersection"),
  chunk = getOption("ligerChunkSize", 20000),
  verbose = getOption("ligerVerbose", TRUE),
  var.thresh = thresh,
  alpha.thresh = alpha,
  num.genes = nGenes,
  datasets.use = useDatasets,
  unshared.datasets = useUnsharedDatasets,
  unshared.thresh = unsharedThresh,
  tol = NULL,
  do.plot = NULL,
  cex.use = NULL,
  unshared = NULL,
  ...
)

## S3 method for class 'Seurat'
selectGenes(
  object,
  thresh = 0.1,
  nGenes = NULL,
  alpha = 0.99,
  useDatasets = NULL,
  layer = "ligerNormData",
  assay = NULL,
  datasetVar = "orig.ident",
  combine = c("union", "intersection"),
  verbose = getOption("ligerVerbose", TRUE),
  ...
)

Arguments

`object`	A liger, ligerDataset or `Seurat` object, with normalized data available (not multiplied by a scale factor and not log-transformed).
`thresh`	Variance threshold used to identify variable genes. A higher threshold results in fewer selected genes. The liger and Seurat S3 methods accept a single value or a vector with a specific threshold for each dataset in `useDatasets`.* Default `0.1`.
`nGenes`	Number of genes to find for each dataset. When this is set, the threshold for each dataset is optimized so that `nGenes` features are selected per dataset. Accepts a single value or a vector of dataset-specific settings matching `useDatasets`.* Default `NULL` does not optimize.
`alpha`	Alpha threshold. Controls the upper bound for expected mean gene expression. A lower threshold means a higher upper bound. Default `0.99`.
`...`	Arguments passed to other methods.
`useDatasets`	A character vector of names, or a numeric or logical vector of indices, of the datasets to use for shared variable feature selection. Default `NULL` uses all datasets.
`useUnsharedDatasets`	A character vector of names, or a numeric or logical vector of indices, of the datasets to use for finding unshared variable features. Default `NULL` does not attempt to find unshared features.
`unsharedThresh`	Same as `thresh`, but applied when testing unshared features. A single value for all datasets in `useUnsharedDatasets` or a vector of dataset-specific settings.* Default `0.1`.
`combine`	How to combine variable genes selected from all datasets. Choose from `"union"` or `"intersection"`. Default `"union"`.
`chunk`	Integer. Maximum number of cells in each chunk when gene selection is applied to any HDF5-based dataset. Default `20000`.
`verbose`	Logical. Whether to show progress information. Default `getOption("ligerVerbose")`, or `TRUE` if users have not set it.
`var.thresh`, `alpha.thresh`, `num.genes`, `datasets.use`, `unshared.datasets`, `unshared.thresh`	Deprecated. These arguments have been renamed and will be removed in the future. Please see the function usage for their replacements.
`tol`, `do.plot`, `cex.use`, `unshared`	Deprecated. The gene variability metric is now visualized with a separate function, `plotVarFeatures`. Users can now set a non-NULL `useUnsharedDatasets` to select unshared genes, instead of having to switch `unshared` on.
`layer`	Where the input normalized counts should be retrieved from. Default `"ligerNormData"`. For older Seurat versions, the data is always retrieved from the `data` slot.
`assay`	Name of assay to use. Default `NULL` uses current active assay.
`datasetVar`	Metadata variable name that stores the dataset source annotation. Default `"orig.ident"`.
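
The interplay of `nGenes` and `combine` can be pictured with a small base R sketch. It is illustrative only and does not reproduce rliger's internals: the scores, the dataset names `ctrl` and `stim`, and the helpers `topN` and `combineGenes` are all hypothetical. Each dataset contributes its top-ranked genes, and the per-dataset sets are then merged by union or intersection.

```r
# Hypothetical per-dataset relative-variance scores (named numeric vectors).
scores <- list(
  ctrl = c(geneA = 2.1, geneB = 0.3, geneC = 1.7, geneD = 0.9),
  stim = c(geneA = 1.9, geneB = 1.2, geneC = 0.2, geneD = 1.5)
)

# Hypothetical helper: names of the n highest-scoring genes in one dataset.
topN <- function(score, n) names(sort(score, decreasing = TRUE))[seq_len(n)]
perDataset <- lapply(scores, topN, n = 2)

# Hypothetical helper: merge the per-dataset gene sets.
combineGenes <- function(geneSets, combine = c("union", "intersection")) {
  combine <- match.arg(combine)  # first choice, "union", is the default
  Reduce(if (combine == "union") union else intersect, geneSets)
}

combineGenes(perDataset, "union")         # geneA, geneC, geneD
combineGenes(perDataset, "intersection")  # geneA
```

Union keeps any gene that is variable in at least one dataset, while intersection keeps only genes that are variable in every dataset, so it tends to return a smaller set.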

Value

Updated object

liger method - Each involved ligerDataset has its featureMeta slot updated, along with its varUnsharedFeatures slot if requested with useUnsharedDatasets, while varFeatures(object) is updated with the final combined gene set.
Seurat method - The final selection is stored in Seurat::VariableFeatures(object). Per-dataset information is stored in the meta.features slot of the chosen Assay.

Examples

library(rliger)
# `pbmc` is the example object used throughout the package examples
pbmc <- normalize(pbmc)
# Select genes by thresholding the relative variance
pbmc <- selectGenes(pbmc, thresh = .1)
# Select a specified number of genes for each dataset
pbmc <- selectGenes(pbmc, nGenes = c(60, 60))
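
As a further sketch using only the arguments documented above, the call below sets a separate threshold for each dataset and keeps the intersection of the per-dataset selections. It assumes the example `pbmc` object contains two datasets (as the `nGenes = c(60, 60)` call suggests) and is wrapped in a `requireNamespace` guard so that it does nothing when rliger is not installed. The variable name `pbmcInt` and the threshold values are illustrative.

```r
# Guarded so the snippet is a no-op when rliger is not installed.
if (requireNamespace("rliger", quietly = TRUE)) {
  library(rliger)
  pbmc <- normalize(pbmc)
  # One threshold per dataset (assumes two datasets), combined by intersection
  pbmcInt <- selectGenes(pbmc, thresh = c(0.1, 0.2), combine = "intersection")
  # The final combined gene set is exposed through varFeatures()
  head(varFeatures(pbmcInt))
}
```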

rliger documentation built on June 8, 2025, 1:56 p.m.