feature_gene_selection: Select feature genes
In jichunxie/CDI: Clustering Deviation Index (CDI)

feature_gene_selection

R Documentation

Select feature genes

Description

This function selects a subset of feature genes that are expected to express differently across cell types before calculating CDI.

Usage

feature_gene_selection(
  X,
  batch_label = NULL,
  count_slot = NULL,
  batch_slot = NULL,
  method = "wds",
  nfeature = 500,
  zp_threshold = 0.95
)

Arguments

`X`	The class of X can be "matrix", "Seurat" object, or "SingleCellExperiment" object. If X is a matrix, it should be a UMI count matrix where each row represents a gene, and each column represents a cell. If X is a Seurat object or SingleCellExperiment object, users need to specify where the count matrix, and batch labels are stored in count_slot and batch_slot, respectively.
`batch_label`	A vector of characters indicating the batch labels of the cells. The length of batch_label should be the same as the number of columns in the count matrix. The default value is NULL indicating that there is no batch information available.
`count_slot`	A string indicating the location of raw UMI count. For Seurat object, it is a slot in "RNA" of "assays"; For SingleCellExperiment object, it is a slot in "assays". Each row represents a gene, and each column represents a cell. The genes should be those before feature gene selection.
`batch_slot`	A string indicating the location of batch labels of cells. For Seurat object, it is a slot in meta.data; For SingleCellExperiment object, it is a slot in "colData". The default value is NULL indicating that there is no batch information available.
`method`	A character indicating the method used to select feature genes. "wds" (default) represent the working dispersion score proposed in our study; "vst" is the default method for feature gene selection in FindVariableFeatures function of Seurat package.
`nfeature`	An integer indicating the number of features to select. The default value is 500.
`zp_threshold`	A number of zero proportion threshold. The range should be (0,1). Genes with zero proportion greater than this value will be excluded before feature selection.

Value

A vector of indices of selected feature genes corresponding to the row indices of input count matrix.

References

Stuart and Butler et al. Comprehensive Integration of Single-Cell Data. Cell (2019) [Seurat V3]

Examples

## Simulate a matrix of 100 rows (genes), where the first 50 genes have
## different mean expression levels.
## Apply feature_gene_selection to select genes
ng <- 100; nc <- 100
set.seed(1)
X <- cbind(
	matrix(
		c(rnbinom(ng*nc/4, size = 1, mu = 0.1),
			rnbinom(ng*nc/4, size = 1, mu = 0.5)),
		nrow = ng, 
		byrow = TRUE),
	matrix(
		c(rnbinom(ng*nc/4, size = 1, mu = 1),
			rnbinom(ng*nc/4, size = 1, mu = 0.5)),
		nrow = ng, 
		byrow = TRUE))
colnames(X) <- paste0('c', seq_len(nc))
rownames(X) <- paste0('g', seq_len(ng))
Batches <- rep(seq_len(2), nc/2)

## Input: matrix
feature_gene_selection(
	X = X,
	batch_label = Batches,
	nfeature = 20,
	zp_threshold = 0.95)


## Input: SingleCellExperiment object
library(SingleCellExperiment)
sim_sce <- SingleCellExperiment(
  list(count = X),
  colData = data.frame(
  	Cell_name = colnames(X),
		batch = Batches),
	rowData = data.frame(gene_name = rownames(X)))
selected_genes <- feature_gene_selection(
  X = sim_sce,
  count_slot = "count",
  batch_slot = "batch",
  nfeature = 20,
  zp_threshold = 0.95)

## Input: Seurat object
library(Seurat)
library(SeuratObject)
sim_seurat <- CreateSeuratObject(counts = as.data.frame(X))
sim_seurat <- AddMetaData(sim_seurat, colnames(X), "Cell_name")
sim_seurat <- AddMetaData(sim_seurat, Batches, "batch")
selected_genes <- feature_gene_selection(
	X = sim_seurat,
	count_slot = "counts",
	batch_slot = "batch",
	nfeature = 20,
	zp_threshold = 0.95)

jichunxie/CDI documentation built on Dec. 6, 2023, 5:48 a.m.