calculate_CDI: Clustering Deviance Index (CDI)

View source: R/CalculateCDI.R

calculate_CDIR Documentation

Clustering Deviance Index (CDI)

Description

This function calculates CDI-AIC and CDI-BIC for each candidate set of cell labels. CDI calculates AIC and BIC of cell-type-specific gene-specific NB model for UMI counts, where the cell types are based on each candidate label set, and only the selected subset of genes are considered. Whether to use CDI-AIC or CDI-BIC depend on the goals. We suggest using CDI-BIC to select optimal main cell types and using CDI-AIC to select optimal subtypes, because BIC puts more penalty on the complexity of models (number of clusters).

Usage

calculate_CDI(
  X,
  feature_gene_index = NULL,
  cand_lab_df,
  cell_size_factor,
  batch_label = NULL,
  count_slot = NULL,
  batch_slot = NULL,
  lrt_pval_threshold = 0.01,
  clustering_method = NULL,
  BPPARAM = SerialParam()
)

Arguments

X

The class of X can be "matrix", "Seurat" object, or "SingleCellExperiment" object. If X is a matrix, it should be a UMI count matrix where each row represents a gene, and each column represents a cell. If X is a Seurat object or SingleCellExperiment object, users need to specify where the count matrix and batch labels are stored in count_slot and batch_slot, respectively. If feature_gene_index is NULL, genes in X should only included feature genes (that are selected by feature_gene_selection function); if feature_gene_index is not NULL, this function will extract a subset of X with genes indexed by feature_gene_index.

feature_gene_index

A vector of unique integers indicating the indices of feature genes. The default value if NULL, which means all genes in X will be used to calculate CDI. The integers in feature_gene_index need to be no greater than the number of genes in X.

cand_lab_df

A vector of cluster labels of the cells or a data frame where each column corresponds to one set of cluster labels of the cells. This (these) label sets can be clustering results obtained by any clustering methods. The length (number of rows) of cand_lab_df should be the same as the number of columns in the count matrix. If the column names of label set data frame are provided with the format "[ClusteringMethod]_k[NumberOfClusters]" such as "KMeans_K5, 'calculate_CDI' will extract the "[ClusteringMethod]" as the Cluster_method. The clustering method can also be provided in the argument "clustering_method" for each label set.

cell_size_factor

A numeric vector indicating the size factor of the cells. This should be the output of function size_factor. The length of cell_size_factor should be the same as the number of columns in the count matrix.

batch_label

A vector of characters indicating the batch labels of the cells. The length of batch_label should be the same as the number of columns in the count matrix.

count_slot

A string indicating the location of raw UMI count. For Seurat object, it is a slot in "RNA" of "assays"; For SingleCellExperiment object, it is a slot in "assays". Each row represents a gene, and each column represents a cell. The genes should be those before feature gene selection.

batch_slot

A string indicating the location of batch labels of cells. For Seurat object, it is a slot in meta.data; For SingleCellExperiment object, it is a slot in "colData". The default value is NULL indicating that there is no batch information available.

lrt_pval_threshold

A numeric value within (0, 1) indicating the p-value threshold for the likelihood ratio test (LRT). If multiple batches exist, within each cluster and each gene, CDI will test whether a batch-common NB model or a batch-specific NB model should be fitted with the LRT. If the p-value is less than this threshold, a batch-specific NB model will be fitted. Otherwise, a batch-common NB model will be fitted.

clustering_method

A vector of characters indicating the corresponding clustering method for each label set. The length of the vector needs to be the same as the number of columns in cand_lab_df.

BPPARAM

A BiocParallelParam object from the BiocParallel package. By specifying this argument, users can control over how to perform the parallel computing. Default is SerialParam which uses a single core.

Value

calculate_CDI returns a data frame with 5 columns. The columns are Label_name (name of each label set), Cluster_method (clustering method), CDI-AIC, CDI-BIC, and N_cluster (number of clusters). Each row corresponds to one set of cell labels.

References

SMartin Morgan, Valerie Obenchain, Michel Lang, Ryan Thompson and Nitesh Turaga (2021). \Sexpr[results=rd]{tools:::Rd_expr_doi("https://github.com/Bioconductor/BiocParallel")}

Examples

ng <- 100; nc <- 100
set.seed(1)

# count matrix
X <- cbind(
	matrix(
		c(rnbinom(ng*nc/4, size = 1, mu = 0.1),
			rnbinom(ng*nc/4, size = 1, mu = 0.5)),
		nrow = ng,
		byrow = TRUE),
	matrix(
		c(rnbinom(ng*nc/4, size = 1, mu = 1),
			rnbinom(ng*nc/4, size = 1, mu = 0.5)),
		nrow = ng,
		byrow = TRUE))
colnames(X) <- paste0('c', seq_len(nc))
rownames(X) <- paste0('g', seq_len(ng))

# batch label
Batches <- rep(seq_len(2), nc/2)

# cell clustering labels
Method1_k2 <- rep(seq_len(2), c(nc/2,nc/2))
Method1_k3 <- sample(seq_len(3), nc, replace = TRUE)
label_df <- data.frame(
	Method1_k2 = Method1_k2,
	Method1_k3 = Method1_k3)

## select feature genes (see feature_gene_selection function)
selected_genes <- seq_len(30)

## calculate size factor (see size_factor function)
size_factor_vec <- rep(1, nc)

calculate_CDI(
 X = X[selected_genes, ],
	cand_lab_df = label_df,
	cell_size_factor = size_factor_vec,
	batch_label = Batches)

## Input: SingleCellExperiment object
library(SingleCellExperiment)
sim_sce <- SingleCellExperiment(
  list(count = X),
  colData = data.frame(
    Cell_name = colnames(X),
		 batch = Batches),
	 rowData = data.frame(
	   Gene_name = rownames(X)))

calculate_CDI(
 	X = sim_sce,
 	feature_gene_index = selected_genes, 
 	cand_lab_df = label_df,
 	cell_size_factor = size_factor_vec,
 	count_slot = "count",
 	batch_slot = "batch")

## Input: Seurat object
library(Seurat)
library(SeuratObject)
sim_seurat <- CreateSeuratObject(counts = as.data.frame(X))
sim_seurat <- AddMetaData(sim_seurat, colnames(X), "Cell_name")
sim_seurat <- AddMetaData(sim_seurat, Batches, "batch")

calculate_CDI(
	X = sim_seurat,
	feature_gene_index = selected_genes,
	cand_lab_df = label_df,
	cell_size_factor = size_factor_vec,
 count_slot = "counts", 
 batch_slot = "batch")

## parallel computing
library(BiocParallel)
## single core
bp_object <- SerialParam()
## multi-cores
## bp_object  <- MulticoreParam(workers = 2)
calculate_CDI(
	X = X[selected_genes, ],
	cand_lab_df = label_df,
	cell_size_factor = size_factor_vec,
	batch_label = Batches,
	lrt_pval_threshold = 0.01,
	clustering_method = NULL,
	BPPARAM = bp_object)
             

jichunxie/CDI documentation built on Dec. 6, 2023, 5:48 a.m.