cluster_assessment: AssessME - a cluster assessment tool for preprocessing and...

View source: R/AssessMe.R

cluster_assessment    R Documentation

AssessME - a cluster assessment tool for preprocessing and clustering optimisation

Description

Tool for assessing and comparing cluster partitions obtained with different filtering, feature selection, normalization, batch correction, imputation and clustering approaches.

Usage

cluster_assessment(
  assessment_list = NULL,
  seuratobject = NULL,
  seurat_assay = "RNA",
  seurat_lib_size = F,
  do.features = T,
  var_feat_len = NULL,
  RaceIDobject = NULL,
  RaceID_cl_table = NULL,
  ScanpyobjectFullpath = NULL,
  scanpy_clust = "leiden",
  scanpyscalefactor = 10000,
  rawdata = NULL,
  ndata = NULL,
  norm = T,
  givepart = NULL,
  givefeatures = NULL,
  minexpr = 5,
  CGenes = NULL,
  ccor = 0.65,
  fselectRace = F,
  fselectSeurat = F,
  givebatch = NULL,
  individualbatch = NULL,
  gene.domain = F,
  PCA_QA = F,
  PCAnum = 10,
  run_cutoff = T,
  f1Z = F,
  cutoff = "mean",
  cutoffmax = F,
  clustsize = 10,
  binaclassi = "F1Score",
  Entro_tresh = T,
  Entro_med = T,
  run_enriched = T,
  give2ndfiff = T,
  diffexp = "nbino",
  vfit = NULL,
  gooutlier = T,
  individualfit = F,
  outminc = 5,
  probthr = 0.01,
  diptest = T,
  bwidth = T,
  critmass = T,
  mintotal = 3000,
  unifrac = 0.1,
  logmodetest = F,
  b_bw = 25,
  n_bw = 128,
  b_ACR = 100,
  n_ACR = 1024,
  batch_entropy = F,
  set.name = NULL,
  rawdata_null = T
)

Arguments

assessment_list

list, with named objects for different assessments, to which new assessment is added. Default is NULL.

seuratobject

Seurat object as input for assessment: derives UMI count object, normalized count object, cluster partition and variable features from Seurat Object. Default = NULL.

seurat_assay

if seuratobject is given, name of the Seurat assay from which the required objects are retrieved. Default = "RNA".

seurat_lib_size

logical. If FALSE performs library size normalization of UMI counts object of seuratobject and overwrites normalized data object within assessment object. Default = FALSE.

do.features

logical. If TRUE performs feature selection and derives var_feat_len number of top variable genes. Default = TRUE.

var_feat_len

number of top variable genes used for cluster assessment. If var_feat_len does not equal the length of the "var.features" object of seuratobject, the top var_feat_len feature genes are derived using Seurat’s variance stabilization method. Requires seuratobject and do.features set to TRUE. Default = NULL.

RaceIDobject

RaceID object as input for assessment: derives UMI count data of cells passing filtering criteria, normalized data, cluster partition, feature genes, background noise model describing the expression variance of genes as a function of their mean and RaceID filtering criteria. Default = NULL.

RaceID_cl_table

metadata data frame for a RaceID object in similar form as meta.data object of a Seurat object with rows as cells and columns as e.g. different cluster partitions. Default = NULL.

ScanpyobjectFullpath

full path to a scanpy object in h5ad format, which is converted to a Seurat object from which UMI counts, cluster partition and feature genes are derived. Library size normalization is performed on the UMI count data and the result is scaled using the scale factor (scanpyscalefactor). Default = NULL.

scanpy_clust

either “leiden” or “louvain”, derives cluster partition of either Leiden or Louvain clustering. Default = “leiden”.

scanpyscalefactor

integer number with which relative cell counts are scaled to equal transcript counts. Default = 10,000.
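As a rough illustration of the library size normalization mentioned for scanpyscalefactor (and for the norm argument), the base R sketch below divides each cell's counts by its total and rescales by a scale factor. The toy matrix and variable names are invented for illustration and are not taken from the package source.

```r
# Toy UMI count matrix: genes as rows, cells as columns (illustrative values)
rawdata <- matrix(c(2, 0, 8,
                    1, 3, 6,
                    0, 5, 5),
                  nrow = 3,
                  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:3)))

scalefactor <- 10000  # assumed to play the role of scanpyscalefactor

# Divide every column (cell) by its total count, then rescale
ndata <- sweep(rawdata, 2, colSums(rawdata), "/") * scalefactor

colSums(ndata)  # every cell now sums to the scale factor
```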

rawdata

UMI count expression data with genes as rows and cells as columns. Default = NULL.

ndata

normalized expression data with genes as rows and cells as columns. Default = NULL.

norm

logical. If TRUE, performs library size normalization on the provided rawdata argument. Default = TRUE.

givepart

clustering partition. Either a vector of integer cluster numbers for each cell, in the same order as the UMI count table or normalized count table for RaceIDobject; or a character string naming a column of the metadata data frame of a Seurat object or of a similar metadata frame, RaceID_cl_table, for a RaceID object. Default = NULL.

givefeatures

gene vector to perform assessment. Default = NULL.

minexpr

minimum required transcript count of a gene across evaluated cells. Genes not passing this criterion are filtered out. Default = 5. If RaceIDobject is given, minexpr is derived from RaceIDobject. Relevant for deriving feature genes if gene.domain is TRUE and for calculating the fit of the dependency of the variance on the mean.

CGenes

gene vector for genes to exclude from feature selection. Only relevant if seuratobject & RaceIDobject & ScanpyobjectFullpath = NULL and rawdata is given. Default = NULL.

ccor

numeric value of the correlation coefficient used as threshold for determining genes correlated to genes in CGenes. Only genes correlating less than ccor to all genes in CGenes are retained for analysis. Default = 0.65.
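The filtering described for CGenes and ccor can be sketched in base R as follows. This is a minimal illustration under the assumption that plain Pearson correlation on normalized expression is compared against ccor; the gene names and simulated data are hypothetical, and the package may compute the correlation differently.

```r
set.seed(1)
n_cells <- 50
base <- rnorm(n_cells)
ndata <- rbind(
  cc_gene   = base,                             # gene listed in CGenes
  follower  = base + rnorm(n_cells, sd = 0.1),  # strongly correlated with cc_gene
  unrelated = rnorm(n_cells)                    # independent gene
)
CGenes <- "cc_gene"
ccor <- 0.65

# Correlation of every gene with every gene in CGenes (genes x CGenes matrix)
cmat <- cor(t(ndata), t(ndata[CGenes, , drop = FALSE]))

# Keep genes whose correlation to all genes in CGenes stays below ccor
keep <- rownames(ndata)[apply(cmat < ccor, 1, all)]
keep
```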

fselectRace

logical. If TRUE, performs RaceID feature selection; only applies if seuratobject, RaceIDobject, ScanpyobjectFullpath and givefeatures are all NULL. Default = FALSE.

fselectSeurat

logical. If TRUE, performs Seurat variance stabilization feature selection and derives the top var_feat_len variable genes; only applies if seuratobject, RaceIDobject, ScanpyobjectFullpath and givefeatures are all NULL. Default = FALSE.

givebatch

vector indicating batch information for cells; must have the same length and order as cluster partition. Default = NULL.

individualbatch

individual batch name, element of givebatch, to perform assessment on. Default = NULL.

gene.domain

logical. If TRUE, assess all genes with at least minexpr in one cell. Default = FALSE.

PCA_QA

logical. If TRUE, derives first two principal components and the top PCAnum number of genes with highest or lowest loadings. Default = False.

PCAnum

integer value, number of genes to be derived with top highest and top lowest loadings for the first two principal components. Default = 10.
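One way to picture what PCA_QA and PCAnum describe is to run a PCA on the cells and read off the genes with the most extreme loadings on the first two components. The sketch below uses base R's prcomp on simulated data; the data, the scaling choice and the column names are assumptions for illustration only.

```r
set.seed(7)
ndata <- matrix(rnorm(20 * 30), nrow = 20,
                dimnames = list(paste0("gene", 1:20), paste0("cell", 1:30)))
PCAnum <- 3

# Cells as observations, genes as variables
pca <- prcomp(t(ndata), center = TRUE, scale. = TRUE)
load <- pca$rotation[, 1:2]  # gene loadings on PC1 and PC2

top <- data.frame(
  PC1_high = names(sort(load[, "PC1"], decreasing = TRUE))[1:PCAnum],
  PC1_low  = names(sort(load[, "PC1"]))[1:PCAnum],
  PC2_high = names(sort(load[, "PC2"], decreasing = TRUE))[1:PCAnum],
  PC2_low  = names(sort(load[, "PC2"]))[1:PCAnum]
)
top
```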

run_cutoff

logical. If TRUE calculate per gene cutoff, representing true label utilized for F1 score, entropy and enrichment of gene per cluster calculation. Default = T.

cutoff

either “mean” or “median”; uses either the per gene average expression within clusters or the per gene median expression within clusters to calculate the true-label cutoff. The cutoff is calculated per gene by selecting the cluster with the highest average or median expression and averaging this value with the mean or median of the remaining clusters. Default = “mean”.
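The cutoff rule can be read as the midpoint between the top cluster's summary value and the summary of the remaining clusters. The sketch below follows that reading for a single gene with cutoff = “mean”; it is an interpretation of the description above, not the package's code, and the toy values are invented.

```r
# Toy normalized expression of one gene across 9 cells in 3 clusters
expr <- c(10, 12, 14,  2, 4, 3,  0, 1, 2)
part <- rep(1:3, each = 3)

# Per-cluster summary (swap mean for median to mimic cutoff = "median")
cl_stat <- tapply(expr, part, mean)

top <- which.max(cl_stat)  # cluster with the highest average expression

# Average the top cluster's value with the mean of the remaining clusters
cutoff_value <- mean(c(cl_stat[top], mean(cl_stat[-top])))

cl_stat
cutoff_value
```

Here the cluster means are 12, 3 and 1, so the cutoff is the average of 12 and 2.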

cutoffmax

logical. If TRUE, then per gene cutoff is the average expression of the cluster with highest average expression across clusters. Default = False.

clustsize

integer value, threshold of minimum number of cells a cluster should have to be included in the assessment. Default = 10.

binaclassi

either “F1Score”, “Cohenkappa”, “MCC” or NULL. Statistical analysis for binary classification. F1Score, Cohenkappa or Matthews correlation coefficient (MCC). If NULL then computation is skipped. Default = “F1Score”.
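The three binary classification scores can be computed from a 2 x 2 confusion table. The sketch below assumes that the "true label" of a cell is expression above the per gene cutoff and that the "prediction" is membership in the cluster with the highest expression; that pairing is an assumption made for illustration, and all values are invented.

```r
expr <- c(10, 12, 6,  2, 8, 3,  0, 1, 2)  # one gene, 9 cells
part <- rep(1:3, each = 3)
cutoff_value <- 7  # per gene cutoff (illustrative)
max_cl <- 1        # cluster with the highest mean expression

truth <- expr > cutoff_value  # true label derived from the cutoff
pred  <- part == max_cl       # assumed prediction: cell is in the top cluster

tp <- sum(pred & truth);  fp <- sum(pred & !truth)
fn <- sum(!pred & truth); tn <- sum(!pred & !truth)
n  <- tp + fp + fn + tn

f1 <- 2 * tp / (2 * tp + fp + fn)

mcc <- (tp * tn - fp * fn) /
  sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

po <- (tp + tn) / n                                           # observed agreement
pe <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2   # chance agreement
cohen_kappa <- (po - pe) / (1 - pe)

c(F1 = f1, MCC = mcc, kappa = cohen_kappa)
```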

Entro_tresh

logical. If TRUE, calculate per gene entropy, utilizing the derived per gene cutoff as true label, to assess label distribution across clusters. Default = TRUE, requires run_cutoff.

Entro_med

logical. If TRUE, calculate per gene median expression per cluster and the fraction of each individual median of the summed medians across clusters, which is used to calculate per gene entropy. Default = TRUE (as given in Usage), requires run_cutoff.
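Both entropy variants can be sketched with a small Shannon entropy helper. The example assumes entropy in bits (log base 2) and no further normalization, which may differ from the package; the helper name and toy values are hypothetical.

```r
shannon <- function(p) {
  p <- p[p > 0]      # treat 0 * log(0) as 0
  -sum(p * log2(p))
}

expr <- c(10, 12, 14,  2, 8, 3,  0, 1, 2)  # one gene, 9 cells
part <- rep(1:3, each = 3)
cutoff_value <- 7

# Cutoff-based: how the cells above the cutoff are spread over clusters
pos_per_cluster <- tapply(expr > cutoff_value, part, sum)
entropy_thresh <- shannon(pos_per_cluster / sum(pos_per_cluster))

# Median-based: fraction of each cluster median of the summed medians
med_per_cluster <- tapply(expr, part, median)
entropy_median <- shannon(med_per_cluster / sum(med_per_cluster))

c(threshold = entropy_thresh, median = entropy_median)
```

A gene whose positive cells all fall into one cluster gets an entropy of 0 under this definition, while a gene spread evenly over clusters gets the maximum value.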

run_enriched

logical. If TRUE, run enrichment analysis using fisher.test. Using the cutoff, expression per gene is binarized across cells, so each cell has an expression of either 1 or 0. Binarized expression is summed within clusters and enrichment per cluster is calculated for each gene using fisher.test. If a cluster is enriched for a gene (p-value < 0.05), the value of that gene for the cluster is set to 1. To speed up computation, for each gene the clusters are ordered by decreasing fraction of positive cells and enrichment is tested iteratively along that order. If the enrichment p-value of 3 clusters (flag count) is not significant, the remaining clusters are expected not to be enriched. Clusters with fewer cells than the average number of cells per cluster do not increase the flag count. Default = TRUE.
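The per cluster enrichment test can be sketched with base R's fisher.test on a 2 x 2 table of positive and negative cells inside and outside a cluster. The example tests every cluster and uses a one-sided alternative; both choices are assumptions, and the iterative early-stopping (flag count) shortcut described above is deliberately left out.

```r
part <- rep(1:3, each = 30)

# Binarized expression of one gene: 27/30, 3/30 and 2/30 positive cells
binexpr <- c(rep(c(1, 0), c(27, 3)),
             rep(c(1, 0), c(3, 27)),
             rep(c(1, 0), c(2, 28)))

clusters <- sort(unique(part))
enriched <- sapply(clusters, function(cl) {
  in_cl <- part == cl
  # Columns: inside / outside the cluster; rows: positive / negative cells
  tab <- matrix(c(sum(binexpr[in_cl] == 1),  sum(binexpr[in_cl] == 0),
                  sum(binexpr[!in_cl] == 1), sum(binexpr[!in_cl] == 0)),
                nrow = 2)
  p <- fisher.test(tab, alternative = "greater")$p.value
  as.integer(p < 0.05)  # 1 = enriched, 0 = not enriched
})
names(enriched) <- clusters
enriched
```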

give2ndfiff

logical. If TRUE, run differential expression analysis between every cluster and its closest cluster(s) based on highest number of co-enriched genes, for genes which are shared enriched in these clusters. If more than one cluster share the same number of co-enriched genes, differential expression of co-enriched genes is performed for all co-enriched clusters. Default = T. Co-enriched clusters can represent cell states of the same cell types.

diffexp

either “nbino” or “wilcox”. Performs differential expression analysis between cells of clusters with highest number of co-enriched genes for these co-enriched genes based on Wilcoxon test or negative binomial distribution test utilizing global gene mean-variance dependence. Default = “nbino”.

vfit

function of the background noise model describing the expression variance as a function of the mean expression. Input can be utilized for differential expression analysis between co-enriched genes and identification of outlier gene-expression within cluster in outlier analysis. Default = NULL.

gooutlier

logical. If TRUE, performs outlier identification based on cluster partition and identifies outlier gene expression within clusters. Default = TRUE.

individualfit

logical. If TRUE, background noise model, required to infer outlier expression, is fitted for each cluster separately, default = F.

outminc

integer value, minimal transcript count of a gene to be included in the background fit. Default = 5.

probthr

numeric value, outlier probability threshold for genes to exhibit outlier expression within a cluster. Probability is computed from a negative binomial background model of expression in a cluster. Default = 0.01.
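The idea of an outlier probability under a negative binomial background model can be sketched as follows. The mean-variance function vfit_toy is a hypothetical stand-in for vfit, and using the upper tail probability of the observed count is an assumption; the package may define the probability differently.

```r
mu <- 5                                # mean expression of the gene in the cluster
vfit_toy <- function(m) m + 0.2 * m^2  # hypothetical mean-variance relation
v <- vfit_toy(mu)
size <- mu^2 / (v - mu)                # NB size parameter from mean and variance

counts <- c(3, 6, 4, 40, 5)            # transcript counts of the gene in five cells

# Probability of a count at least this large under the background model
p_upper <- pnbinom(counts - 1, size = size, mu = mu, lower.tail = FALSE)

probthr <- 0.01
outlier <- p_upper < probthr
outlier
```

Only the cell with 40 transcripts falls below the threshold in this toy example.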

diptest

logical. If T, runs the dip.test function from the diptest package to test for unimodality of gene expression (enriched genes) within clusters by computing Hartigans’ dip statistic per gene. The calculation is performed only on expression values of at least minexpr. As the calculation is performed on library size normalized and rescaled data, minexpr is rescaled based on scalefactor divided by mintotal. Expression is only tested if a given fraction of a cluster, unifrac, exhibits at least the rescaled minexpr or the sample size equals at least clustsize. Default = T.

bwidth

logical. If T, performs Silverman’s critical bandwidth method to test for unimodality of gene expression (enriched genes) within clusters. The calculation is performed only on expression values of at least minexpr. As the calculation is performed on library size normalized and rescaled data, minexpr is rescaled based on scalefactor divided by mintotal. Expression is only tested if a given fraction of a cluster, unifrac, exhibits at least the rescaled minexpr or the sample size equals at least clustsize. Default = T.

critmass

logical. If T, performs Ameijeiras-Alonso’s method to test for unimodality of gene expression (enriched genes) within clusters. The calculation is performed only on expression values of at least minexpr. As the calculation is performed on library size normalized and rescaled data, minexpr is rescaled based on scalefactor divided by mintotal. Expression is only tested if a given fraction of a cluster, unifrac, exhibits at least the rescaled minexpr or the sample size equals at least clustsize. Default = T.

mintotal

minimal number of transcripts cells are expected to have, to calculate expression cutoff. Default = 3000

unifrac

fraction of cells in a cluster required to exhibit at least the rescaled minexpr for a gene to be tested for unimodality. Default = 0.1.

logmodetest

logical. If T, performs log transformation before testing unimodality of gene expression. Default = F.

b_bw

number of replicates used for Silverman’s critical bandwidth test, default = 25.

n_bw

number of equally spaced points at which density is estimated, for Silverman’s critical bandwidth test, default = 128.

b_ACR

number of replicates used for Ameijeiras-Alonso’s unimodality test, default = 100.

n_ACR

number of equally spaced points at which density is estimated, for Ameijeiras-Alonso’s unimodality test, default = 1024.

set.name

name of the individual assessment within the output list of assessments. Default = NULL, in which case the name is chosen as follows: if seuratobject is given, the name is the metadata column equal to Idents(), the character string given as input for givepart, or the object name of the numeric cluster partition. If RaceIDobject is given, the name is the character string given as input for givepart, the object name of the numeric cluster partition, or “Vdefault”.

rawdata_null

logical. If TRUE, do not store UMI count table in output of assessment, default = T

f1Z

logical. If TRUE, then the cutoff for the true label is x > 0. Default = False.

batch_entropy

logical. If T, calculate the entropy of batches across cluster. Default = F.

Value

List of assessments, with a named object per assessment. Individual assessments represent a list with the following objects:

rawdata

Raw expression data matrix/UMI count matrix derived from input objects, with cells as columns and genes as rows in sparse matrix format.

rowmean

mean expression of assessed features.

part

vector containing cluster partition derived from input objects.

clustsize

threshold of minimum number of cells in a cluster used for assessment.

features

vector of feature genes derived from object, used to compute its cluster partition.

assessed_features

vector of features assessed by AssessMe; can differ from features when the var_feat_len argument differs from the length of the object-derived features or when a different set of genes is given with givefeatures.

PCA

data.frame with 4 columns, indicating top PCAnum genes with: highest loadings for PC1, lowest loadings for PC1, highest loadings for PC2 and lowest loadings for PC2.

max_cl

vector indicating for assessed features which cluster exhibits highest mean expression.

cutoff

vector indicating calculated numeric cutoff for assessed features.

f1_score

vector indicating f1_score or alternative statistical analysis for binary classification, for the assessed features.

Entropy_tresh

vector indicating Entropy per assessed feature, calculated based on the per gene cutoff.

Entropy_median

vector indicating Entropy per assessed feature, calculated based on per gene median expression per cluster and fraction of individual medians of summed median across clusters.

cluster

vector indicating assessed clusters.

enriched_features

number of enriched features per cluster.

enriched_feature_list

list with a vector per cluster of enriched features.

unique_features

number of uniquely enriched features per cluster.

unique_feature_list

list with a vector per cluster of uniquely enriched features.

second_cluster

data.frame with rows representing a cluster and its closest clusters based on co-enriched genes, and columns representing: “frac_shared_to_clos_cluster”: number of co-enriched genes; “rel_frac_shared_to_clos”: fraction of co-enriched genes of enriched genes; “frac_diff_of_shared_features”: number of differential genes of co-enriched genes; “rel_frac_diff_of_shared_to_clos”: fraction of differential genes of co-enriched genes.
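Finding the closest cluster(s) by number of co-enriched genes can be sketched from a binary enrichment matrix. The matrix, the cluster names and the use of crossprod are illustrative assumptions rather than the package's implementation; ties are kept, in line with the description of give2ndfiff.

```r
# Binary enrichment matrix: genes as rows, clusters as columns (1 = enriched)
enr <- matrix(c(1, 1, 0,
                1, 1, 0,
                1, 0, 1,
                0, 1, 1,
                1, 1, 0),
              ncol = 3, byrow = TRUE,
              dimnames = list(paste0("g", 1:5), paste0("cl", 1:3)))

shared <- crossprod(enr)    # shared[i, j] = genes enriched in both i and j
n_enriched <- diag(shared)  # number of enriched genes per cluster
diag(shared) <- 0           # ignore a cluster's overlap with itself

# Closest cluster(s): all clusters sharing the maximum number of genes
closest <- apply(shared, 1, function(x) names(x)[x == max(x)])

# Fraction of a cluster's enriched genes shared with its closest cluster(s)
rel_shared <- apply(shared, 1, max) / n_enriched

closest
rel_shared
```

In this toy case cl1 and cl2 are each other's closest cluster, while cl3 is tied between cl1 and cl2.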

list_2ndShared

list with data.frame for every cluster with rows as enriched genes of a cluster and columns representing binary classification for enrichment (1= enriched, 0 = not enriched) of a cluster and its most similar clusters based on co-enriched genes.

shared2ndgenes

list with vector for every cluster of enriched genes with co-enrichment in closest clusters.

list_2nd_diff

list with vector for every cluster of co-enriched genes with differential expression to co-enriched clusters.

outliertab

data.frame indicating number of outlier cells per cluster with 1, 2 or 3 outlier genes. Rows representing cluster and columns representing number of cells with 1, 2 or 3 outlier genes.

outlier_genes

list with vector for every cluster indicating outlier genes.

nonunimodal_list

list with data.frame per cluster with rows representing enriched gene per cluster and columns p.value of dip.test and p.value after multiple testing correction with Bonferroni and BH method.

nonunimodaltab

data.frame indicating number of genes per cluster with non-unimodal expression before and after multiple-testing correction.
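The multiple-testing step behind nonunimodal_list and nonunimodaltab can be illustrated with base R's p.adjust. The p-values below are invented, and the 0.05 significance level is an assumption.

```r
# Invented dip test p-values for four enriched genes of one cluster
pvals <- c(geneA = 0.001, geneB = 0.02, geneC = 0.04, geneD = 0.6)

tab <- data.frame(
  p.value    = pvals,
  bonferroni = p.adjust(pvals, method = "bonferroni"),
  BH         = p.adjust(pvals, method = "BH")
)

# Number of genes called non-unimodal before and after correction
n_sig <- colSums(tab < 0.05)
n_sig
```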

bandwidth_list

list with vector for every cluster indicating genes with non-unimodal expression derived from Silverman’s critical bandwidth test.

masstest_list

list with vectors for every cluster indicating genes with non-unimodal expression based on Ameijeiras-Alonso’s method to test for unimodality.

batch_entropy

entropy of batches across clusters

Examples

library(Seurat)
# x: UMI count matrix with genes as rows and cells as columns
entero <- CreateSeuratObject(counts = x, project = "10x", min.cells = 3, min.features = 200)
entero <- NormalizeData(entero, normalization.method = "RC", scale.factor = 10000)
entero <- FindVariableFeatures(entero, selection.method = "vst", nfeatures = 3000)
features <- Seurat::VariableFeatures(entero)
entero <- ScaleData(entero, features = features)
entero <- RunPCA(entero, features = features, npcs = 100)
entero <- FindNeighbors(entero, dims = 1:100)
# Cluster at resolutions 1 to 10; each run adds a metadata column
resolution <- 1:10
for (i in resolution) { entero <- FindClusters(entero, resolution = i) }
# Metadata columns holding the partitions (column 5 is expected to be seurat_clusters)
res <- colnames(entero[[]])[c(4, 6:length(colnames(entero[[]])))]
# Assess every partition, appending each assessment to the growing list
assess_seuratRC <- NULL
for (r in res) {
  assess_seuratRC <- cluster_assessment(
    assessment_list = assess_seuratRC, seuratobject = entero, givepart = r,
    give2ndfiff = FALSE, Entro_med = FALSE, diptest = FALSE, run_enriched = TRUE,
    bwidth = FALSE, critmass = FALSE, gooutlier = TRUE)
}

PatZeis/AssessMe documentation built on Nov. 19, 2022, 6:03 a.m.