iterClust: Iterative Clustering


A framework for performing clustering analysis iteratively

iterClust(dset, maxIter = 10, minFeatureSize = 100,
  featureSelect = iterClust::featureSelect, minClustSize = 10,
  coreClust = iterClust::coreClust, clustEval = iterClust::clustEval,
  clustHetero = iterClust::clustHetero, obsEval = iterClust::obsEval,
  obsOutlier = iterClust::obsOutlier)

`dset`	(numeric matrix or data.frame) features in rows and observations in columns, or a SummarizedExperiment0 or ExpressionSet object
`maxIter`	(positive integer) specifies the maximum number of iterations to be performed
`minFeatureSize`	(positive integer) specifies minimum number of features needed
`featureSelect`	(function) takes a dataset, depth(IV) and cluster$feature(IV), returns a character array, containing features used for clustering analysis
`minClustSize`	(positive integer) specifies minimum cluster size
`coreClust`	(function) takes a dataset and depth(IV), returns a list, containing clustering vectors under different clustering parameters
`clustEval`	(function) takes a dataset, depth(IV) and coreClust result, returns a numeric vector, evaluating the robustness (higher value means more robust) of each clustering scheme
`clustHetero`	(function) takes depth(IV) and clustEval result, returns a boolean vector, deciding whether a cluster is considered heterogeneous
`obsEval`	(function) takes a dataset and optimal coreClust result determined by clustEval, returns a numeric vector, evaluating the clustering robustness of each observation
`obsOutlier`	(function) takes depth(IV) and obsEval result, returns a boolean vector, deciding whether an observation is an outlier
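
To illustrate the function contracts described above, here is a minimal sketch of stand-ins for `coreClust` and `clustHetero`, written in base R only. The names `myCoreClust` and `myClustHetero`, the argument names and their order, the `kRange` parameter and the 0.15 threshold are all assumptions made for illustration. They are not the package defaults, so check the package source for the exact signatures before passing functions like these to iterClust.

```r
# Hypothetical coreClust-style function (illustrative, not the package default):
# hierarchical clustering on correlation distance, cut at several values of k.
# dset:  numeric matrix, features in rows, observations in columns
# depth: current iteration; unused here, kept only to mirror the described contract
myCoreClust <- function(dset, depth, kRange = 2:4) {
  d <- as.dist(1 - cor(dset))               # distance between observations
  tree <- hclust(d, method = "average")
  kRange <- kRange[kRange < ncol(dset)]     # keep k below the number of observations
  lapply(kRange, function(k) cutree(tree, k = k))
}

# Hypothetical clustHetero-style function: a clustering scheme counts as
# evidence of heterogeneity when its score exceeds an (assumed) threshold.
myClustHetero <- function(depth, score, threshold = 0.15) {
  score > threshold
}

# Toy data: 20 features x 12 observations
set.seed(1)
toy <- matrix(rnorm(20 * 12), nrow = 20,
              dimnames = list(paste0("gene", 1:20), paste0("obs", 1:12)))

schemes <- myCoreClust(toy, depth = 1)
length(schemes)        # one clustering vector per value of k
names(schemes[[1]])    # observation names are carried along
myClustHetero(1, c(0.05, 0.30))
```

Each element of `schemes` is a named integer vector assigning every observation to a cluster, which is the shape the `coreClust` description asks for.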

#################### General Idea ####################

In a scenario where populations A, B1 and B2 exist (B1 and B2 being subpopulations of B), pronounced differences between A and B may mask subtle differences between B1 and B2. To detect such heterogeneity, clustering analysis needs to be performed iteratively: for example, iteration 1 separates A from B, and iteration 2 separates B1 from B2.
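
The two-step idea can be sketched on simulated data with base R alone. This is an illustration of the mechanics, not the iterClust implementation: the population sizes, the effect sizes, the use of `hclust` with Euclidean distance and the variance-based feature re-selection are all assumptions. The toy data are clean enough that a single pass might also recover three groups, so the sketch shows the procedure rather than the masking effect itself.

```r
set.seed(1)
nFeat <- 50

# Simulate n observations around a per-feature center (features in rows)
makePop <- function(n, center) {
  matrix(rnorm(nFeat * n, mean = rep(center, n), sd = 0.2), nrow = nFeat)
}

centerA  <- rep(10, nFeat)                          # A differs strongly from B
centerB1 <- rep(0, nFeat)
centerB2 <- centerB1
centerB2[1:10] <- 1.5                               # B2 differs subtly from B1

dset <- cbind(makePop(10, centerA), makePop(10, centerB1), makePop(10, centerB2))
truth <- rep(c("A", "B1", "B2"), each = 10)
colnames(dset) <- paste0(truth, "_", 1:10)
rownames(dset) <- paste0("f", seq_len(nFeat))

# Iteration 1: cluster all observations; the dominant A-versus-B split appears
split1 <- cutree(hclust(dist(t(dset))), k = 2)
table(truth, split1)

# Iteration 2: restrict to the B cluster, re-select features, cluster again
bObs <- names(split1)[split1 == split1["B1_1"]]
sub <- dset[, bObs]
featVar <- apply(sub, 1, var)
top <- order(featVar, decreasing = TRUE)[1:10]      # most variable within B
split2 <- cutree(hclust(dist(t(sub[top, ]))), k = 2)
table(truth[match(bObs, colnames(dset))], split2)
```

Re-selecting features inside the B cluster matters here: once A is set aside, the features that vary most are the ones that distinguish B1 from B2.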

#################### General Work Flow ####################

ith Iteration Start ==>>

featureSelect (feature selection) ==>>

minFeatureSize (confirm enough features are selected) ==>>

clustHetero (confirm heterogeneity) ==>>

coreClust (generate several clustering schemes to be evaluated) ==>>

clustEval (pick optimal clustering scheme generated in previous step) ==>>

minClustSize (remove clusters with few observations) ==>>

obsEval (evaluate how well each observation is clustered) ==>>

obsOutlier (remove poorly clustered observations) ==>>

results in Internal Variables (IV) ==>>

ith Iteration End
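
The steps of one iteration can be sketched for a single group of observations as follows. This is a simplified base-R illustration under stated assumptions, not the package code: `silWidth` and `oneIteration` are hypothetical helpers, features are selected by variance, schemes are generated with `hclust`, both evaluation steps use a hand-written silhouette width, and the cut-offs (`heteroCut`, `outlierCut`) are arbitrary. In this sketch the heterogeneity check is applied after the schemes have been scored, because it needs those scores.

```r
# Silhouette width of each observation, computed in base R.
# d: dist object between observations; cl: integer cluster assignment
silWidth <- function(d, cl) {
  d <- as.matrix(d)
  vapply(seq_along(cl), function(i) {
    own <- cl == cl[i]
    own[i] <- FALSE
    if (!any(own)) return(0)                         # singleton cluster
    a <- mean(d[i, own])                             # mean distance to own cluster
    others <- setdiff(unique(cl), cl[i])
    b <- min(vapply(others, function(k) mean(d[i, cl == k]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
}

# Hypothetical single pass over one group of observations.
# Returns a list of observation names per retained cluster, or NULL if no split is made.
oneIteration <- function(dset, minFeatureSize = 5, minClustSize = 3,
                         heteroCut = 0.15, outlierCut = 0) {
  # feature selection: keep the more variable half of the features
  v <- apply(dset, 1, var)
  feats <- rownames(dset)[v >= median(v)]
  if (length(feats) < minFeatureSize) return(NULL)   # not enough features
  d <- dist(t(dset[feats, , drop = FALSE]))
  # generate candidate clustering schemes
  tree <- hclust(d)
  schemes <- lapply(2:3, function(k) cutree(tree, k = k))
  # score each scheme and check for heterogeneity
  scores <- vapply(schemes, function(cl) mean(silWidth(d, cl)), numeric(1))
  if (!any(scores > heteroCut)) return(NULL)         # treated as homogeneous
  best <- schemes[[which.max(scores)]]
  # drop clusters that are too small
  sizes <- table(best)
  keep <- as.character(best) %in% names(sizes)[sizes >= minClustSize]
  # drop poorly clustered observations
  keep <- keep & silWidth(d, best) > outlierCut
  split(names(best)[keep], best[keep])
}

# Toy data: two groups of 8 observations differing on the first 10 of 20 features
set.seed(1)
toy <- matrix(rnorm(20 * 16, sd = 0.3), nrow = 20,
              dimnames = list(paste0("f", 1:20), paste0("obs", 1:16)))
toy[1:10, 9:16] <- toy[1:10, 9:16] + 3
res <- oneIteration(toy)
str(res)
```

A full iterative run would apply such a pass again to each returned cluster until no cluster is judged heterogeneous or the iteration limit is reached.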

#################### Internal Variables (IV) ####################

The following IVs are used in user-defined functions in each iteration:

cluster: (list) the return value, described in "Value" section

depth: (numeric) current round of iteration

A list with the following structure, containing the iterClust result:

–> $cluster (list) $Iter[i] (list) $Cluster[j], (character array) names of the observations belonging to each cluster

–> $feature (list) $Iter[i] (list) $Cluster[j]inIter[i-1], (character array) features used to split each cluster in the previous iteration, thereby producing the current clusters

–> $clusterScore (list) $Iter[i] (list) $Cluster[j]inIter[i-1], (numeric array) clustEval output for each clustering scheme

–> $observationScore (list) $Iter[i] (list) $Cluster[j]inIter[i-1], (numeric array) obsEval output for each sample
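
As a sketch of how the nested `$cluster` element might be consumed, the following builds a small mock list with the structure described above and flattens it into one membership column per iteration. The mock data, the element names `Iter1`/`Cluster1` and the helper `membership` are assumptions for illustration. The helper indexes by position, so it does not depend on the exact element names.

```r
# Mock result mimicking the described $cluster structure (not real iterClust output)
res <- list(cluster = list(
  Iter1 = list(Cluster1 = c("s1", "s2", "s3"),
               Cluster2 = c("s4", "s5", "s6", "s7")),
  Iter2 = list(Cluster1 = c("s1", "s2", "s3"),
               Cluster2 = c("s4", "s5"),
               Cluster3 = c("s6"))))

# Hypothetical helper: turn one iteration's cluster list into a membership vector.
# Observations absent from every cluster (e.g. removed as outliers) stay NA.
membership <- function(clusterList, obsNames) {
  out <- setNames(rep(NA_integer_, length(obsNames)), obsNames)
  for (j in seq_along(clusterList)) out[clusterList[[j]]] <- j
  out
}

allObs <- paste0("s", 1:7)
m <- sapply(res$cluster, membership, obsNames = allObs)
m    # rows: observations, columns: iterations
```

A matrix like `m` is convenient for colouring a plot by cluster at each iteration, in the same spirit as the visualization code in the examples.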

DING, HONGXU (hd2326@columbia.edu)

library(iterClust)
library(tsne)
library(cluster)
library(bcellViper)

data(bcellViper)
exp <- exprs(dset)
pheno <- as.character(dset@phenoData@data$description)
exp <- exp[, pheno %in% names(table(pheno))[table(pheno) > 5]]
pheno <- pheno[pheno %in% names(table(pheno))[table(pheno) > 5]]
#load bcellViper expression and phenotype annotation

c <- iterClust(exp, maxIter=3, minClustSize=5)
#iterClust

dist <- as.dist(1 - cor(exp))
set.seed(1)
tsne <- tsne(dist, perplexity = 20, max_iter = 500)
for (j in 1:length(c$cluster)){
    COL <- structure(rep(1, ncol(exp)), names = colnames(exp))
    for (i in 1:length(c$cluster[[j]])) COL[c$cluster[[j]][[i]]] <- i+1
    plot(tsne[, 1], tsne[, 2], cex = 0, cex.lab = 1.5,
         xlab = "Dim1", ylab = "Dim2",
         main = paste("iterClust, iter=", j, sep = ""))
    text(tsne[, 1], tsne[, 2], labels = pheno, cex = 0.5, col = COL)
    legend("topleft", legend = "Outliers", fill = 1, bty = "n")}
#visualize results