UnsupRFhclust: Unsupervised random forest with hclust: cluster data and...


Description

This function takes a dissimilarity matrix, such as the Random Forest dissimilarity matrix from RFdist, and constructs a hierarchical clustering object using hclust. It then evaluates the predictive ability of the clusterings k = 2:K by predicting a binary response variable from the cluster memberships. The results can be used to validate the clustering and select the best number of clusters. See Ngufor et al. (2017).

It takes a standard formula, a data matrix dat containing the binary response, and a dissimilarity matrix rfdist derived from dat, and computes the AUC for three logistic regression models: (1) a model with the predictors given in the formula, (2) a model with k = 2:K clusters generated from the hclust object, and (3) a model with k = 2:K randomly generated clusters. This is repeated over cv cross-validation folds. See the example below.
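
To make the comparison concrete, here is a minimal sketch of one evaluation step for a single k. It assumes dat contains a binary response column named y (a hypothetical name) and uses pROC::auc for the AUC; pROC is not a dependency of this package, and the package's internal implementation may differ.

library(pROC)

# dissimilarity matrix -> hclust object -> k cluster memberships
hc <- hclust(as.dist(rfdist), method = "ward.D2")
k  <- 3
cl <- factor(cutree(hc, k = k))

# model (2): logistic regression on cluster membership alone
d2     <- data.frame(y = dat$y, cl = cl)
fit.cl <- glm(y ~ cl, data = d2, family = binomial)
auc.cl <- auc(d2$y, predict(fit.cl, type = "response"))

# model (3): baseline with k randomly assigned clusters
d3       <- data.frame(y = dat$y,
                       cl = factor(sample(seq_len(k), nrow(dat), replace = TRUE)))
fit.rand <- glm(y ~ cl, data = d3, family = binomial)
auc.rand <- auc(d3$y, predict(fit.rand, type = "response"))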

Note that dat must be the same data used to compute the dissimilarity matrix and must be passed to the function unchanged (same rows, in the same order). Otherwise there is no guarantee that the method will assign clusters to the right observations in dat; the function does not currently check for this, but a simple sanity check is sketched below.
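
A minimal sanity check is to compare the number of observations in dat with the size of the dissimilarity object before calling the function. This catches dropped or added rows, though not reordering.

# rfdist may be a "dist" object or a square matrix
n.dist <- if (inherits(rfdist, "dist")) attr(rfdist, "Size") else nrow(rfdist)
stopifnot(n.dist == nrow(dat))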

Usage

UnsupRFhclust(formula, dat, rfdist, hclust.method = "ward.D2", K = 10,
  parallel = TRUE, cv = 5, mc.cores = 2, seed = 12345,
  verbos = TRUE, ...)

UnsupRFhclust.caret(formula, dat, rfdist, hclust.method = "ward.D2",
  K = 10, parallel = TRUE, cv = 5, mc.cores = 2, seed = 12345,
  caret.method = "glm", fitControl = trainControl(method = "none",
  classProbs = TRUE), verbos = TRUE, ...)

cluster_internal_validation(K = 10, Hclust.model, RFdist,
  parallel = TRUE, mc.cores = 2, seed = 1234, ...)

Arguments

formula

an R formula. Note that only binary outcomes are currently supported.

dat

the data matrix containing the predictors and the binary response.

rfdist

dissimilarity matrix, such as the RF distance matrix computed using RFdist based on the dat data matrix.

hclust.method

the agglomeration method to be used; see hclust.

K

maximum number of clusters to generate for the hierarchical clustering model.

parallel

(logical) run in parallel?

cv

number of cross-validation folds to perform; must be at least 2.

mc.cores

number of cores to use when running in parallel.

seed

random seed

...

further arguments passed to or from other methods.

caret.method

classification method to use in caret; "glm" and "rf" are currently tested.

fitControl

controls the computational nuances of the caret train function; see trainControl in the caret package.

Hclust.model

hierarchical clustering model

RFdist

RF dissimilarity matrix computed by RFdist.

Details

  1. UnsupRFhclust evaluates the predictive strength of the clusters using base glm directly, while UnsupRFhclust.caret fits glm through the caret package (the caret package is required). The latter offers an interface for using other classification models available in caret; see the sketch after this list.

  2. The function cluster_internal_validation takes a hierarchical clustering model and a dissimilarity matrix (e.g. the output of RFdist), runs the cluster.stats function from the fpc package for k = 2:K clusters, computes several internal cluster validation metrics, and selects the best number of clusters by majority rule.
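
For instance, a random forest can be substituted for the logistic regression via the caret interface. A minimal sketch, reusing form, dat, and RF.dist from the Examples section below; the trainControl settings shown are just the documented defaults, not a requirement.

library(caret)

res.rf <- UnsupRFhclust.caret(formula = form, dat = dat,
                              rfdist = RF.dist$RFdist, K = 10,
                              caret.method = "rf",
                              fitControl = trainControl(method = "none",
                                                        classProbs = TRUE),
                              parallel = FALSE, cv = 5, seed = 123)
head(res.rf$perf)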

Value

a list with the following components, depending on the function called:

  1. UnsupRFhclust and UnsupRFhclust.caret

    1. Hclust: hierarchical clustering model

    2. perf: a data frame with columns AUC, cluster (cluster number), CV (cross-validation fold), and type (one of the three types of models mentioned in the Description)

  2. cluster_internal_validation

    1. clusters: cluster memberships for the optimal number of clusters

    2. kopt: best number of clusters, obtained by majority rule over the table of metrics below (a sketch of this rule follows the list)

    3. table: matrix of internal validation metrics with columns

      • sep: cluster separation. Higher values indicate better clustering

      • toother: sum of the average distances of a point in a cluster to points in other clusters. Higher values indicate better clustering

      • within: sum of the average within-cluster distances. Smaller values indicate better clustering

      • between: average distance between clusters. Higher values indicate better clustering

      • ss: within-cluster sum of squares. Smaller values indicate better clustering

      • silwidth: average silhouette width. Higher values indicate better clustering. See silhouette

      • dunn: Dunn index. Higher values indicate better clustering

      • dunn2: another version of the Dunn index. Higher values indicate better clustering

      • wb.ratio: ratio of the average within-cluster distance to the average between-cluster distance. Smaller values indicate better clustering

      • ch: Calinski and Harabasz index. Higher values indicate better clustering

      • entropy: negative entropy. Smaller values are better

      • w.gap: sum of the widest within-cluster gaps. Smaller values are better

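The majority rule can be read off this table directly: for each metric, find the k that optimizes it in the direction listed above, then take the most frequent winner. Below is a rough sketch of that logic, assuming tab is the returned table with one row per k and row names giving k; the package's exact rule and tie-breaking may differ.

# direction of each metric: +1 = higher is better, -1 = smaller is better
direction <- c(sep = 1, toother = 1, within = -1, between = 1, ss = -1,
               silwidth = 1, dunn = 1, dunn2 = 1, wb.ratio = -1,
               ch = 1, entropy = -1, w.gap = -1)

ks     <- as.numeric(rownames(tab))   # assumed: row names give k
best.k <- sapply(colnames(tab),
                 function(m) ks[which.max(direction[m] * tab[, m])])

# kopt: the most frequent winner across metrics
kopt <- as.numeric(names(which.max(table(best.k))))
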
References

Ngufor, C., Warner, M. A., Murphree, D. H., Liu, H., Carter, R., Storlie, C. B., & Kor, D. J. (2017). Identification of Clinically Meaningful Plasma Transfusion Subgroups Using Unsupervised Random Forest Clustering. In AMIA Annual Symposium Proceedings (Vol. 2017, p. 1332). American Medical Informatics Association.

Examples

## Not run: 
require(plyr)
require(ggplot2)
data(iris)
dat <- iris

# get Random Forest dissimilarity matrix
RF.dist <- RFdist(data = dat[, -5], ntree = 10, no.rep = 20,
                  syn.type = "permute", importance = FALSE)
form <- as.formula(paste0("Species ~ ",
                   paste0(setdiff(names(dat), "Species"), collapse = "+")))

# UnsupRFhclust
res <- UnsupRFhclust(formula = form, dat = dat, rfdist = RF.dist$RFdist, K = 20,
                     parallel = FALSE, cv = 5, seed = 123)
tb <- ddply(res$perf, .variables = c("cluster", "type"), .fun = numcolwise(mean))
pp <- ggplot() +
  geom_line(data = tb,
            aes(x = cluster, y = AUC, colour = type, linetype = type), size = 1.3) +
  scale_color_manual(values = c("darkgreen", "darkred", "blue")) +
  geom_vline(xintercept = 3, colour = "darkgreen") +
  scale_x_continuous(name = "Number of clusters", breaks = 2:30) + ylab("AUC") +
  ylim(c(0.55, 1)) +
  theme(axis.title.x = element_text(size = 14, face = "bold"),
        axis.title.y = element_text(size = 14, face = "bold"),
        legend.text = element_text(size = 14, face = "bold"),
        axis.text.x = element_text(size = 13, face = "bold", colour = "gray40"),
        legend.title = element_text(size = 14, face = "bold"),
        axis.text.y = element_text(size = 13, face = "bold", colour = "gray40")) +
  scale_linetype_manual(values = c("solid", "solid", "dotted"))
print(pp)

# 3 appears to be the best number of clusters
clusters <- cutree(res$Hclust, 3)

# cluster_internal_validation example
# get hclust object from UnsupRFhclust
HCmod <- res$Hclust
rr <- cluster_internal_validation(K = 20, Hclust.model = HCmod,
                                  RFdist = RF.dist$RFdist, seed = 1234)
rr$table
rr$kopt

## End(Not run)
