Description
This function takes a dissimilarity matrix, such as the Random Forest dissimilarity matrix from RFdist, and constructs a hierarchical clustering object using the hclust function. It then evaluates the predictive ability of different clusterings k = 2:K by predicting a binary response variable from the cluster memberships. The results can be used to validate the clustering and to select the best number of clusters. See Ngufor et al. (2017).

It takes a standard formula, a data matrix dat containing the binary response, and a dissimilarity matrix rfdist derived from dat, and computes the AUCs of three logistic regression models: (1) a model with the predictors given in the formula, (2) a model with k = 2:K clusters generated from the hclust object, and (3) a model with k = 2:K randomly generated clusters. This is repeated over cv cross-validation folds. See the example below.

Note that the same dat used to compute the dissimilarity matrix must be passed to the function unchanged. Otherwise there is no guarantee that the method will assign clusters to the right observations in dat; there is currently no automatic check for this.
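The core comparison can be sketched in a few lines of base R. The following is illustrative only, not the package's internal code: a Euclidean distance stands in for the RF dissimilarity, AUC is computed in-sample rather than by cross-validation, and the formula-only model is omitted.

## Illustrative sketch: glm on hclust clusters vs. glm on random clusters,
## each scored by AUC (in-sample for brevity; the package cross-validates).
set.seed(1)
d  <- dist(scale(iris[, -5]))               # stand-in dissimilarity
hc <- hclust(d, method = "ward.D2")
y  <- as.integer(iris$Species == "setosa")  # a binary response

auc <- function(y, p) {                     # rank-based (Wilcoxon) AUC
  r <- rank(p); n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

for (k in 2:5) {
  cl  <- factor(cutree(hc, k))                         # tree clusters
  rnd <- factor(sample(k, length(y), replace = TRUE))  # random clusters
  cat(sprintf("k = %d: cluster AUC = %.3f, random AUC = %.3f\n", k,
      auc(y, fitted(glm(y ~ cl,  family = binomial))),
      auc(y, fitted(glm(y ~ rnd, family = binomial)))))
}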
Usage

UnsupRFhclust(formula, dat, rfdist, hclust.method = "ward.D2", K = 10,
              parallel = TRUE, cv = 5, mc.cores = 2, seed = 12345,
              verbos = TRUE, ...)

UnsupRFhclust.caret(formula, dat, rfdist, hclust.method = "ward.D2",
                    K = 10, parallel = TRUE, cv = 5, mc.cores = 2, seed = 12345,
                    caret.method = "glm", fitControl = trainControl(method = "none",
                    classProbs = TRUE), verbos = TRUE, ...)

cluster_internal_validation(K = 10, Hclust.model, RFdist,
                            parallel = TRUE, mc.cores = 2, seed = 1234, ...)
Arguments

formula: an R formula. Note, only binary outcomes are currently supported.
dat: data matrix containing the binary response.
rfdist: dissimilarity matrix, such as the RF distance matrix computed using RFdist.
hclust.method: the agglomeration method to be used; see hclust for details.
K: maximum number of clusters to generate for the hierarchical clustering model.
parallel: (logical) run in parallel?
cv: number of cross-validation folds; must be at least 2.
mc.cores: number of cores to use when running in parallel.
seed: random seed.
...: further arguments passed to or from other methods.
caret.method: classification method to use in caret; "glm" and "rf" are currently tested.
fitControl: controls the computational nuances of caret's train function; see trainControl.
Hclust.model: hierarchical clustering model (an hclust object).
RFdist: RF distance matrix computed using RFdist.
Details

UnsupRFhclust evaluates the predictive strength of the clusters using base glm directly, while UnsupRFhclust.caret fits glm through the caret package (caret required). The latter offers an interface to the other classification models available in caret.

The function cluster_internal_validation takes a hierarchical clustering model and a dissimilarity matrix (e.g., the output of RFdist), runs the cluster.stats function in the fpc package for K different numbers of clusters, computes several internal cluster validation metrics, and selects the best number of clusters by majority rule.
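For intuition, the loop inside cluster_internal_validation resembles the following sketch. This is illustrative only (it is not the package's exact code and assumes the fpc package is installed); a Euclidean distance again stands in for the RF dissimilarity.

## Hedged sketch of the cluster.stats loop over candidate k:
library(fpc)
d  <- dist(scale(iris[, -5]))     # stand-in for an RF dissimilarity
hc <- hclust(d, method = "ward.D2")
res <- t(sapply(2:6, function(k) {
  cs <- fpc::cluster.stats(d, cutree(hc, k))
  c(k = k, silwidth = cs$avg.silwidth, dunn = cs$dunn,
    ch = cs$ch, ss = cs$within.cluster.ss)
}))
res  # higher silwidth/dunn/ch and smaller ss suggest a better k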
Value

UnsupRFhclust and UnsupRFhclust.caret return a list with components:

Hclust: hierarchical clustering model.
perf: a data frame with columns AUC, cluster (number of clusters), CV (cross-validation fold), and type (one of the three types of models mentioned in the Description).

cluster_internal_validation returns a list with components:

clusters: cluster memberships for the optimal number of clusters.
kopt: best number of clusters, obtained by majority rule on the table of metrics below.
table: matrix of internal validation metrics with columns:

sep: cluster separation; higher values indicate better clustering.
toother: sum of the average distances of a point in a cluster to points in other clusters; higher values indicate better clustering.
within: sum of average distances within clusters; smaller values indicate better clustering.
between: average distance between clusters; higher values indicate better clustering.
ss: within-cluster sum of squares; smaller values indicate better clustering.
silwidth: average silhouette width; higher values indicate better clustering. See silhouette.
dunn: Dunn index; higher values indicate better clustering.
dunn2: another version of the Dunn index.
wb.ratio: negated ratio of the average between-cluster distance to the average within-cluster distance; smaller (more negative) values indicate better clustering (equivalently, a large positive ratio before negation is better).
ch: Calinski and Harabasz index; higher values are better.
entropy: negative entropy; smaller values are better.
w.gap: sum of the widest within-cluster gaps; smaller values are better.
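As a worked illustration of the majority rule, each metric votes for the k it ranks best, and the most frequent vote becomes kopt. The metric values below are made up for the example; they are not package output.

## Majority-rule vote over made-up metric values:
ks  <- 2:6
tab <- cbind(silwidth = c(0.52, 0.55, 0.48, 0.41, 0.37),
             dunn     = c(0.11, 0.14, 0.09, 0.08, 0.07),
             within   = c(120,  95,   90,   88,   87))
higher_better <- c(TRUE, TRUE, FALSE)
votes <- mapply(function(j, hb)
  ks[if (hb) which.max(tab[, j]) else which.min(tab[, j])],
  seq_len(ncol(tab)), higher_better)
votes                                        # 3 3 6
as.integer(names(which.max(table(votes))))   # kopt = 3 (2 of 3 votes)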
References

Ngufor, C., Warner, M. A., Murphree, D. H., Liu, H., Carter, R., Storlie, C. B., & Kor, D. J. (2017). Identification of Clinically Meaningful Plasma Transfusion Subgroups Using Unsupervised Random Forest Clustering. In AMIA Annual Symposium Proceedings (Vol. 2017, p. 1332). American Medical Informatics Association.
Examples

## Not run:
require(plyr)
require(ggplot2)
data(iris)
dat <- iris
# get Random forest dissimilarity matrix
RF.dist <- RFdist(data=dat[, -5], ntree = 10, no.rep=20,
syn.type = "permute", importance= FALSE)
form <- as.formula(paste0("Species ~ ",
paste0(setdiff(names(dat),c("Species")),collapse = "+")))
# UnsupRFhclust
res <- UnsupRFhclust(formula=form, dat=dat, rfdist = RF.dist$RFdist, K =20,
parallel = FALSE, cv = 5, seed = 123)
tb <- ddply(res$perf, .variables = c("cluster", "type"), .fun = numcolwise(mean) )
pp <- ggplot( ) +
geom_line(data = tb,
aes(x = cluster , y = AUC, colour = type, linetype = type), size = 1.3) +
scale_color_manual(values= c("darkgreen", "darkred", "blue")) +
geom_vline(xintercept = 3, colour = "darkgreen") +
scale_x_continuous(name="Number of clusters",breaks=2:30) + ylab("AUC") +
ylim(c(0.55, 1)) +
theme(axis.title.x=element_text(size=14,face="bold"),
axis.title.y=element_text(size=14,face="bold"),
legend.text = element_text(size=14,face="bold"),
axis.text.x = element_text(size = 13, face="bold",colour = "gray40"),
legend.title = element_text(size=14,face="bold"),
axis.text.y = element_text(size = 13, face="bold",colour = "gray40")) +
scale_linetype_manual(values=c("solid", "solid", "dotted"))
print(pp)
# 3 appears to be the best number of clusters
clusters <- cutree(res$Hclust, 3)
# cluster_internal_validation example
# get hclust object from UnsupRFhclust
HCmod = res$Hclust
rr <- cluster_internal_validation(K = 20, Hclust.model = HCmod,
RFdist = RF.dist$RFdist, seed = 1234)
rr$table
rr$kopt
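# The caret-based variant has the same interface (hedged sketch based on
# the Usage section above; requires the caret package):
res2 <- UnsupRFhclust.caret(formula = form, dat = dat, rfdist = RF.dist$RFdist,
                            K = 20, parallel = FALSE, cv = 5, seed = 123,
                            caret.method = "glm")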
## End(Not run)