HierarchicalSparseCluster: Hierarchical sparse clustering


Description

Performs sparse hierarchical clustering. If $d_{ii'j}$ is the dissimilarity between observations $i$ and $i'$ along feature $j$, the method seeks a sparse weight vector $w$ and then uses $(\sum_j d_{ii'j} w_j)_{ii'}$ as an $n \times n$ dissimilarity matrix for hierarchical clustering.
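To make the construction concrete, the sketch below (not part of the package; object names are illustrative) builds the weighted dissimilarity matrix by hand for a fixed weight vector and passes it to hclust:

  # Illustrative sketch, assuming squared-distance dissimilarities and a fixed,
  # non-sparse weight vector w; this is not the package's internal code.
  set.seed(1)
  x <- matrix(rnorm(20*5), ncol=5)            # n = 20 observations, p = 5 features
  n <- nrow(x); p <- ncol(x)
  w <- rep(1/sqrt(p), p)                      # fixed weight vector with ||w||_2 = 1
  d <- matrix(0, n, n)
  for (j in seq_len(p)) {
    # accumulate sum_j w_j d_ii'j with d_ii'j = (x_ij - x_i'j)^2
    d <- d + w[j] * as.matrix(dist(x[, j]))^2
  }
  hc <- hclust(as.dist(d), method="complete") # cluster on the weighted dissimilarities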

Usage

HierarchicalSparseCluster(x=NULL, dists=NULL,
    method=c("average","complete","single","centroid"),
    wbound=NULL, niter=15,
    dissimilarity=c("squared.distance","absolute.value"),
    uorth=NULL, silent=FALSE, cluster.features=FALSE,
    method.features=c("average","complete","single","centroid"),
    output.cluster.files=FALSE, outputfile.prefix="output",
    genenames=NULL, genedesc=NULL, standardize.arrays=FALSE)
## S3 method for class 'HierarchicalSparseCluster'
print(x,...)
## S3 method for class 'HierarchicalSparseCluster'
plot(x,...)

Arguments

x

An $n \times p$ data matrix; n is the number of observations and p is the number of features. If NULL, then dists must be specified instead.

dists

For advanced users, can be entered instead of x. If HierarchicalSparseCluster has already been run on this data, then the dists value of the previous output can be entered here. Under normal circumstances, leave this argument NULL and pass in x instead.

method

The type of linkage to use in the hierarchical clustering - "single", "complete", "centroid", or "average".

wbound

The L1 bound on w to use; this is the tuning parameter for sparse hierarchical clustering. Should be greater than 1.

niter

The number of iterations to perform in the sparse hierarchical clustering algorithm.

dissimilarity

The type of dissimilarity measure to use. One of "squared.distance" or "absolute.value". Only use this if x was passed in (rather than dists).

uorth

If complementary sparse clustering is desired, then this is the $n \times n$ dissimilarity matrix obtained in the original sparse clustering.

standardize.arrays

Should the arrays be standardized? Default is FALSE.

silent

Should printing of progress be suppressed? By default (FALSE), progress is printed.

cluster.features

Not for use.

method.features

Not for use.

output.cluster.files

Not for use.

outputfile.prefix

Not for use.

genenames

Not for use.

genedesc

Not for use.

...

Not used.

Details

We seek a p-vector of weights $w$ (one per feature) and an $n \times n$ matrix $U$ that optimize

$\mathrm{maximize}_{U,w} \; \sum_j w_j \sum_{i,i'} d_{ii'j} U_{ii'}$ subject to $\|w\|_2 \le 1$, $\|w\|_1 \le \mathrm{wbound}$, $w_j \ge 0$, and $\sum_{i,i'} U_{ii'}^2 \le 1$.

Here, $d_{ii'j}$ is the dissimilarity between observations $i$ and $i'$ along feature $j$. The resulting matrix $U$ is used as a dissimilarity matrix for hierarchical clustering. "wbound" is the tuning parameter for this method: it controls the L1 bound on $w$ and, as a result, the number of features with non-zero weights $w_j$. The non-zero elements of $w$ indicate the features that are used in the sparse clustering.

We optimize the above criterion with an iterative approach: hold U fixed and optimize with respect to w; then hold w fixed and optimize with respect to U; repeat for niter iterations.
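A minimal sketch of this alternating scheme is given below; it assumes the per-feature dissimilarities $d_{ii'j}$ are stacked column-wise in an $(n \cdot n) \times p$ matrix (as in the dists value returned by this function), and the helper names are illustrative rather than the package's internal code.

  # Sketch of the alternating optimization; illustrative only.
  soft <- function(a, delta) pmax(a - delta, 0)       # soft-thresholding operator
  l2norm <- function(v) sqrt(sum(v^2))
  sparse_weights_sketch <- function(dists, wbound, niter=15) {
    p <- ncol(dists)
    w <- rep(1/sqrt(p), p)                             # start from equal weights
    for (iter in seq_len(niter)) {
      # Hold w fixed: the optimal U is proportional to the weighted dissimilarity
      u <- dists %*% w
      u <- u / l2norm(u)
      # Hold U fixed: w is a soft-thresholded, renormalized version of
      # a_j = sum_ii' d_ii'j U_ii', thresholded so that ||w||_1 <= wbound
      a <- as.numeric(crossprod(dists, u))
      w <- a / l2norm(a)
      if (sum(w) > wbound) {
        lo <- 0; hi <- max(a)
        for (k in 1:50) {                              # binary search for the threshold
          delta <- (lo + hi) / 2
          w <- soft(a, delta); w <- w / l2norm(w)
          if (sum(w) > wbound) lo <- delta else hi <- delta
        }
        w <- soft(a, hi); w <- w / l2norm(w)           # hi satisfies the L1 bound
      }
    }
    list(w=w, u=as.numeric(dists %*% w))               # feature weights and weighted dissimilarity
  }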

Note that the arguments described as "Not for use" are included for the sparcl package to function with GenePattern but should be ignored by the R user.

Value

hc

The output of a call to "hclust", giving the results of hierarchical sparse clustering.

ws

The p-vector of feature weights.

u

The $n \times n$ dissimilarity matrix passed into hclust, of the form $(\sum_j w_j d_{ii'j})_{ii'}$.

dists

The $(n \cdot n) \times p$ matrix of per-feature dissimilarities for the data matrix x. This is useful if additional calls to HierarchicalSparseCluster will be made; see the short illustration below.
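As a quick illustration of working with these components (the simulated data and object names below are purely for illustration):

  set.seed(2)
  x <- matrix(rnorm(50*20), ncol=20)
  out <- HierarchicalSparseCluster(x=x, wbound=2, method="complete")
  which(out$ws != 0)      # features given non-zero weight
  plot(out$hc)            # dendrogram built from the weighted dissimilarities
  # Re-fit at a different wbound without recomputing per-feature dissimilarities:
  out2 <- HierarchicalSparseCluster(dists=out$dists, wbound=3, method="average")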

Author(s)

Daniela M. Witten and Robert Tibshirani

References

Witten and Tibshirani (2009) A framework for feature selection in clustering.

See Also

HierarchicalSparseCluster.permute, KMeansSparseCluster, KMeansSparseCluster.permute

Examples

  # Generate 2-class data
  set.seed(1)
  x <- matrix(rnorm(100*50),ncol=50)
  y <- c(rep(1,50),rep(2,50))
  x[y==1,1:25] <- x[y==1,1:25]+2
  # Do tuning parameter selection for sparse hierarchical clustering
  perm.out <- HierarchicalSparseCluster.permute(x, wbounds=c(1.5,2:6),
      nperms=5)
  print(perm.out)
  plot(perm.out)
  # Perform sparse hierarchical clustering
  sparsehc <- HierarchicalSparseCluster(dists=perm.out$dists,
      wbound=perm.out$bestw, method="complete")
  # (This is faster than calling
  #    HierarchicalSparseCluster(x=x, wbound=perm.out$bestw, method="complete")
  #  because the per-feature dissimilarities in perm.out$dists are reused.)
  par(mfrow=c(1,2))
  plot(sparsehc)
  plot(sparsehc$hc, labels=rep("", length(y)))
  print(sparsehc)
  # Plot using knowledge of class labels in order to compare true class
  #   labels to clustering obtained
  par(mfrow=c(1,1))
  ColorDendrogram(sparsehc$hc,y=y,main="My Simulated Data",branchlength=.007)
  # Now, what if we want to see whether our data contain a *secondary*
  #   clustering after accounting for the first one obtained? We
  #   look for a complementary sparse clustering:
  sparsehc.comp <- HierarchicalSparseCluster(x,wbound=perm.out$bestw,
     method="complete",uorth=sparsehc$u)
  # Redo the analysis, but this time use "absolute value" dissimilarity:
  perm.out <- HierarchicalSparseCluster.permute(x, wbounds=c(1.5,2:6),
    nperms=5, dissimilarity="absolute.value")
  print(perm.out)
  plot(perm.out)
  # Perform sparse hierarchical clustering
  sparsehc <- HierarchicalSparseCluster(dists=perm.out$dists, wbound=perm.out$bestw,
      method="complete", dissimilarity="absolute.value")
  par(mfrow=c(1,2))
  plot(sparsehc)

Example output

Running sparse hierarchical clustering on unpermuted data
Running sparse hierarchical clustering on permuted data
Permutation  1  of  5
Permutation  2  of  5
Permutation  3  of  5
Permutation  4  of  5
Permutation  5  of  5

Tuning parameter selection results for Sparse Hierarchical Clustering:
  Wbound # Non-Zero W's Gap Statistic Standard Deviation
1    1.5              5        0.0594             0.0012
2    2.0              9        0.0801             0.0012
3    3.0             16        0.0953             0.0011
4    4.0             23        0.1000             0.0004
5    5.0             37        0.0960             0.0003
6    6.0             50        0.0790             0.0003
Tuning parameter that leads to largest Gap statistic:  4

Wbound is  4 :
Number of non-zero weights:  23
Sum of weights:  3.999973

Running sparse hierarchical clustering on unpermuted data
Running sparse hierarchical clustering on permuted data
Permutation  1  of  5
Permutation  2  of  5
Permutation  3  of  5
Permutation  4  of  5
Permutation  5  of  5

Tuning parameter selection results for Sparse Hierarchical Clustering:
  Wbound # Non-Zero W's Gap Statistic Standard Deviation
1    1.5              4        0.0307             0.0016
2    2.0              9        0.0369             0.0011
3    3.0             14        0.0401             0.0006
4    4.0             23        0.0406             0.0004
5    5.0             33        0.0381             0.0004
6    6.0             50        0.0281             0.0002
Tuning parameter that leads to largest Gap statistic:  4
