\renewcommand{\vec}[1]{\boldsymbol{#1}}
```r
knitr::opts_chunk$set(echo = TRUE)
```
In this note, we describe the role of weights in convex clustering and biclustering, the default weight scheme used in `CARP` and `CBASS`, and how to implement custom weighting schemes in `clustRviz`.
Recall the convex clustering problem, as discussed in Hocking et al. [-@Hocking:2011], Chi and Lange [-@Chi:2015], and Tan and Witten [-@Tan:2015], among others:
$$ \underset{\vec U}{\textrm{minimize}} \;\; \frac{1}{2} \| \vec X - \vec U \|_F^2 + \lambda \sum_{l < m} w_{l,m} \| \vec u_l - \vec u_m \|_q $$
We typically take $q = 2$, though both `CARP` and `CBASS` support $q = 1$ as well.
It is well-known that, as $\lambda$ increases, the solution to this problem traces out a continuous set of clustering solutions,^[The so-called "cluster-path" in the terminology of Hocking et al. [-@Hocking:2011].] ranging from each observation being its own cluster at $\lambda = 0$ to all observations being fused into a single cluster as $\lambda$ becomes large. Less well-studied in the literature is the effect of the weight terms $\{w_{l, m}\}_{1 \leq l < m \leq n}$. Hocking et al. [-@Hocking:2011] noted that a Gaussian kernel weighting scheme of the form
$$w_{l, m} \propto \exp\left(-\phi \|\vec x_l - \vec x_m\|_2^2\right)$$
performed well in experiments, giving a clustering solution that respected the density of the data. Chi and Lange [-@Chi:2015] confirm the usefulness of this weighting scheme, and note that, for realistic data sets, many of these weights are quite small. To improve computational efficiency, they recommend dropping many of the smallest weights, reducing the penalty (fusion) term from a sum of $\binom{n}{2}$ terms to a much more tractable size. Since the dropped weights are typically several orders of magnitude smaller than the retained weights, omitting these terms has minimal impact on the resulting solution.
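As a rough illustration of this idea (a minimal sketch, not `clustRviz`'s internal code; the toy data, scale parameter, and retention threshold below are all arbitrary), one can compute the dense Gaussian kernel weights and then zero out the smallest of them:

```r
## Toy illustration of Gaussian kernel weights and sparsification.
## Not clustRviz's implementation; phi and the 75% threshold are arbitrary.
X        <- matrix(rnorm(50 * 3), nrow = 50)  # 50 toy observations
sq_dists <- as.matrix(dist(X))^2              # pairwise squared Euclidean distances
phi      <- 0.5                               # kernel scale (illustrative)
w        <- exp(-phi * sq_dists)              # dense Gaussian kernel weights
diag(w)  <- 0                                 # no self-fusion terms
w[w < quantile(w[upper.tri(w)], 0.75)] <- 0   # drop the smallest 75% of weights
```

In practice, one cannot drop weights arbitrarily: as we discuss below, the retained weights must still induce a connected graph.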
Our experience developing `CARP` and `CBASS` is similar: the Gaussian kernel weights work well in practice, and omitting smaller weights leads to significant increases in computational efficiency. For fully interactive data analysis, we find that using a relatively sparse set of weights is important, even with the speed advantages given by `CARP` and `CBASS`.
We note two properties that any sensible weighting scheme must satisfy and which are enforced internally by our software:

- *non-negativity*: all weights must be non-negative, $w_{l, m} \geq 0$; and
- *connectedness*: the graph induced by the non-zero weights must be connected.

If the connectedness condition is not satisfied, the data are never fused into a single cluster, even as $\lambda \to \infty$, so our algorithms never terminate. Relatedly, when that condition fails, the clustering problem is fully separable across groups of observations and data from different parts of the graph are "in the dark" about each other: the same solution could be obtained by clustering the two sets of observations separately, which is rarely what we want in a clustering algorithm.
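As a quick sanity check (a base-R sketch, not the validation code `clustRviz` itself runs), one can verify connectedness with a breadth-first search over the non-zero weights:

```r
## Check that the graph induced by non-zero weights is connected:
## breadth-first search from observation 1 must reach every observation.
is_connected <- function(w) {
  n <- nrow(w)
  visited <- rep(FALSE, n)
  visited[1L] <- TRUE
  frontier <- 1L
  while (length(frontier) > 0) {
    reachable <- colSums(w[frontier, , drop = FALSE] > 0) > 0
    frontier  <- which(reachable & !visited)
    visited[frontier] <- TRUE
  }
  all(visited)
}
```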
Chi, Allen, and Baraniuk [-@Chi:2017] propose a convex formulation of biclustering:
$$ \underset{\vec U}{\textrm{minimize}} \;\; \frac{1}{2} \| \vec X - \vec U \|_F^2 + \lambda \left(\sum_{l < m} w_{l,m} \| \vec U_{l\cdot} - \vec U_{m\cdot} \|_q + \sum_{j < k} w_{j, k} \|\vec U_{\cdot j} - \vec U_{\cdot k}\|_q\right) $$
Here we doubly fuse our data: combining row-wise fusion (the first part of the penalty term) to cluster observations with column-wise fusion (the second part of the penalty term) to cluster features. The resulting problem induces biclustering; see Chi et al. [-@Chi:2017] for more details.
As Chi et al. note, an additional requirement on the weights is necessary to obtain sensible solutions: the row weights must sum to $n^{-1/2}$ and the column weights must sum to $p^{-1/2}$.^[Note that Chi et al. require the column weights to sum to $n^{-1/2}$ and the row weights to sum to $p^{-1/2}$, because their data matrix $X \in \mathbb{R}^{p \times n}$ is the transpose of ours.]
This requirement is automatically enforced by `CBASS`: after the weights are calculated, they are (silently) renormalized to have the appropriate sums.
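The renormalization amounts to a simple rescaling. A sketch of the idea (the function name and exact form are our illustration, not `CBASS`'s internal code):

```r
## Rescale a weight matrix so its upper-triangular entries sum to a target
## value, e.g. 1 / sqrt(n) for row weights and 1 / sqrt(p) for column
## weights. Illustrative only.
renormalize_weights <- function(w, target_sum) {
  w * (target_sum / sum(w[upper.tri(w)]))
}
## e.g., row_w <- renormalize_weights(row_w, 1 / sqrt(nrow(X)))
##       col_w <- renormalize_weights(col_w, 1 / sqrt(ncol(X)))
```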
## `clustRviz`'s Default Weighting Scheme

As noted above, "sparsified" Gaussian RBF weights provide a robust and useful trade-off between statistical and computational performance. As such, they are the default weighting scheme used for both `CARP` and `CBASS`.
When interactively visualizing clustering results, we generally want our clusters to be joined together smoothly. To achieve this goal, we select the weight parameter $\phi$ to maximize the variance of the clustering weights. While this does not guarantee the smoothest possible clustering, we have found it to be a useful heuristic. That is, if not specified by the user, we select:
$$\hat{\phi} = \underset{\phi}{\textrm{arg max}}\;\; \text{Var}\left(\exp\left\{-\phi \|\vec x_i - \vec x_j\|_2^2\right\}\right)$$
where the variance is taken over all pairs $i, j$.^[The maximization is not performed exactly; instead, we take the maximum over a relatively coarse grid of candidate values.]
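A sketch of this heuristic (the grid below is our illustration; `clustRviz`'s actual grid may differ):

```r
## Select phi by maximizing the variance of the kernel weights over a
## coarse grid of candidate values (illustrative grid).
select_phi <- function(X, grid = 10^seq(-5, 2, length.out = 50)) {
  sq_dists  <- as.vector(dist(X))^2
  variances <- vapply(grid, function(phi) var(exp(-phi * sq_dists)), numeric(1))
  grid[which.max(variances)]
}
```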
This is implemented in `clustRviz` via the `dense_rbf_kernel_weights` function. Note that this is a function factory: it returns a function which can be called to actually calculate the weights:
```r
library(clustRviz)

get_upper_triangle <- function(x) as.vector(x[upper.tri(x)])

weight_func <- dense_rbf_kernel_weights()
weights     <- weight_func(presidential_speech)$weight_mat
weights[1:5, 1:5]
```
We see here that $\phi$ was automatically selected, yielding a variance of

```r
var(get_upper_triangle(weights))
```

and a nice "spread" of weights:
```r
hist(get_upper_triangle(weights),
     col    = "grey80",
     border = "white",
     breaks = 30,
     xlab   = "Dense RBF Kernel Weight Value",
     main   = "Highly Variable RBF Weights")
```
If we had prior information about a good choice of the scale parameter $\phi$, we could supply it to the initial call to `dense_rbf_kernel_weights`:

```r
weight_func_phi_1 <- dense_rbf_kernel_weights(phi = 1)
weights_phi_1     <- weight_func_phi_1(presidential_speech)$weight_mat
weights_phi_1[1:5, 1:5]
```
Not surprisingly, this gives weights with much lower variance than before:

```r
var(get_upper_triangle(weights_phi_1))
```
So far, we have used (squared) Euclidean distance to define our kernel, but we can in fact use any distance function $d$ to calculate weights as:
$$w_{i, j} \propto \exp\left\{-\phi \, d(\vec x_i, \vec x_j)^2\right\}$$
To use an alternate distance function $d(\cdot, \cdot)$, pass the `dist.method` and `p` arguments to `dense_rbf_kernel_weights`. For example, if we wanted to use the $\ell_4$ metric, we could use:
```r
dense_rbf_kernel_weights(dist.method = "minkowski", p = 4)
```
See the `dist` function in the `stats` package for supported distances.
Once we have a dense set of weights, we typically wish to sparsify them to improve computational speed. There are several ways to do so, but perhaps the simplest is to take the $k$-nearest neighbors graph for some $k$. That is, we zero out most entries in the weight matrix, keeping $w_{ij}$ only if $i$ is a $k$-nearest neighbor of $j$ or vice versa.
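A sketch of this construction (illustrative only; the `sparse_rbf_kernel_weights` factory described next handles this internally), where `dists` is the matrix of pairwise distances used to define the kernel:

```r
## Keep w[i, j] only if j is among the k nearest neighbors of i, or vice
## versa (the symmetric "union" k-nearest-neighbor graph). Illustrative only.
knn_sparsify <- function(w, dists, k) {
  n    <- nrow(w)
  keep <- matrix(FALSE, n, n)
  for (i in seq_len(n)) {
    nbrs <- order(dists[i, ])[2:(k + 1)]  # skip self (distance zero)
    keep[i, nbrs] <- TRUE
  }
  keep <- keep | t(keep)                  # symmetrize the neighbor relation
  w * keep
}
```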
Sparse weights can be calculated via the `sparse_rbf_kernel_weights` function factory, which works like the `dense_rbf_kernel_weights` function but takes an extra optional parameter $k$. By default, $k$ is chosen as small as possible, subject to the graph still being connected:
```r
weight_func    <- sparse_rbf_kernel_weights()
weight_res     <- weight_func(presidential_speech)
weight_mat     <- weight_res$weight_mat
weight_details <- weight_res$type
weight_mat[1:5, 1:5]
```
We immediately see that these weights are significantly more sparse than before. In fact, for this example, approximately 88% of the weights have been zeroed out:

```r
round(100 * mean(get_upper_triangle(weight_mat) == 0))
```
```r
hist(get_upper_triangle(weight_mat),
     col    = "grey80",
     border = "white",
     breaks = 30,
     xlab   = "Sparse RBF Kernel Weight Value",
     main   = "KNN Sparsified RBF Weights")
```
We see here that $k = 4$ was the smallest $k$ that gave a fully connected graph:

```r
weight_details$k
```
If the user supplied a smaller $k$ (resulting in a non-connected graph), an error would be thrown:

```r
sparse_rbf_kernel_weights(k = 3)(presidential_speech)
```
The distance metric used can be changed by passing additional arguments to the function factory, as before:

```r
sparse_rbf_kernel_weights(dist.method = "canberra")(presidential_speech)
```
## `CARP` Weights

By default, `CARP` uses sparse RBF kernel weights, with data-driven $\phi$ and $k$ and the Euclidean distance. These can be changed by passing the result of `sparse_rbf_kernel_weights` to `CARP`. For example, if we wanted to use the Canberra distance with `k = 10` neighbors, we would call `CARP` as:

```r
CARP(presidential_speech,
     weights = sparse_rbf_kernel_weights(dist.method = "canberra", k = 10))
```
Note that the `weights` function is called on the pre-processed data matrix, not the raw data matrix.
## `CBASS` Weights

`CBASS` requires two sets of weights, one for rows and one for columns. The interface is the same as for `CARP`, but we can now supply the `row_weights` and `col_weights` arguments separately. The former is used like the `weights` argument to `CARP`; the latter is called on the transpose of the pre-processed data, since it is used to calculate column weights. The weights are computed independently (with possibly different choices of $\phi$ and $k$) and can be controlled separately, e.g.:

```r
CBASS(presidential_speech,
      col_weights = sparse_rbf_kernel_weights(dist.method = "canberra", k = 4),
      row_weights = sparse_rbf_kernel_weights(dist.method = "maximum", phi = 2))
```
As mentioned above, `CBASS` rescales the resulting weights to ensure proper biclustering.
## Custom Weighting Schemes

Even though the sparse Gaussian RBF scheme is generally a good default choice, users may wish to use different weighting schemes with `clustRviz`, particularly if additional information or domain knowledge is available. To this end, both `CARP` and `CBASS` support custom weighting schemes. We describe the different interfaces to this functionality in `CARP` below, and note that `CBASS` has the same behavior.
The easiest way to use custom weights with `clustRviz` is to supply a weight matrix as the `weights` argument. `clustRviz` will perform some basic correctness checks, but will otherwise use your weights "as is."
For example, if we wanted to use "chain" weights to cluster the presidents in alphabetical order, we could do so as follows:
```r
n_obs <- NROW(presidential_speech)

## "Chain" weights: connect each president only to his alphabetical
## neighbors, giving a path graph on the rows of the data matrix
weight_mat <- matrix(0, nrow = n_obs, ncol = n_obs)
weight_mat[cbind(seq(1, n_obs - 1), seq(2, n_obs))] <- 1
weight_mat[cbind(seq(2, n_obs), seq(1, n_obs - 1))] <- 1

image(weight_mat)
```
```r
carp_fit_chain_weights <- CARP(presidential_speech, weights = weight_mat)
print(carp_fit_chain_weights)
```
We note that the resulting clustering is essentially nonsense, but it does, for the most part, respect our weights:
```r
plot(carp_fit_chain_weights)
```
If a weight scheme is used repeatedly, it may be useful to wrap it in a function which will then be called on the pre-processed data. For example, our "chain weight" example from above could be written as:
```r
chain_weights <- function(X) {
  n_obs <- NROW(X)
  weight_mat <- matrix(0, nrow = n_obs, ncol = n_obs)
  weight_mat[cbind(seq(1, n_obs - 1), seq(2, n_obs))] <- 1
  weight_mat[cbind(seq(2, n_obs), seq(1, n_obs - 1))] <- 1
  weight_mat
}

CARP(presidential_speech, weights = chain_weights)
```
Writing the weight scheme as a function is particularly useful for biclustering, e.g.:
```r
cbass_chain_fit <- CBASS(presidential_speech,
                         row_weights = chain_weights,
                         col_weights = chain_weights)

plot(cbass_chain_fit, type = "col.dendrogram")
```
Note that the print method for `CARP` and `CBASS` knows that the weights were computed by a user-supplied function, but cannot give more information than that. It is possible to provide a more complex function which will lead to more informative output, though that is beyond the scope of this vignette.
Nagorski and Allen [-@Nagorski:2018] propose the use of a spatial weighting scheme to cluster genomic regions. It is easy to implement a version of their scheme using `clustRviz`'s support for custom weight schemes:
```r
spacc_carp <- function(X, coordinates, ...,
                       dist.cutoff = 20000, ## Distances from SpaCC_Methy in the SpaCCr package
                       sigma = 2e-4) {
  ## This is not a robust / tested implementation
  ##
  ## It is provided for demonstration purposes only
  spacc_weights <- function(X) {
    distance_mat <- as.matrix(dist(coordinates))
    exp(-sigma * distance_mat) * (distance_mat < dist.cutoff)
  }

  CARP(t(X), weights = spacc_weights, ...)
}
```
Note that this implementation will not work as-is on the `SpaCC_Methy` example in the `SpaCCr` package because the adjacency graph implied by the SpaCC scheme on that example is not fully connected (the probes were at well-separated genomic locations). A more clever implementation would take advantage of this disconnectedness to separate the problem into disjoint subproblems, as sketched below.
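For instance, one could label the connected components of the weight graph (reusing the breadth-first search idea from the connectedness check above) and run `CARP` on each component separately. This is an illustrative sketch, untested against `SpaCCr`; `connected_components` and `carp_by_component` are our own helper names, not part of the package:

```r
## Label the connected components of the graph induced by non-zero weights.
connected_components <- function(w) {
  n <- nrow(w)
  labels <- rep(NA_integer_, n)
  comp <- 0L
  for (s in seq_len(n)) {
    if (!is.na(labels[s])) next
    comp <- comp + 1L
    labels[s] <- comp
    frontier <- s
    while (length(frontier) > 0) {
      reachable <- colSums(w[frontier, , drop = FALSE] > 0) > 0
      frontier  <- which(reachable & is.na(labels))
      labels[frontier] <- comp
    }
  }
  labels
}

## Run CARP separately on each connected component of a custom weight matrix.
carp_by_component <- function(X, weight_mat, ...) {
  comps <- connected_components(weight_mat)
  lapply(split(seq_len(NROW(X)), comps), function(idx) {
    CARP(X[idx, , drop = FALSE],
         weights = weight_mat[idx, idx, drop = FALSE], ...)
  })
}
```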