```
knitr::opts_chunk$set(echo = TRUE, fig.width = 6, fig.height = 6)
```

The `Mercator` package is intended to facilitate the exploratory analysis
of data sets. It consists of two main parts, one devoted to tools
for binary matrices, and the other focused on visualization. These visualization
tools can be used with binary, continuous, categorical, or mixed data, since they
only depend on a distance matrix. Each distance matrix can be visualized
with multiple techniques, providing a consistent interface to
thoroughly explore the data set. In this vignette, we illustrate the visualization
of a continuous data set.

First we load the package.

```
suppressMessages( suppressWarnings( library(Mercator) ) )
```

Now we load a "fake" set of synthetic continuous data that comes with the `Mercator` package. We will use this data set to illustrate the visualization methods.

```
set.seed(36766)
data(fakedata)
ls()
dim(fakedata)
dim(fakeclin)
```

The `Mercator` package currently supports visualization of data with methods
that include standard techniques (hierarchical clustering) and large-scale visualizations:
multidimensional scaling (MDS), t-distributed Stochastic Neighbor Embedding (t-SNE),
and iGraph network layouts.
In order to create a `Mercator` object, we must provide

- a distance matrix,
- a character string describing the metric that was used,
- a character string defining an initial visualization method,
- and a desired number of clusters.

We are going to start with hierarchical clustering, with an arbitrarily assigned number of 4 groups.

```
mercury <- Mercator(dist(t(fakedata)), "euclid", "hclust", 4)
summary(mercury)
```

Here is a "view" of the dendrogram produced by hierarchical clustering. Note that
`view` is an argument to the plot function for `Mercator` objects.
If omitted, the first view in the list is used.

```
plot(mercury, view = "hclust")
```

The dendrogram suggests that there might actually be more than 4 subtypes in the data, but we're going to wait until we see some other views of the data before doing anything about that.

`Mercator` can use t-distributed Stochastic Neighbor Embedding
(t-SNE) plots for visualizing large-scale, high-dimensional data in a
2-dimensional space.

```
mercury <- addVisualization(mercury, "tsne")
plot(mercury, view = "tsne", main = "t-SNE; Euclidean Distance")
```

The t-SNE plot also suggests more than four subtypes; perhaps as many as seven or eight.

Optional t-SNE parameters, such as perplexity, can be used to fine-tune
the plot when the visualization is created. Using `addVisualization`
to create a new, tuned plot of an existing type overwrites the
existing plot of that type.

```
mercury <- addVisualization(mercury, "tsne", perplexity = 15)
plot(mercury, view = "tsne", main = "t-SNE; Euclidean Distance; perplexity = 15")
```

`Mercator` allows visualization of multi-dimensional scaling (MDS) plots, as well.

```
mercury <- addVisualization(mercury, "mds")
plot(mercury, view = "mds", main = "MDS; Euclidean Distance")
```

Interestingly, the MDS plot (which is equivalent to principal components analysis, PCA, when used with Euclidean distances) doesn't provide clear evidence of more than three or four subtypes. That's not surprising, since groups separated in high dimensions can easily be flattened by linear projections.
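That equivalence is easy to verify directly. As a standalone sketch (using simulated data rather than the vignette's `fakedata`), classical MDS coordinates from `cmdscale` match PCA scores up to a sign flip of each axis:

```
set.seed(42)
X <- matrix(rnorm(100 * 5), nrow = 100)    # 100 samples in 5 dimensions
mds <- cmdscale(dist(X), k = 2)            # classical MDS on Euclidean distances
pca <- prcomp(X, center = TRUE)$x[, 1:2]   # first two principal component scores
max(abs(abs(mds) - abs(pca)))              # essentially zero
```

Because each axis is only determined up to a reflection, the comparison is made on absolute values.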

`Mercator` can visualize complex networks using iGraph. In the next chunk of
code, we add an iGraph visualization. We then look at the resulting graph using three
different "layouts". The `Q` parameter is a cutoff on the distance used
to include edges; if omitted, it defaults to the 10th percentile. We arrived at the value
`Q = 24` shown here by trial and error, though one could plot a histogram of the
distances (via `hist(mercury@distance)`) to make a more informed choice.
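A sketch of that histogram approach, using a simulated set of distances here as a stand-in for `mercury@distance`:

```
set.seed(1)
d <- as.numeric(dist(matrix(rnorm(50 * 10), nrow = 50)))  # stand-in for mercury@distance
hist(d, breaks = 30, main = "Pairwise Distances")
abline(v = quantile(d, 0.10), lty = 2)                    # default cutoff: 10th percentile
```

Picking a cutoff near a low quantile keeps only the shortest (strongest) distances as edges, so the graph stays sparse enough to read.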

```
set.seed(73633)
mercury <- addVisualization(mercury, "graph", Q = 24)
plot(mercury, view = "graph", layout = "tsne", main = "t-SNE Layout")
plot(mercury, view = "graph", layout = "mds", main = "MDS Layout")
plot(mercury, view = "graph", layout = "nicely", main = "'Nicely' Layout")
```

The last layout, in this case, is possibly not so nice.

We can use the `getClusters` function to determine the cluster assignments and
use these for further manipulation. For example, we can easily determine cluster size.

```
my.clust <- getClusters(mercury)
table(my.clust)
```

We might also compare the cluster labels to the "true" subtypes in our "fake" data set.

```
table(my.clust, fakeclin$Type)
```

The `barplot` method produces a version of the "silhouette width" plot
from Kaufman and Rousseeuw (borrowed from the `cluster` package).

```
barplot(mercury)
```

For each observation in the data set, the silhouette width is a measure of how much we believe that it is placed in the correct cluster. Here we see that about 10% to 20% of the observations in each cluster may be incorrectly classified, since their silhouette widths are negative.
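The same check can be made numerically. As a sketch with simulated data (using the `cluster` package directly, not the vignette's objects), the fraction of observations with negative silhouette width is:

```
library(cluster)                           # ships with base R
set.seed(7)
X <- matrix(rnorm(60 * 4), nrow = 60)
fit <- pam(dist(X), k = 3)                 # PAM, the algorithm Mercator uses internally
sil <- silhouette(fit)
mean(sil[, "sil_width"] < 0)               # fraction of possibly misclassified points
```

A silhouette width near +1 means an observation is much closer to its own cluster than to its nearest neighboring cluster; a negative width means the reverse.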

We can "recluster" by specifying a different number of clusters.

```
mercury <- recluster(mercury, K = 8)
plot(mercury, view = "tsne")
```

The silhouette-width barplot changes with the number of clusters. In this case, it suggests that eight clusters may not describe the data as well as four. However, the previous t-SNE plot also shows that the algorithmically derived cluster labels don't seem to match the visible clusters very well.

```
barplot(mercury)
```
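One way to compare candidate numbers of clusters directly is the average silhouette width, which is largest when clusters are well separated. A sketch with simulated data (again using `cluster::pam` directly, not the vignette's objects):

```
library(cluster)
set.seed(7)
X <- matrix(rnorm(60 * 4), nrow = 60)
d <- dist(X)
avg.width <- sapply(2:10, function(k) pam(d, k)$silinfo$avg.width)
names(avg.width) <- 2:10
round(avg.width, 3)                        # larger values indicate better-separated clusters
```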

The clustering algorithm used within `Mercator` is partitioning around
medoids (PAM). You can run any clustering algorithm of your choice and assign the
resulting cluster labels to the `Mercator` object. As part of our
visualizations, we have already performed hierarchical clustering. So, we can assign
cluster labels by cutting the branches of the dendrogram. We can use the
`cutree` function after extracting the dendrogram from the view. (Note
that we use the `remapColors` function here to try to keep the same color
assignments for the PAM-defined clusters and the hierarchical clusters.)

```
hclass <- cutree(mercury@view[["hclust"]], k = 8)
neptune <- setClusters(mercury, hclass)
neptune <- remapColors(mercury, neptune)
```

```
plot(neptune, view = "tsne")
```

The assignments by hierarchical clustering appear to be more consistent with the t-SNE plot than the PAM clusters, though one suspects that the assignments among the pink, red, and orchid groups may be difficult. The silhouette width barplot (below) confirms that hierarchical clustering works better than PAM on this data set. Only the "red" group #4 contains a large number of apparently misclassified samples.

```
barplot(neptune)
```

For our fake data set, since we simulated it, we know the "true" labels. So, we can "recluster" using the true assignments.

```
venus <- setClusters(neptune, fakeclin$Type)
venus <- remapColors(neptune, venus)
plot(venus, view = "tsne")
```

```
barplot(venus)
```

We can also see how the hierarchical clustering compares to the true cluster assignments.

```
table(getClusters(neptune), getClusters(venus))
```
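A single-number summary of such a cross-tabulation is the adjusted Rand index, which equals 1 for identical partitions (even with permuted labels) and is near 0 for unrelated ones. A base-R sketch (the helper `ari` is ours, not part of `Mercator`):

```
# Adjusted Rand index computed from a contingency table
ari <- function(a, b) {
  tab <- table(a, b)
  comb2 <- function(n) n * (n - 1) / 2     # "n choose 2"
  sum.ij <- sum(comb2(tab))                # agreeing pairs within cells
  sum.i  <- sum(comb2(rowSums(tab)))
  sum.j  <- sum(comb2(colSums(tab)))
  n      <- comb2(sum(tab))
  expected <- sum.i * sum.j / n            # chance-level agreement
  (sum.ij - expected) / ((sum.i + sum.j) / 2 - expected)
}
ari(c(1, 1, 2, 2, 3, 3), c(1, 1, 2, 2, 3, 3))   # identical partitions give 1
```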

This analysis was performed in the following environment:

```
sessionInfo()
```
