knitr::opts_chunk$set(echo = TRUE, fig.width=6,fig.height=5,echo=TRUE)

Modern biological experiments are increasingly producing interesting matrices.
These may represent the presence or absence of
specific gene mutations, copy number variants, microRNAs, or other
molecular or clinical phenomena. We have recently developed a tool,
*CytoGPS* [^Abrams and colleagues],
that converts conventional karyotypes from the standard text-based
notation (the International Standard for Human Cytogenetic Nomenclature;
ISCN) into a binary vector with three bits (loss, gain, or fusion) per
cytoband, which we call the "LGF model".

The `Mercator` package is intended to facilitate the
exploration of data sets. It implements metrics for continuous,
categorical, and binary data, including a subset of the 76
binary distance metrics described by [^Choi and colleagues], ensuring that
at least one representative of each of their major clusters is
included. Each resulting distance matrix can be combined with multiple
visualization techniques, providing a consistent interface to
thoroughly explore the data set.

In this vignette, we focus initially on the tools for dealing with binary matrices. We start by loading the package.

suppressMessages( suppressWarnings( library(Mercator) ) )

We proceed with a model dataset of karyotypes from patients with Chronic Myelogenous Leukemia (CML) with 400 chromosomal features recorded over 740 patients, from the public Mitelman database. Cytogenetic abnormalities recorded in Mitelman as text strings have been pre-processed with CytoGPS into binary vectors. For the sake of clarity and efficiency we have chosen a subset of patients and features for this example.

filename <- system.file("Examples/Mercator_Test_Data.csv", package="Mercator") my.data <- read.csv(filename, header=TRUE) dim(my.data)

Many of the functions of the `Mercator` package operate on a `BinaryMatrix`
S4 object, which serves as an input to subsequent functions and visualizations.

A `BinaryMatrix` object is formed from a matrix containing integer or numeric
values. Although `Mercator` was intially designed for processing and visualizing
binary data, the `BinaryMatrix` object and subsequent functions accept a variety
of integer and numeric values. In addition, all the visualization tools work with
an arbitrary distance matrix; see the vignette Mercator for Continuous Data.

Row and column names, along with optional annotations, can be assigned as for data frames.
If no row or column headings are assigned, the `BinaryMatrix` takes the row and
column names of the input matrix by default. Notice that the resulting object always
includes a "history" element that tracks how it has been processed.

my.data <- as.matrix(my.data) my.binmat <- BinaryMatrix(my.data) summary(my.binmat)

We wish to cluster the whole karyotypes of each patient to identify the
patterns of important chromosomal abnormalities that link them. To proceed,
we must transpose the `BinaryMatrix`. Transposition meaningfully transposes
the row and column headings of the `BinaryMatrix`, as well.

my.binmat <- t(my.binmat) summary(my.binmat)

The binary vectors are rarely unique.
Patients who have identical vectors of features can complicate some of the
clustering and visualization routines that we want to use (often by
introducing a division by zero). Similarly, features that have identical
occurence vectors in a patient cohort can also alter the
biological implications by automatically giving more "weight" to a
single genomic event (like a trisomy or monosomy). To deal with this
issue, the `Mercator` package includes a function to remove
duplicate or redundant binary vectors. Here we are going to remove
duplicate patients; that is, those with identical karyotypes.

my.binmat <- removeDuplicates(my.binmat) summary(my.binmat)

In the case of our data, only 136 of the 740 patient karyotypes are unique. Some of the karyotypes are "not used" in the sense that they contain none of the abnormalities selected for this limited subset.

length(my.binmat@info$notUsed) head(my.binmat@info$notUsed)

By contrast, many features are *used* but are simply redundant.

length(my.binmat@info$redundant)

`Mercator` provides easy access to functions of the `Thresher` R package,
which includes outlier detection and estimates of the number of clusters [^Wang and colleagues].
The underlying idea is that the features can be viewed as
"weight vectors" in a principal component space trying to display the samples.
The lengths of the vectors are a measure of their importance in the data set;
short vectors can (and probably should) be removed since they do not carry
much useful information. We have incorporated that feature into the
`Mercator` package.

We create a `ThreshedBinaryMatrix` to implement the algorithm.
Based on simulations described in the Thresher paper, a cutoff above approximately 0.3
can be used to select informative features. Then, we subset our `ThreshedBinaryMatrix`
to only include features above our given cutoff.

set.seed(21348) my.binmat <- threshLGF(my.binmat, cutoff=0.3) summary(my.binmat)

The red vertical line in the figure indicates the cutoff we have chosen to separate uninformative features (<0.3) from informative ones (>0.3).

Delta <- my.binmat@thresher@delta hist(Delta, breaks=20, main="", xlab="Weight", col="gray") abline(v=0.3, col='red')

The `ThreshedBinaryMatrix` object contains a `reaper` slot
that estimates the number of principal components and the number of clusters
after outliers have been removed. These values can be viewed numerically...

my.binmat@reaper@pcdim my.binmat@reaper@nGroups

... or they can be visualized with an *Auer-Gervini plot* (where we
are looking for a "long step") ...

plot(my.binmat@reaper@ag, ylim=c(0, 30)) abline(h=my.binmat@reaper@pcdim, col="forestgreen", lwd=2) abline(h=7, col="orange", lwd=2)

pts <- screeplot(my.binmat@reaper, xlim=c(0,30)) abline(v=pts[my.binmat@reaper@pcdim], col="forestgreen", lwd=2) abline(v=pts[7], col="orange", lwd=2)

There is no method that can definitely identify the number of clusters for a "perfect" solution, leaving part of the decision to the analyst's judgment. Auer-Gervini and Scree plots provide informative visualizations to facilitate this interference. The default value provided by the Auer-Gervini analysis (N=2; green) is somewhat conservative. The value provided by the broken-stick model (N=7; orange) overlaid on the scree plot is more aggressive, but is not too unreasonable based on the Auer-Gervini plot. While one could make the argument that the correct number of groups is at least seven, we will begin by using the number of groups recommended by Thresher.

kk <- 5

The `Mercator` package allows visualization of data with methods that include both
standard techniques (hierarchical clustering) and large-scale visualizations
(multidimensional scaling (MDS),
t-distributed Stochastic Neighbor Embedding (t-SNE), and iGraph.)

The `Mercator` package implements or provides access to 10 distance metrics:
Jaccard, Sokal-Michener, Hamming, Russell-Rao, Pearson, Goodman-Kruskal, Manhattan,
Canberra, Binary, and Euclidean. Although some of these metrics
can be used for continuous or categorical data, all are appropriate for
binary matrices. `Mercator` allows the user to compare different metrics
and select one that is useful for a given dataset.

Here, we will use the Jaccard distance because of its ease of
interpretability, common usage, and its appropriatness of application
to asymmetric binary data, such as the binary vector output of CytoGPS
in this dataset.
The `Mercator` constructor can be called with any initial visualization,
and visualizations can be added in an arbitrary order.

jacc.Vis <- Mercator(my.binmat, "jaccard", "hclust", K=kk)

We can represent all the distances between features within the dissimilarity matrix we have calculated on our data as a histogram, as a visual representation of relatedness.

hist(jacc.Vis, xlab="Jaccard Distance", main="Histogram of Distances")

Mercator allows us to implement common, standard visualizations such as hierarchical clustering...

plot(jacc.Vis, view = "hclust")

In this dendrogram, there appear to be at least three different "blue"
branches, suggesting that there are more than the conservative estimate
of five clusters obtained from `Thresher`. We will take a look at
the data using another visualization technique before making a final
decision.

`Mercator` can use t-distributed Stochastic Neighbor Embedding
(t-SNE) plots for visualizing large-scale, high-dimensional data in
2-dimensional space.

jacc.Vis <- addVisualization(jacc.Vis, "tsne", perplexity=25) plot(jacc.Vis, view = "tsne", main="t-SNE; Jaccard Distance")

Optional t-SNE parameters, such as perplexity, can be used to fine-tune
the plot as the visualization is created. Using `addVisualization`
to create a new, tuned plot of an existing type overwrites the
existing plot of that type.

temp.Vis <- addVisualization(jacc.Vis, "tsne", perplexity = 10) plot(temp.Vis, view = "tsne", main="t-SNE; Jaccard Distance; perplexity=10")

It does not appear that a smaller perplexity is useful for this data set.

`Mercator` allows visualization of multi-dimensional scaling (MDS) plots, as well.

jacc.Vis <- addVisualization(jacc.Vis, "mds") plot(jacc.Vis, view = "mds", main="MDS; Jaccard Distance")

At this point, we still don't know if we should increase the number of clusters in the data set form 5 to 7. Fortunately, we have access to the "silhouette width", which measures how well each element in the data set appears to be clustered.

```
barplot(jacc.Vis)
```

Ths plot of the silhouette widths confirms our suspicion from the hierarchical clustering dendrogram that the "blue" group is not as coherent/consistent as the other grousp. So, we are going to try reclustering to label more clusters.

jacc.Vis6 <- recluster(jacc.Vis, K = 6) barplot(jacc.Vis6) jacc.Vis7 <- recluster(jacc.Vis, K = 7) barplot(jacc.Vis7)

There is some improvement when K=6, but a clearly worse result when K = 7. For the rest of our analysis, we will switch to ooking at 6 groups.

kk <- 6 jacc.Vis <- jacc.Vis6 rm(jacc.Vis6, jacc.Vis7)

`Mercator` can be used to visualize complex networks using
iGraph. To improve clarity of the visualization and computational
time, we have implement the `downsample` function to reduce the
number of data points to be linked and visualized.
The idea goes back to Peng Qiu's implementation of the
SPADE clustering algorithm for mass cytometry data. The main point is to *under* sample
the densest regions of the data space to make it more likely that rarer clusters
will still be adequately sampled. (While not required with the current data set,
this idea can be quite useful with data sets containing tens of thousands of objects.)

**Note:** The `Mercator` class includes a "subset" operator that tries to
preserve earlier visualizations. This operator is fast for MDS or t-SNE models, but
is very slow for large hierarchical clustering. (It uses the implementation in the
`dendextend` package, which works by removing a single leaf at a time from
the tree.) In the next code chunk, we `downsample` to create a meaningful subset,
and then compute a new dendrogram for each subset.

X <- jacc.Vis X@view[["hclust"]] <- NULL # remove this view N <- as.matrix(X@distance) set.seed(87530) P <- downsample(40, N, 0.1) # create a downsampled subset J <- X[P] names(J@view) # need to compute a new dendrogram J <- addVisualization(J, "hclust", perplexity=5) names(J@view) plot(J, view = "tsne", main="Down-sampled t-SNE Plot")

Because our data set is small enough, we are going to create our graphical views from the oriinal object rather than the downsampled one. We can look at the resulting graph using three different "layouts".

jacc.Vis <- addVisualization(jacc.Vis, "graph", Q =0.5) plot(jacc.Vis, view = "graph", layout = "tsne", main="T-SNE Layout") plot(jacc.Vis, view = "graph", layout = "mds", main="MDS Layout") plot(jacc.Vis, view = "graph", layout = "nicely", main="Laid Out 'Nicely'", xlim=c(-1,1))

We can use the *getClusters* function to characterize each cluster and
use these for further manipulation.

We can easily determine cluster size...

my.clust <- getClusters(jacc.Vis) tab <- table(my.clust) tab

... or the patients that comprise each cluster.

C <- my.binmat@columnInfo Cl4 <- C[my.clust == 4 ,] Cl4

Finally, we are going to look at what happens if we use a different distance metric.

set.seed(8642) sokal.Vis <- Mercator(my.binmat, "sokal", "tsne", K=kk, peplexity = 10) table(getClusters(sokal.Vis), getClusters(jacc.Vis)) plot(sokal.Vis, view = "tsne", main="t-SNE; Sokal-Michener Distance; perplexity=10")

The two largest groups get assigned the same colors (by chance) in the Jaccard and the
Sokal-Michener clusterings. However, it is not at all clear how to align the smaller
groups. For that purpose we can use the `remapColors` function.

SV <- remapColors(jacc.Vis, sokal.Vis) table(getClusters(SV), getClusters(jacc.Vis)) plot(SV, view = "tsne", main="t-SNE; Sokal-Michener Distance; perplexity=10")

Now colors have been matched (as well as possible) between the two sets of visualizations.

Each `Mercator`

object stores its own palette internally, but you can change
the palette using the `slot`

funciton.

slot(jacc.Vis, "palette") <- c("red", "orange", "green", "blue", "cyan", "magenta", "purple", "black") plot(jacc.Vis, view = "tsne")

If the number of colors in the palette is smaller than the number of clusters, they will be recycled, and other plotting symbols will be introduced.

slot(jacc.Vis, "palette") <- c("red", "green", "blue", "purple") plot(jacc.Vis, view = "tsne")

[^Abrams and colleagues]: Abrams ZB, Zhang L, Abruzzo LV, Heerema NA, Li S, Dillon T, Rodriguez R, Coombes KR, Payne PRO. CytoGPS: A Web-Enabled Karyotype Analysis Tool for Cytogenetics. Bioinformatics. 2019 Jul 2. doi: 10.1093/bioinformatics/btz520. (Epub ahead of print)

[^Choi and colleagues]: Choi SS, Cha SH, Tappert CC. A Survey of Binary Similarity and Distance Measures. Systemics, Cybernetics, and Informatics 2010;8(1):43-48.

[^Wang and colleagues]: Wang M, Abrams ZB, Kornblau SM, Coombes KR. Thresher: determining the number of clusters while removing outliers. BMC Bioinformatics 2018;19(1):9.

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.