The bigMap R-package: quick reference

options(width=80)
knitr::opts_chunk$set(fig.align = 'center', out.width = '70%')

\bigskip \begin{abstract} The bigMap R-package integrates a \textit{mapping method} (\textit{i.e.} a multi-step unsupervised clustering protocol) for large-scale structured data. The protocol follows three steps: a dimensionality reduction of the data, a density estimation over the low-dimensional representation of the data, and a final segmentation of the density landscape. For the dimensionality reduction step we use a parallelized implementation of the well-known \textit{t-distributed Stochastic Neighbor Embedding} (t-SNE) algorithm that significantly alleviates some of its inherent limitations while improving its suitability for large datasets. For the density estimation we use the \textit{perplexity-adaptive Kernel Density Estimation} (paKDE) algorithm, an adaptive KDE specifically coupled with the t-SNE framework to get accurate density estimates out of the embedded data; for the clustering we use the \textit{watertrack transform} (WTT), a variant of the \textit{rainfalling watershed} algorithm, to identify clusters within the density landscape. \end{abstract}

\newpage \tableofcontents \newpage

bigMap: Big data mapping with parallelized t-SNE

The bigMap R-package integrates an improved mapping method (MM) [@Todd:2016; @Garriga:2018] within the R environment [@R]. The bigMap protocol is based on the following three steps:

  1. a dimensionality reduction of the data (ptSNE, Section \ref{sec:bdm.ptsne});
  2. a density estimation over the low-dimensional embedding (paKDE, Section \ref{sec:bdm.pakde});
  3. a segmentation of the density landscape into clusters (WTT, Section \ref{sec:bdm.wtt}).

Package dependencies

The bigMap R-package is intended to be used on multi-core platforms. The most computationally expensive parts of the algorithm are implemented in C++ and use shared memory. Thus, important dependencies of the bigMap package are the Rcpp/RcppArmadillo packages (C++ implementations) and the bigmemory package (shared-memory matrices).

Further dependencies of the package are the parallelization backends, most notably the snow R-package (Sections \ref{sec:SOCK} and \ref{sec:MPI}).

Using the package

library(bigMap)
bdm.local('cluster-ceab.ceab.csic.es')   # address of our local machine (Section \ref{sec:bdm.local})
bdm.mybdm('~/myPath/')                   # default path to save results (Section \ref{sec:bdm.mybdm})

\vskip 0.1in

The workflow of the package is structured around a central object (a common R list()) that represents an instance of a mapping of a dataset. We name this object bdm, from big data mapping, and we will do so throughout this document. The names of the commands in the package also stem from this acronym. The mapping protocol is built up of an ordered sequence of steps that must be preserved: each step attaches its output as a new element in the list, and the new list constitutes the input for the next step. Let's load the example included in the package,

\vskip 0.1in

bdm.example()

\vskip 0.1in

This example illustrates the structure of a bdm object after going through the whole protocol,

\vskip 0.1in

dMap.copy <- exMap$dMap   # set the dMap element aside (reused in Section \ref{sec:bdm.dMap})
exMap$dMap <- NULL
str(exMap)

Mapping data

A mapping instance has several blocks that result from the four main commands of the package. Let's see how we build it.

Initialization: bdm.init()

\label{sec:bdm_init}

Let's consider a matrix of raw data (here we take it from the exMap object itself). This is a small synthetic dataset with $n=1000$ observations drawn from a 4-variate Gaussian mixture model with 16 Gaussian components,

\vskip 0.1in

X <- exMap$data     # let's assume this is our raw data matrix
dim(X)

\vskip 0.2in

A bdm is initialized through the bdm.init() command with, at least, an input dataset (either a matrix or a data.frame) and a name to identify the dataset (the utility of the dataset name is explained later in Section \ref{sec:bdm.fName}). So, in our case,

\vskip 0.1in

myMap <- bdm.init('myDataset', X)
str(myMap)

\vskip 0.1in

myMap is now a bdm instance with three basic elements: myMap\$dSet, a name identifying the dataset; a matrix myMap\$data with the raw data; and a logical value myMap\$is.distance (FALSE by default) indicating that the input matrix does not contain distances but raw observations. As the ptSNE algorithm works with pair-wise distances, we can instead supply a matrix of precomputed distances by including the argument is.distance = TRUE in the command call,

\vskip 0.1in

myDistMap <- bdm.init('myDataset', X, is.distance = TRUE)     # don't run

\vskip 0.1in
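For illustration, such a matrix could be built with base R's dist() (a minimal sketch; note that a full $n \times n$ distance matrix is only feasible for moderately sized datasets),

\vskip 0.1in

D <- as.matrix(dist(X))     # pair-wise Euclidean distances
myDistMap <- bdm.init('myDataset', D, is.distance = TRUE)

\vskip 0.1in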

In the case of a supervised dataset, we can supply a vector of labels to be mapped. This is the case of our synthetic dataset (again we will take the labels from the exMap instance),

\vskip 0.1in

L <- exMap$lbls     # assume this is our vector of labels

\vskip 0.1in

Then, we would use the argument labels = L in the initialization function,

\vskip 0.1in

myMap <- bdm.init('myDataset', X, labels = L)
str(myMap)

\vskip 0.1in

Labels can be of any type that can be factorized. Labels are transformed by means of as.numeric(as.factor()) into a vector of integer labels and attached afterwards as a new element myMap\$lbls.
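\vskip 0.1in

For instance, character labels are coerced as follows (levels are sorted alphabetically),

\vskip 0.1in

as.numeric(as.factor(c('frog', 'toad', 'frog', 'newt')))     # yields 1 3 1 2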

First step: bdm.ptsne()

\label{sec:bdm.ptsne}

bdm.ptsne() performs the dimensionality reduction by means of the ptSNE algorithm. The ptSNE runs several instances of the t-SNE on different chunks of the data (partial t-SNEs) using an alternating scheme of short runs and mixing of the partial solutions. The performance of the ptSNE algorithm is governed by the parameters described below. Further details can be found in [@Garriga:2018].

\vskip 0.2in \begin{itemize} \item Run\&Mixing scheme \bigskip

\textbf{Threads} The number of partial t-SNEs to run. The ptSNE splits the dataset into this number of elementary chunks. The larger the number of threads, the faster the computation of the final solution, but the slower the convergence towards a global solution. Note that the number of threads need not match the number of physical cores available; indeed, it can be higher (what is known as multi-threading). Using multi-threading to run the ptSNE on multiprocessor systems can yield significant reductions in computational time.

\textbf{Layers} Each thread does not run a single chunk of data. Instead, the threads are chained cyclically such that each thread includes the number of chunks specified by \textit{layers}, sharing with the next thread as many chunks of data as specified by \textit{layers}-1. This chained structure with overlapping chunks creates a link between successive threads that facilitates convergence towards a global solution. As a consequence of this overlapping, each data-point takes part in several partial t-SNEs; thus, after pooling all partial solutions, we have as many global solutions as \textit{layers}. The ratio $layers/threads$ determines the thread-size $z=n\;layers/threads$ (where $n$ is the dataset size). The t-SNE algorithm is of quadratic order with respect to the size of the dataset, so by making $z \ll n$ we overcome the unsuitability of the t-SNE algorithm for large-scale datasets.

\textbf{Epochs} The run and mixing scheme is cyclical, and each cycle is structured in three phases involving a master process and several worker processes (the threads): (i) the master process mixes the data and defines the data chunks; (ii) each worker runs a partial t-SNE with the chunk of data it has been assigned; (iii) the master pools the partial solutions from the workers into a global solution. This sequence is called an \textit{epoch}. The global solution of one \textit{epoch} is used as the starting condition for the next \textit{epoch}.

\textbf{Rounds} The number of epochs is set to $\sqrt{n}$ (where $n$ is the dataset size), and the number of iterations per epoch (the epoch length) is set to $\sqrt{z}$ (where $z$ is the thread-size). Scaling the epoch length to the thread-size avoids getting too divergent solutions from the partial t-SNEs. This predetermined setup, with $\sqrt{n}$ epochs of $\sqrt{z}$ iterations each, is a \textit{round}. In general, the algorithm reaches a stable solution in a single round. If not, the ptSNE can run some extra rounds to refine the mapping, although the improvement achieved is usually low with respect to the computational time required. (A back-of-the-envelope sketch of these quantities follows this list.)

\end{itemize}
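\vskip 0.1in

As a back-of-the-envelope helper (a sketch only; these quantities are computed internally by bdm.ptsne() and may differ by rounding), the thread-size and round structure for the example dataset would be,

\vskip 0.1in

n <- 1000; threads <- 4; layers <- 2
z <- n * layers / threads      # thread-size: 500
round(sqrt(n))                 # epochs per round: ~32
round(sqrt(z))                 # iterations per epoch: ~22

\vskip 0.1in

which matches the runtime header of the example run below (size 500, 62 epochs for two rounds, 23 iterations per epoch, up to rounding).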

\vskip 0.2in \begin{itemize} \item Similarities \bigskip

\textbf{Perplexity} This parameter sets a balance between finding global and local structure in the data (corresponding respectively to high and low values of perplexity). There is no rule of thumb to set this value; in general, we will have to perform several runs testing different values (see the sketch after this list).

\end{itemize}
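\vskip 0.1in

A minimal sketch of such a parameter sweep (hypothetical perplexity values; each run is costly, so in practice we would save each result as described in Section \ref{sec:bdm.save}),

\vskip 0.1in

for (p in c(25, 100, 400)) {
  m <- bdm.ptsne(bdm.init('myDataset', X), threads = 4, layers = 2, ppx = p)
  bdm.ptsne.plot(m)
}

\vskip 0.1in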

myMap$Xdata <- exMap$Xdata   # reuse precomputed results from the example
myMap$Xbeta <- exMap$Xbeta
myMap$ptsne <- exMap$ptsne

\vskip 0.2in

myMap <- bdm.ptsne(myMap, threads = 4, layers = 2, rounds = 2, ppx = 25)
+++ running 4 threads
+++ processing data
+++ Computing Betas, perplexity 200
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.1631  0.2389  0.2832  0.2939  0.3375  0.7031
+++ computing ptSNE
... threads 4, size 500, epochs 62, iters 23
+++ epoch 0000/0062  9.0644e-01  2.8029e+00
--- round 01/02        <Cost>      <Size>      epSecs  time2End  eTime
+++ epoch 0001/0062  9.2820e-01  2.8029e+00    0.1136  00:00:07  09:48
+++ epoch 0002/0062  9.3100e-01  3.0781e+00    0.0705  00:00:06  09:48
... (output truncated)
+++ epoch 0031/0062  6.2072e-01  5.1665e+01    0.1072  00:00:03  09:48
--- round 02/02        <Cost>      <Size>      epSecs  time2End  eTime
+++ epoch 0032/0062  6.1840e-01  5.2777e+01    0.1042  00:00:03  09:48
... (output truncated)
+++ epoch 0062/0062  5.6600e-01  7.9076e+01    0.1131  00:00:00  09:48
... threads 4, size 500, epochs 62, iters 23
+++ epoch 0062/0062  5.6600e-01  7.9076e+01    0.1113  00:00:07

Note that the command's first parameter is myMap itself and the returned value is a new bdm instance. This is the general rule for the package: *the command's first parameter is a bdm (a list) and the value returned is a copy of the input bdm with a new element attached holding the command's output*. As in the example above, we can overwrite the input object with the output object, so that the net effect is a single list that keeps growing as we move forward with the protocol.

\vskip 0.1in

str(myMap$ptsne)

\vskip 0.1in

The new element myMap\$ptsne includes a matrix myMap\$ptsne\$Y (with $n$ rows and $2 \times layers$ columns) with the mapping positions $\left(y_i^1, y_i^2\right)$. The positions of the mapped data-points are given by layers, as successive pairs of columns (i.e. columns 1 and 2 in myMap\$ptsne\$Y are the mapped positions for layer 1). We can visualize the embedding with bdm.ptsne.plot(),

\vskip 0.2in

bdm.ptsne.plot(myMap)

\vskip 0.2in

The elements myMap\$ptsne\$cost and myMap\$ptsne\$size include the values of the embedding cost and the embedding size functions [@Garriga:2018]. These elements are matrices with one column per epoch. The rows of myMap\$ptsne\$cost hold the embedding cost per thread (each partial t-SNE), and the rows of myMap\$ptsne\$size hold the embedding size per layer (each global t-SNE).
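\vskip 0.1in

Given this layout (one column per epoch), a quick numeric check of convergence could average the cost across threads (a sketch, assuming the matrix structure just described),

\vskip 0.1in

avgCost <- colMeans(myMap$ptsne$cost)                    # average cost per epoch
plot(avgCost, type = 'l', xlab = 'epoch', ylab = 'cost')

\vskip 0.1in

The package command bdm.cost() (Section \ref{sec:bdm.cost}) provides a more complete picture of this.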

Note that bdm.ptsne() performs a complex sequence of operations [@Garriga:2018] and therefore has more parameters than those shown in the example, though most of them work well with their default values (see the package help for the full list).

Second step: bdm.pakde()

\label{sec:bdm.pakde}

bdm.pakde() computes a density estimation over the embedding area based on the paKDE algorithm.

myMap$pakde <- exMap$pakde   # reuse precomputed results from the example
myMap$pakde[[2]] <- NULL

\vskip 0.1in

myMap <- bdm.pakde(myMap, threads = 4, ppx = 50, g.exp = 2)
+++ running 4 threads
+++ paKDE for layer 1/2 +++
+++ Computing Betas, perplexity 50
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.362   1.923   2.283   2.697   3.449   5.593
 computing grid cell densities ...
+++ cdf 0.999

The output is attached to the list as a new element named myMap\$pakde. This element is itself a list of lists (with one element per layer). By default, the command computes the paKDE for layer one, which is attached to myMap\$pakde[[1]]. The argument layer = l allows computing the paKDE for any other layer, and the output, in this case, is attached to myMap\$pakde[[l]].

\vskip 0.1in

str(myMap$pakde)

\vskip 0.1in

The element myMap\$pakde[[1]] is a raster over the embedding area. The coordinates of the raster are saved in myMap\$pakde[[1]]\$x and myMap\$pakde[[1]]\$y, and the density values are stored in myMap\$pakde[[1]]\$z. We can visualize a heat-map of the density estimation with,

\vskip 0.1in

bdm.pakde.plot(myMap)
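\vskip 0.1in

Because myMap\$pakde[[1]]\$x, \$y and \$z are structured like the input to base R's image() function (an assumption based on the raster layout described above), a raw heat-map can also be drawn directly,

\vskip 0.1in

with(myMap$pakde[[1]], image(x, y, z, col = heat.colors(100)))

\vskip 0.1in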

The bdm.pakde() command performs a sequence of operations that is described in detail in [@Garriga:2018].

Third step: bdm.wtt()

\label{sec:bdm.wtt}

bdm.wtt() identifies the clusters by means of the WTT algorithm [@Garriga:2018]. This algorithm figures out the water tracks (i.e. river beds) in the valleys of the density landscape. Clusters are numbered following the height of their defining peaks. bdm.wtt() is a very fast command and has no tuning parameters.

myMap$wtt <- exMap$wtt   # reuse precomputed results from the example
myMap$wtt[[2]] <- NULL

\vskip 0.1in

myMap <- bdm.wtt(myMap)
+++ WTT for layer 1/2 +++
clusters: 16

This command attaches a new element myMap\$wtt to the list. This element is itself a list of lists (with one element per layer). By default, bdm.wtt() computes the WTT only for the first layer, which is attached to myMap\$wtt[[1]]. The argument layer = l allows computing the WTT for any other layer; in this case the output is attached as myMap\$wtt[[l]].

str(myMap$wtt)

\vskip 0.1in

At this stage the lines that delimit the clusters are added to the density heat-map image. We can visualize the clustering with,

\vskip 0.2in

bdm.wtt.plot(myMap)

bdm.labels()

The bdm.wtt() command identifies the clustering at the grid-cell level (Section \ref{sec:bdm.wtt}). The element myMap\$wtt[[1]]\$C is a numeric vector with clustering labels at the grid-cell level. To get a vector of clustering labels at the data-point level we use the bdm.labels() function,

\vskip 0.1in

myDataLabels <- bdm.labels(myMap)
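\vskip 0.1in

As a quick sanity check (assuming labels were supplied at initialization, as in Section \ref{sec:bdm_init}), we can cross-tabulate the clustering labels against the known class labels,

\vskip 0.1in

table(myDataLabels, myMap$lbls)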

Visual assessment of the output

bdm.cost()

\label{sec:bdm.cost}

bdm.cost() depicts the embedding cost and the embedding size as a function of the epochs [@Garriga:2018]:

\vskip 0.2in

bdm.cost(myMap)

The solid blue line depicts the average cost. The light blue lines around the solid blue line depict the embedding cost values by thread (each partial t-SNE). The solid red line depicts the average size. The light red lines around the solid red line depict the embedding size values by layers (each global t-SNE).

Both curves together assess the validity of the solution: if both reach a stable value, then the solution is stable and the mapping will hardly improve with more iterations. Nonetheless, it is possible that the size function keeps growing even though the cost function has reached a stable value. This occurs when the relative positions of the mapped data-points are fairly correct but the embedding still needs to grow to fairly represent the largest dissimilarities. In terms of clustering this is not significant unless we are interested in the quantitative semantics of the density estimation. If that is the case, we had better let the algorithm iterate until it reaches a stable embedding size.

Plotting functions

Please refer to the package help for the specific arguments to minimally tweak the plots (e.g. class palette, background colour, data-point size). Here is an example,

\vskip 0.1in

bdm.ptsne.plot(myMap, ptsne.cex = 0.6, class.pltt = rainbow(16), layer = 2)

bdm.qMap()

\label{sec:bdm.qMap}

bdm.qMap() maps numeric variables onto the embedding, depicting their values as quantile-maps. A quantile-map is similar to a heat-map; the difference is that in a quantile-map colours indicate quantiles instead of equal-width ranges. In this way we always get a clear depiction of the distribution of the variable throughout the embedding. By plotting the input variables we can detect which features are relevant/irrelevant in determining the clusters.
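A minimal illustration of the difference, using base R to bin a variable into equal-width ranges (heat-map style) versus deciles (quantile-map style),

\vskip 0.1in

v <- myMap$data[, 1]
rngBins <- cut(v, breaks = 10)                     # equal-width bins
qntBins <- cut(v, breaks = quantile(v, probs = seq(0, 1, 0.1)),
               include.lowest = TRUE)              # equal-mass bins (deciles)
table(rngBins)                                     # unbalanced counts
table(qntBins)                                     # ~100 data-points per bin

\vskip 0.1in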

bdm.qMap() generates a multi-plot layout, with the first plot depicting the embedding (as bdm.ptsne.plot()) and the rest depicting the mapping of one variable each (with a maximum of 15 variables).

By default, bdm.qMap() plots the input variables in \$data (the first 15 at most). In case we have more than 15 input variables, the argument data allows passing a matrix with a selection of the variables we want to map. This matrix can include any numeric covariate of interest.

The argument subset, given as a vector of row indexes of the input data matrix, restricts the mapping of the given variables to a subset of the input data. The data-points that are not in subset are shown in grey. This is useful to highlight the distribution of the given variables in relation to some discrete covariate of interest.

\vskip 0.2in

bdm.qMap(myMap, data = myMap$data[, 1:4], subset = which(myMap$lbls %in% 1:8))

bdm.dMap()

\label{sec:bdm.dMap}

bdm.dMap() visualizes the distribution of a discrete covariate (or class variable) throughout the embedding. In particular, bdm.dMap() computes the joint distribution $P\left(V=v_{i},C=c_{j}\right)$, where $V=\{v_{1},\dots,v_{l}\}$ is the class/covariate and $C=\{c_{1},\dots,c_{g}\}$ is the set of grid cells of the paKDE raster. The result is a fuzzy distribution of the class/covariate at each cell.

Computing the joint distribution $P\left(V=v_{i},C=c_{j}\right)$ entails an intensive computation; thus bdm.dMap() performs the computation and stores the result in a dedicated element named \$dMap.

Afterwards, the density maps can be visualized with the bdm.dMap.plot() function. This function yields a multi-plot layout where the first plot shows the dominating value of the covariate (or dominating class) in each cell, and the rest of the plots show the density map of each covariate value (or class) (see details in the package help). The multi-plot layout can be limited to a subset of the values of the covariate (or subset of classes).

\vskip 0.2in

# assume these are our class covariates
myclasses <- c('A', 'B', 'C', 'D', 'E')[(myMap$lbls %/% 4) + 1]

\vskip 0.1in

myMap$dMap <- dMap.copy   # restore the dMap element set aside earlier
myMap <- bdm.dMap(myMap, threads = 4, data = myclasses)
+++ running 4 threads
+++ cdf 0.9999

\vskip 0.1in

We can now plot the class density maps,

\vskip 0.2in

bdm.dMap.plot(myMap, classes = c('A', 'B', 'E'))

bdm.boxp()

\label{sec:bdm.boxp}

bdm.boxp() depicts a multi-plot layout of box-plots with input-variable statistics per cluster (for the first 25 clusters at most). By default, box-plots are grouped by cluster. With byVars = TRUE, box-plots are grouped by input variable.

\vskip 0.2in

bdm.boxp(myMap)
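\vskip 0.1in

Or, grouped by input variable,

\vskip 0.1in

bdm.boxp(myMap, byVars = TRUE)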

Post-processing

Given the perplexity values used for bdm.ptsne() and bdm.pakde(), the bdm.wtt() function returns a clustering at the finest grain, meaning that however small a peak in the density landscape is, it defines its own cluster. Though many of these clusters might be irrelevant for our purposes, this is the baseline from which to reach coarser classifications by hierarchically merging clusters. Nonetheless, the merging of clusters can be based on different criteria (e.g. spatial, temporal) and can be implemented using different heuristics. A relevant parameter here is the number of clusters in the final clustering.

Spatial merging of clusters

\label{sec:bdm.merge}

These functions merge clusters based on their spatial proximity. The heuristic that drives the merging is based on a minimum loss of signal-to-noise ratio (S2NR). The S2NR is the explained/unexplained variance ratio measured in the high-dimensional space, based on the given low-dimensional clustering.

bdm.optk.s2nr()

This function performs a recursive merging of clusters until reaching a configuration of only 2 clusters, measuring the S2NR at each step.

\vskip 0.1in

myMap <- bdm.optk.s2nr(myMap, plot.optk = F, ret.optk = T)
+++ processing data
+++ computing clustering hierarchy ...
+++ initial S2N ratio ...    4.2579e-01   16
+++ merging   10 into    1   4.2423e-01   15
+++ merging    7 into    6   4.2917e-01   14
... (output truncated)
+++ merging    3 into    1   7.0791e-02    6
+++ blocked hierarchy level reached at 6 clusters !!!
+++ next level ...
+++ merging    8 into    6   7.1264e-02    5
+++ merging    2 into    1   8.3086e-02    4
+++ merging    4 into    1   2.3446e-02    3
+++ blocked hierarchy level reached at 3 clusters !!!
+++ next level ...
+++ merging    6 into    5   2.0888e-02    2
+++ hierarchy levels .... :     6    3
+++ S2NR significant loss :    14    4

\vskip 0.1in For large datasets this computation can take a while, so we can save the result by setting ret.optk = TRUE. Afterwards, we can plot it to see the optimal number of clusters in terms of S2NR,

\vskip 0.2in

bdm.optk.plot(myMap)

bdm.merge.s2nr()

Once we know the number of clusters, we run this function to get the desired merged configuration.

\vskip 0.2in

myMap <- bdm.merge.s2nr(myMap, k = 6, plot.merge = F, ret.merge = T)

\vskip 0.2in By default (i.e. plot.merge = T), the merged configuration is shown on top of the original clustering with thicker lines. The result can be saved by setting ret.merge = T and shown afterwards by means of the bdm.wtt.plot() function.

\vskip 0.2in

bdm.wtt.plot(myMap)

Working with remote multiprocessor systems

Easy management of ptSNE runs

The bigMap mapping protocol entails a computationally intensive part (i.e. ptSNE, paKDE and WTT) and a posterior visualization to analyze the results. The former implies strong hardware requirements and is likely to be run on a multiprocessor system (e.g. a cluster of multiple inter-connected nodes with several physical cores each). The latter requires easy graphical interactivity, which is typically not available when working with remote multiprocessor systems. Therefore, using the bigMap package usually involves a cyclic process of submitting a job to the remote system, pulling the result to the local machine, drawing some conclusions, and starting again with different parameters until a satisfactory result is reached. To ease this task, the bigMap package includes some simple functions to systematically save the results (Sections \ref{sec:bdm.mybdm}, \ref{sec:bdm.fName}, \ref{sec:bdm.save}). Also, users working on a local machine with a secure copy client (scp) can set up the bigMap environment so that the results are automatically transferred to the user's local machine (Sections \ref{sec:bdm.local}, \ref{sec:bdm.scp}).

bdm.mybdm()

\label{sec:bdm.mybdm}

This function sets/gets a default path to save results,

bdm.mybdm('~/myPath/')

bdm.fName()

\label{sec:bdm.fName}

This function returns a default file name to save a bdm object (an *.RData file). The file name stems from the dataset name stored in \$dSet (Section \ref{sec:bdm_init}) and keeps track of the main parameters used for the mapping,

bdm.fName(myMap)

The name includes the number of threads (_zxx), the number of layers (_lxx), the boost factor (_bxx), the number of rounds (_rxx), the perplexity used for the ptSNE (_pxxx) and the number of clusters found (_kxxx).

bdm.save()

\label{sec:bdm.save}

This function saves a bdm object as an *.RData file with default file name (as given by bdm.fName()) and default path (as given by bdm.mybdm()),

bdm.save(myMap)
## +++ saved to ~/myPath/myDataset_z4_l2_b1_r2_p25_m15.RData

bdm.local()

\label{sec:bdm.local}

This function sets/gets the IP address of the local machine to which we would like the remote system to send the results,

bdm.local('xxx.xxx.xxx.xxx')

bdm.scp()

\label{sec:bdm.scp}

This function saves and transfers a bdm object. The object is saved in the remote system with the default path and file name, then transferred to the local machine and saved again with the default path and file name. A necessary condition is to keep the same path structure on both machines. A further condition is to have ssh private-key authentication access to the local machine.

bdm.scp(myMap)
## +++ saved to ~/myPath/myDataset_z4_l2_b1_r2_p25_m15.RData

Summary script

The bigMap mapping process can be summarized in a simple template R script (bdmScript.R),

library(bigMap)
# set bigMap environment
bdm.local('xxx.xxx.xxx.xxx')
bdm.mybdm('~/myPath/')
# load dataset
X <- read.csv('~/myPath/myDataSet.csv')
# mapping protocol
myMap <- bdm.init('myDataSet', X)
myMap <- bdm.ptsne(myMap, threads = 64, type = 'SOCK', layers = 2, boost = 2,
    rounds = 4, whiten = TRUE, input.dim = 10, ppx = 300, progress = 1)
myMap <- bdm.pakde(myMap, threads = 16, type = 'SOCK', ppx = 300)
myMap <- bdm.wtt(myMap)
# save and transfer to local machine
bdm.scp(myMap)

Let's assume we have a dataset in file ~/myPath/myDataSet.csv with n = 64000 rows (and whatever number m of dimensions). This bdmScript.R will run 64 partial t-SNEs with a thread-size of $z = n\;layers/threads = 64000 \times 2/64 = 2000$. The raw data X will be preprocessed performing a PCA and a whitening, and only the first ten principal components (assuming m > 10) will be used as input data. The ptSNE algorithm will perform 4 rounds of $\sqrt{64000} \approx 253$ epochs per round, with $\sqrt{2000} \approx 45$ iterations per epoch. As we set progress = 1, the program will save a bdm object with the current embedding, in folder '~/myPath/', with file name 'myDataSet_z64_l2_b2_r1_p300.RData' after the first round, 'myDataSet_z64_l2_b2_r2_p300.RData' after the second round, and so on. At the same time, these intermediate results are transferred to the local machine with IP address xxx.xxx.xxx.xxx. After the last round, the program will run the paKDE and the WTT. Finally, it will save a bdm object with the mapping protocol completed, with file name 'myDataSet_z64_l2_b2_r4_p300_kss.RData' (where ss stands for the number of clusters found). This file will also be transferred and saved to the local machine (assuming that the path '~/myPath/' exists on the local machine).

Typically, the bdmScript.R would be submitted as a batch job to a multiprocessor system via qsub or similar (depending on the platform), specifying the type of parallelization interface to be used. Submitting jobs to a multiprocessor system is quite platform dependent.

Intra-node (socket) parallelization

\label{sec:SOCK}

When using intra-node (or socket) parallelization all worker processes (64 threads in our example) are spawned within the same single node where the master process is submitted. With intra-node parallelization memory access is super fast but computation power is limited to the number of physical cores available in that node (usually 16, 20 or 64 at the best). For the case of our example (with 64 threads), submitting the job to a node with 16 cores overloads each core with 4 processes, i.e each core will have to run 4 threads at the same time. This is known as multi-threading and obviously results in longer running times in comparison with single-threading (1 thread per core). Nonetheless, because of the quadratic complexity of the t-SNE algorithm, multi-threading is in general a good choice, that is, given a fixed number of physical cores, the higher it is the number of threads, the lower it is the thread-size accounting for significant savings in computational time [@Garriga:2018].

In the following we show how we would submit our example script using SOCK parallelization (this example is for illustration purposes only and may work differently on many platforms):

  1. We (typically) must set some path environment variables to point to the programs we need to run (this is platform dependent),
~$ export PATH=/home/soft/R-3.5.1/bin:$PATH
  2. We submit the job with,
~$ qsub -pe make 16 -l h_vmem=8G R -q --vanilla -f bdmScript.R

This command submits the job to a node with 16 cores and 8GB of memory each, and calls R to run the bdmScript.R. Note that the argument -pe make 16 specifies the number of physical cores that we want available in our socket parallel environment, regardless of the number of threads that will be running (64 in our case).

Inter-node (MPI) parallelization

\label{sec:MPI}

The use of inter-node (or message passing interface, MPI) parallelization is somewhat more contrived and subject to the existence of a low-latency/high-bandwidth system network (e.g. InfiniBand) and a high-performance MPI messaging module (e.g. ompi, mpich2) installed on the system. When using MPI parallelization the worker processes are spawned across all the nodes of the cluster, independently of the node where the master process resides, and sharing data between nodes implies message passing through the network. Thus, memory access is much slower than in the case of socket parallelization. The counterpart is that we can (potentially) make use of as many physical cores as there are in the cluster. In other words, if we need to spawn 200 threads to map a large dataset, we can set up an MPI parallel environment with 200 physical cores and bind one core per thread.

Using MPI, we must make some major changes to our bdmScript.R (this example is for illustration purposes only and may work differently on many platforms):

  1. For technical reasons, an MPI cluster can be started only once per job. This means that we must split the bdmScript.R into two parts:

    • Part 1 (bdmScript1.R)

    ```
    library(bigMap)
    # set bigMap environment
    bdm.local('xxx.xxx.xxx.xxx')
    bdm.mybdm('~/myPath/')
    # load dataset
    X <- read.csv('~/myPath/myDataSet.csv')
    # mapping protocol (step 1: ptSNE)
    myMap <- bdm.init('myDataSet', X)
    myMap <- bdm.ptsne(myMap, threads = 64, type = 'MPI', layers = 2, boost = 2,
        rounds = 4, whiten = TRUE, input.dim = 10, ppx = 300, progress = 1)
    # save and transfer to local machine
    bdm.scp(myMap)
    ```

    • Part 2 (bdmScript2.R)

    ```
    library(bigMap)
    # set bigMap environment
    bdm.local('xxx.xxx.xxx.xxx')
    bdm.mybdm('~/myPath/')
    # load previously saved result (load() returns the name of the restored object)
    myMap <- get(load('~/myPath/myDataSet.RData'))
    # mapping protocol (steps 2 and 3: paKDE, WTT)
    myMap <- bdm.pakde(myMap, threads = 16, type = 'MPI', ppx = 300)
    myMap <- bdm.wtt(myMap)
    # save and transfer to local machine
    bdm.scp(myMap)
    ```

  2. Note that we must change the argument type = 'SOCK' to type = 'MPI' in both commands, bdm.ptsne() and bdm.pakde().

  3. To submit the jobs we must write a job-script file. We suggest using a job-script template, namely snow.sh, that can be used to submit any R script based on the snow R-package. This is our snow.sh job-script template (this is platform dependent),

#!/bin/bash
# execute as:
# qsub -l h_vmem=XG -pe ompi np snow path_to_myRscript/myRscript.R
# job name
#$ -N snowjob
# Use current working directory
#$ -cwd
# Set path environment variables
# MPI
export PATH=/home/soft/openmpi-2.1.5/bin:$PATH
export LD_LIBRARY_PATH=/home/soft/openmpi-2.1.5/lib:$LD_LIBRARY_PATH
# R
export PATH=/home/soft/R-3.5.1/bin:$PATH
# snow R-package
export PATH=/home/soft/R-3.5.1/lib64/R/library/snow:$PATH
# Execute
mpirun -np $NSLOTS RMPISNOW --no-save -q -f $1
  4. We submit the first job with,
~$ qsub -pe ompi 65 snow.sh bdmScript1.R
  5. Once the first job is finished, we submit the second job with,
~$ qsub -pe ompi 65 snow.sh bdmScript2.R

The commands in 4 and 5 create an MPI parallel environment with 65 physical cores and run the RMPISNOW script (from the R-package snow) on each one. The RMPISNOW script is responsible for starting an R environment on each core. One of them is configured as the master process and runs the bdmScript.R; the rest (64) are configured as worker processes, each running one thread.

References


