# The EMbC R-package: quick reference In EMbC: Expectation-Maximization Binary Clustering

options(width=80)


\graphicspath{{imgs/}}

\bigskip \begin{abstract} In this document we give a brief overview of the EMbC R-package with special emphasis on its use for behavioural annotation of animal's movement trajectories. For details about the EMbC algorithm please refer to (Garriga et. al 2016) and for further details about the package please refer to the package reference manual. \end{abstract} \bigskip

# The EMbC Algorithm

The Expectation-maximization binary clustering (EMbC) is a general purpose, unsupervised, multi-variate, clustering algorithm [@embc2015], driven by two main motivations: (i) it looks for a good compromise between statistical soundness and ease and generality of use - by minimizing prior assumptions and favouring the semantic interpretation of the final clustering- and, (ii) it allows taking into account the uncertainty in the data. These two features make it specially suitable for the behavioural annotation of animal's movement trajectories.

The method is a variant of the well sounded Expectation-Maximization Clustering (EMC) algorithm - i.e. under the assumption of an underlying Gaussian Mixture Model (GMM) describing the distribution of the data-set - but constrained to generate a binary partition of the input space. This is achieved by means of the delimiters, a set of parameters that discretize the input features into high and low values and define the binary regions of the input space. As a result, each final cluster includes a unique combination of either low or high values of the input variables. Splitting the input features into low and high values is what favours the semantic interpretation of the final clustering.

The initial assumptions implemented in the EMbC algorithm aim at minimizing biases and sensitivity to initial conditions: (i) each data point is assigned a uniform probability of belonging to each cluster, (ii) the prior mixture distribution is uniform (each cluster starts with the same number of data points), (iii) the starting partition, (i.e. initial delimiters position), is selected based on a global maximum variance criterion, thus conveying the minimum information possible.

The number of output clusters is $2^m$ determined by the number of input features $m$. This number is only an upper bound as some of the clusters can vanish along the likelihood optimization process. The EMbC algorithm is intended to be used with not more than 5 or 6 input features, yielding a maximum of 32 or 64 clusters. This limitation in the number of clusters is consistent with the main motivation of the algorithm of favouring the semantic interpretation of the results.

The algorithm deals very intuitively with data reliability: the larger the uncertainty associated with a data point, the smaller the leverage of that data point in the clustering.

With respect to close related methods like EMC and Hidden Markov Models (HMM), the EMbC is specially useful when: (i) we can expect bi-modality, to some extent, in the conditional distribution of the input features or, at least, we can assume that a binary partition of the input space can provide useful information, and (ii) a first order temporal dependence assumption, a necessary condition in HMM, can not be guaranteed.

# The EMbC R-package

The EMbC algorithm is of general purpose and can deal with any type of data sets or time series. However, the EMbC R-package is mainly intended for the behavioural annotation of animals' movement trajectories where an easy interpretation of the final clustering and the reliability of the data constitute two key issues, and the conditions of bi-modality and unfair temporal dependence usually hold. In particular, the temporal dependence condition is easily violated in animal's movement trajectories because of the heterogeneity in empirical time series due to large gaps, or prefixed sampling scheduling.

Input movement trajectories are given either as a data.frame or a Move object from the move R-package. The package deals also with stacks of trajectories for population level analysis. Segmentation is based on local estimates of velocity and turning angle, eventually including a solar position covariate as a daytime indicator.

The core clustering method is complemented with a set of functions to easily visualize and analyse the output:

• clustering statistics,
• clustering scatter-plot (2D and 3D),
• temporal labelling profile (ethogram),
• plotting of intermediate variables,
• confusion matrix (numerical validation with respect to an expert's labelling),
• visual validation versus external information (e.g. environmental data),
• generation of kml or web-map documents for detailed inspection of the output.

Also, some functions are provided to further refine the output, either by pre-processing (smoothing) the input data or by post-processing (smoothing, relabelling, merging) the output labelling.

The results obtained for different empirical datasets suggest that the EMbC algorithm behaves reasonably well for a wide range of tracking technologies, species, and ecological contexts (e.g. migration, foraging).

## Working Environment

The EMbC package has dependencies with the following packages:

• methods, formal methods and classes [@methods];
• sp, classes and methods for spatial data [@sp1; @sp2];
• maptools, tools for reading and handling spatial objects [@maptools];
• mnormt, the multivariate normal and t distributions [@mnormt];
• RColorBrewer, ColorBrewer palettes [@RColorBrewer];

We also suggest the package rgl, 3D visualization device system (OpenGL) [@rgl], to allow for dynamic 3D scatter-plots in multivariate analyses.

For researchers who are familiar with the MoveBank framework, we include a special link with the package move, Visualizing and analizing animal track data [@move], to allow users to make use of Move objects as input trajectories.

## Basic structure

Basically, the package consists of a hierarchy of classes:

• binClst, the main class, representing the binary clustering of a multivariate data set;
• binClstPath, a child class of the former, representing the binary clustering of a trajectory;
• binClstStck, basically a list of binClstPath objects resulting from the global clustering of a stack of trajectories.

Instances of these classes are build by means of two main constructors:

• embc(), the main core of the package, implementing the EMbC algorithm itself; this constructor takes as input a matrix of data-points and returns an object of class binClst with the multivariate binary clustering of the input data;

• stbc(), a specific constructor for the behavioural annotation of movement trajectories; the input to this constructor is a trajectory (given either as a data.frame, a Move object or a list of them) and returns an object of class binClstPath (or any of its child classes) with the bivariate (velocity/turn) clustering of the trajectory; eventually it can compute a trivariate clustering by including a parameter indicating a solar covariate (either height or azimuth) to be used as a daytime indicator.

The behaviour of the constructors can be modified by means of different parameters (e.g. maximum number of iterations, information shown at each step, pre-smoothing of the data).

The output objects have several slots containing all information related to the binary clustering (input data, intermediate computations and output data). All slots are accessible and can be used with any R function external to the package or even modified. However, we recommend not to manually change the values in the slots in order to keep the internal consistency.

library(EMbC)


# Class: binClst

This is the core class that implements the multivariate binary clustering algorithm. The input data-set is given as a matrix with data points given as rows and input features as columns. No more than 5 (6 at most) variables should be used in order to get a human interpretable set of clusters.

Let's use the object x2d included in the package. This object contains a set of data points generated from a bivariate GMM with four components (slot [email protected]), and a labelling indicating which component generated each data point (slot [email protected]);

par(mgp=c(1.5, 0.4, 0), cex.lab=0.8, cex.axis=0.8)
plot(x2d@D, col=x2d@L, xlab='X1', ylab='X2')
# [email protected] is a matrix with the input data
# [email protected] is a numeric vector with the reference labelling


## Binary clustering

To perform the clustering of a data-set we simply call the embc() constructor passing in a matrix with the input data and storing the result in a variable (e.g. mybc);

mybc <- embc(x2d@D)


At each iteration, the algorithm shows the iteration number, the current likelihood value, the number of effective clusters and the number of labels that have changed with respect to the previous iteration.

## Slots

mybc is an instance of class binClst. Any slot of a binClst object is accessible and can be used by (passed to) any R function. The most basic slots of a binClst object are:

slotNames(mybc)

• [email protected], a matrix with the input data points;
• [email protected], a matrix of the same dimension as the input data matrix, with a reliability value (ranging from 0 to 1) for each input value, (by default is a matrix of ones);
• [email protected], a matrix with the values of the delimiters for each binary region;
• [email protected], a list where each element is a named list with the Gaussian parameters of each output cluster;
• [email protected], a matrix with the likelihood weights of each data-point with respect to each output cluster;
• [email protected], a numeric vector with the output labelling of each location, (the number of the cluster with the highest likelihood weight, coded as 1:LL, 2:LH, 3:HL, and 4:HH).

## Likelihood Plot

The likelihood plot allows a visual assessment of the convergence of the algorithm;

# the lkhp() function allows an offset parameter;
lkhp(mybc)       # left panel
lkhp(mybc, 10)   # right panel


The last iterations may show some decrease in likelihood. This is due to a slight discrepancy between binary and optimal likelihood clusterings that can appear at the last steps of the algorithm, normally involving just a few data-points at the boundaries of the binary regions (note the low number of data-points changing their labels beyond iteration number 12 where the likelihood starts decreasing).

## Clustering parameters

The function stts() shows the statistics of the clustering. The columns $X_{i}.mn$ and $X_{i}.sd$ show the mean and standard deviation of the input features. The last two columns show the marginal distribution of the clustering in absolute (number of data-points) and relative (percentage) values;

stts(mybc)


The complete set of parameters of the Gaussian mixture is accessible through the slot [email protected] This slot is a list of inner named-lists (M for mean and S for the covariance matrix) for each cluster. For instance, the parameters for cluster 1 (LL) are;

mybc@P[[1]]


The delimiters are accessible through the slot [email protected] where we have the min() and max() values that delimit each binary region;

mybc@R


## Clustering scatter-plot

The function sctr() makes a scatter-plot of the data-points, showing the clusters in different colours, and depicting the binary delimiters (light grey lines) to show the binary regions;

sctr(mybc)


The NC in the legend stands for not classified points. Not classified points may appear only when performing the behavioural annotation of movement trajectories (explained later in this document) and correspond to outliers due to errors or gaps in the trajectory or, typically, to the last track of the trajectory.

## Clustering validation

In a supervised case, that is in case that an expert's labelling is available, we can use this labelling to validate the results of the clustering. The expert labelling must be numerically coded and translated to the range of the number of clusters, and must be given as a numeric vector with one numeric label for each location.

We can make a visual validation of the clustering versus the expert labelling by means of the sctr() function passing in the expert labelling vector as a second parameter;

sctr(mybc, x2d@L)
# the top plot shows the clustering result;
# the bottom plot shows the reference labelling;


We can perform a numeric assessment of the clustering in terms of a confusion matrix by means of the function cnfm();

cnfm(mybc, x2d@L)


The confusion matrix shows values of row precision and row F-measure, and values of column recall and column F-measure. The 3x2 subset of cells at the bottom-right show respectively: the overall accuracy, the average recall, the average precision, NaN, NaN and the Macro-F-measure.

# Class: binClstPath

The binClstPath is a binClst subclass intended to automatically perform the bivariate clustering of a movement trajectory, based on estimated local values of velocity and turn. It can also perform a trivariate clustering by incorporating a daytime covariate (i.e. solar height or solar azimuth).

The input data-set is a trajectory given as a data.frame with timestamps, longitudes and latitudes in columns 1:3 respectively (column headers are user free). Timestamps must be given as.POSIXct() with the specific format "%Y-%m-%d %H:%M:%S".

As an example, the package includes an object named expth. This is a synthetically generated trajectory stored as a data.frame;

head(expth)


expth is a synthetically generated trajectory with expert labelling (note the column expth\$lbl). Further columns of data can be included in the input data.frame as long as the first three columns respect the required format. Tip: by including an expert labelling with a column labelled lbl all validation functions will make use of it by default. ## Bivariate velocity-turn clustering To perform the bivariate velocity/turn clustering of this trajectory we simply call the stbc() constructor passing in the data.frame with the time/space coordinates of the trajectory and storing the output binClstPath object in a variable (e.g. mybcp); mybcp <- stbc(expth, info=-1) # info=-1 supresses any step wise output information  The output object mybcp is a binClstPath instance with the following slots; slotNames(mybcp)  As a child class, a binClstPath object inherits and extends the set of slots of the binClst class. The basic slot differences with respect to the binClst class are: • [email protected], a data.frame with three first columns named as dTm, lon, lat plus all additional columns of data included in the input data.frame; • [email protected], a numeric vector with the computed time span between locations; • [email protected], a numeric vector with the estimated distances between locations, computed as loxodromic lines; • [email protected], a numeric vector with local heading directions, given in clockwise radians from North (a value of$2\,\pi$is used to distinguish no movement from movement heading North); • [email protected], is the matrix of input data that in this case is automatically generated with the estimated local values of velocity and turn; • [email protected], is the matrix of uncertainties that is also automatically generated based on the time-spans between locations. Slots tracks, midpoints and bursted are related to the bursted visualization of the trajectory (covered later in this document) and should not be manipulated. ## Basic functionality Because of class inheritance, all functionality described for the binClst class (e.g. likelihood plot, clustering parameters, scatter-plot, validation) holds for a binClstPath instance. stts(mybcp)  sctr(mybcp)  cnfm(mybcp) # the expert labelling given in expth$lbl is used by default


Nonetheless, the binClstPath class has some particular functionalities of special interest for the case of behavioural annotation of movement trajectories. These functionalities are described in the following.

### Labeling profile

The function lblp() plots the temporal series of data and the temporal profile of the behavioural labelling;

# lims=c(a, b) limits the plot to a chunk of the trajectory
lblp(mybcp, lims=c(100, 500))


### Fast visualization of the annotated trajectory

The function view() shows the annotated trajectory and a top panel with the temporal sequence of behaviours;

# this function allows a parameter lims=c(a,b) as well
view(mybcp, lims=c(100, 500))


### Detailed inspection of the annotated trajectory

We can generate kml or html documents for a detailed inspection of the output by means of google-earth or the user's system browser. The package allows two types of visualization of the annotated trajectory: a point-wise visualization (functions pkml() or pmap()) or a burst-wise visualization (functions bkml() or bmap()) [@embc2015];

# point-wise kml doc generation;
# display=TRUE launches google-earth from within R;
pkml(bc, display=TRUE)


By default, the kml or html documents are named with a Sys.time() based name and saved in a folder embcdocs automatically generated in the user's home directory. This can be modified by means of the corresponding parameters.

The burst-wise visualization requires the computation of burst segments and midpoints. This is computed only the first time that a burst-visualization of a trajectory is requested. In case of long trajectories, this process can take some time.

### Plot intermediate variables

Intermediate data computed by the stbc() constructor and stored in the binClstPath object can be easily plotted with automatic formatting and labelling of axes;

# plotting time-spans, distances and heading directions;
# this is the default behavior when we just pass the binClstPath instance;
varp(mybcp)

# plotting input data (estimated local values of velocity and turn);
varp(mybcp@X)

# plotting certainties associated to each data-point (and input feature)
varp(mybcp@U)


Indeed, the function varp() is a wrapper for the R plot() function. The purpose of this function is simply to ease the visualization of intermediate variables by formatting and labelling the axes accordingly to each one.

## Using Move objects from the R-package move

Note: the dependency with respect to the move R-package has been dropped, and the use of the old binClstMove objects is now deprecated. Nonetheless Move objects can still be passed directly to the stbc() function.

This is intended for users having trajectories in Movebank (https://www.movebank.org/) and familiarized with the move R-package. Let's use the leroy data in the Move R-package.

library(move)
data(leroy)


leroy is a GPS trajectory of an urban Fisher (Martes pennati) with r length([email protected]) tracks, spanning from r [email protected][1] to r [email protected][length([email protected])], with a mean time interval between tracks of r round(mean(diff([email protected])),1) minutes. Move objects can be passed directly to the stbc() constructor;

# leroy is passed directly to the constructor
leroybc <- stbc(leroy, info=-1)


## Trivariate clustering: including a daytime covariate

Daytime covariates refer to the solar position. This can be given as solar height in degrees above the horizon (night/day distinction), or by solar azimuth in degrees from north (sunrise/sunset distinction).

Including daytime covariates is the natural way of incorporating time information in the clustering of an animal's movement trajectory, with the potential advantage of increasing the maximum number of output clusters to $2^3=8$, i.e. the number of movement behaviours that can potentially be distinguish.

A trivariate clustering including a daytime covariate is done by means of the parameter scv with possible values 'height' or 'azimuth';

leroybc3 <- stbc(leroy, scv='height', info=-1)


The output of the stbc() constructor is still a binClstPath (the binClstMove object of previous versions is deprecated). As we included a covariate, leroybc3 corresponds now to a trivariate binary clustering and therefore its functionality presents some particularities.

Let's see the statistics of the clustering;

stts(leroybc3)


Features are ordered as X1:daytime, X2:velocity and X3:turn. Note that highs and lows for daytime (the solar height in degrees above the horizon) do not necessarily correspond to daytime or night-time clusters (note the negative mean for X1 in HXX clusters). This is so because almost all of the activity of this animal happens during the night and it is more likely to discern different behaviours along night-time.

## Trivariate clustering scatter plot

By default, the sctr() function of a trivariate clustering depicts a double scatter-plot corresponding to low and high values of the covariate respectively. This can be changed by means of the parameter showVars=c().

sctr(leroybc3, showVars=c(1, 2, 3))
# showVars=c(1,2,3) is the default option and it is only shown for illustrative purposes
# by default the background colour is set to light-grey to enhance visibility
# the "bg"" parameter allows changing this default behaviour


If the R-package rgl is installed one can use the function sct3() to get a dynamic 3D (i.e. can be zoomed and rotated) plot, more useful for a visual understanding of the clusters.

sct3(leroybc3, showClst=c(5, 6, 7, 8))
# with showClst=c() we can restrict the plot to a particular subset of clusters


The sct3() function is defined for and inherited from the binClst class, and therefore intended for a general multivariate clustering. If the number of input features is greater than 3 and showVars=c() is not specified, the first three variables are used by default.

## Smoothing

When clustering a time series the EMbC disregards the temporal information. As a result, the output labelling may reveal small (possibly irrelevant) changes in behaviour framed in a broader temporal context (e.g. a long-term predominant behavioural mode).

The package includes two possibilities to account for the temporal information in the time series and smooth out the fine grain locality of the output labelling.

The smth() function applies a post-smoothing procedure [@embc2015] to the output labelling and returns a smoothed copy of the input instance;

# dlta is the maximum likelihood difference to accept a relabelling
# dlta=1 (accept all changes) is the default behaviour
postbc3 <- smth(leroybc3, dlta=0.9)


Alternatively, a pre-smoothing of the input data is also possible by means of the parameter smth of the stbc() constructor.

# smth sets the smoothing time window length in hours
prebc3 <- stbc(leroy, smth=1, scv='height', info=-1)


The lblp() function allows comparing two output labellings, adding a bottom line indicating the differences;

lblp(postbc3, smth(prebc3), lims=c(200, 600))
# of note:
# although performing a pre-smoothing, we can still aply a post-smoothing;
# there is no real need to instantiate the smoothed copy of prebc3;
# this is useful for saving memory in case of long trajectories;


## Relabelling

Note that by pre-smoothing the input data, cluster 5 (HLL) has been merged into cluster 6 (HLH) and we get a final clustering with only 7 different behaviours. When merging occurs, the semantics of the final labelling is somewhat misleading because the final labelling is only a result of how the algorithm evolved until reaching the merging point. In any case, the label should be read as HLX, that is, by taking into account that the last feature (in this case the turn) is meaningless given the values of the rest (i.e turn can be either H or L given H values of daytime and L values of velocity).

Using the pkml() function we can visualize which locations correspond to cluster HLH;

pkml(smth(prebc3), showClst=6, display=TRUE)


\begin{figure}\centering \includegraphics[width=12cm,height=5.5cm]{pkml0.png} \caption{Fisher (\textit{Martes pennati}) foraging trajectory. A kml point-wise view of the annotated trajectory showing only the HLH locations. The blob shows the spatial clustering of these locations most likely indicating the nest.} \label{pkml0} \end{figure}

By combining the spatially clustered distribution of locations HLX (Figure \ref{pkml0}) with the semantics of the cluster (high daytime, low velocity), we could tell that these locations are most probably indicating the nests.

Obviously, the package does not deal with labels like HLX. However, one can change labels as desired (even to manually force the merging of two clusters). In this case, we would probably feel more comfortable by relabelling the cluster HLH (cluster number 6) as HLL (cluster number 5) to suggest a more clear semantics of resting behaviour;

rlbl(prebc3, 6, 5)


Note that the function rlbl() does not return a relabelled copy of the input instance, instead it relabels the self instance. Nonetheless, the parameters of the clustering remain unchanged. The relabelling is effective only for visualization purposes and can be easily reversed by means of the parameter "reset".

## Validation versus external information

The chkp() function is similar to lblp() but plots the labelling profile versus a control variable (e.g. environmental information). The control variable must be given as a numeric vector that is depicted as coloured background bars (with specific parameters to control the colouring and legend labels);

chkp(smth(prebc3), lims=c(200, 600))
# the solar height is the control variable used by default;
# note the relabelling we did before;


# Class: binClstStck

The binClstStck is an extension (not a child class) of the binClstPath class particularly designed to work with multiple trajectories. This is intended for population level analysis from trajectories of several individuals, or period level analysis by splitting long trajectories of several years.

To illustrate this let's figure out two trajectories from our example path, simulating two different individuals;

tmp <- runif(nrow(expth))
# simulated trajectory of individual 1
expth1 <- expth[which(tmp<=0.5), ]
# simulated trajectory of individual 2
expth2 <- expth[which(tmp>=0.5), ]


To perform the clustering of a stack of trajectories we pass the individual trajectories to the stbc() constructor as a list (either of data.frame trajectories, Move objects, or a mixture of them);

# we can combine data.fame trajectories and move objects
# only for illustrative purposes !!!
mystck <- stbc(list(expth1, expth2, leroy), info=-1)


In this case, the stbc() constructor returns an instance of the binClstStck class. In general, all the functionality described for a binClst class will work for a binClstStck instance;

stts(mystck)

sctr(mystck)


The exception is the cnfm() function. This function will work only if expert's labelling is supplied for all trajectories in the stack (in our example, leroy does not have expert's labelling);

cnfm(mystck)
# this will only work when expert labelling is given for all trajectories in the stack


## binClstStck slots

It is worth noting that a binClstStck instance is not a binary clustering object itself. Instead, it is an object with two slots:

slotNames(mystck)

• slot [email protected] is a binClst object with the population level clustering, thus it has no path associated, and functions like view(), pkml() or bkml() will not work;
class(mystck@bC)

• slot [email protected] is a list of binClstPath objects, which are the results of the population level clustering upon each individual;
class(mystck@bCS)


Each element in [email protected] is a binClstPath instance corresponding to each individual path given in the input data list;

lapply(mystck@bCS, class)


It is important to keep this in mind when applying the above functions to either the population ([email protected], a binClst instance) or the individual ([email protected][[i]], a binClstPath instance) levels.

## Select an individual out of the stack

For ease of use, the function slct() allows selecting an individual's clustering out of the population level;

bcInd1 <- slct(mystck,1)


As usual, it is not necessary to instantiate each individual;

sctr(slct(mystck, 1))  # left panel
sctr(slct(mystck, 3))  # right panel

# sctr(slct(mystck,1)) yields the same output as sctr(bcInd1) or sctr([email protected][[1]]);


## Comparing individual's behaviour with population's average behaviour

We can use all the functionality of a binClstPath object that allows comparisons (e.g. sctr(), lblp(), cnfm()) to make numeric assessments or visualizations of diferences among individuals or among individuals and population:

• we can compare individuals with their correspondent out of the population clustering;
cnfm(stbc(expth1, info=-1), slct(mystck, 1))
# stbc(expth1, info=-1) is the individual level clustering corresponding to individual 1;
# slct(mystck, 1) is the population level clustering corresponding to individual 1;

• or we can compare individuals within the population;
lblp(slct(mystck, 1), slct(mystck, 2))


# References

## Try the EMbC package in your browser

Any scripts or data that you put into this service are public.

EMbC documentation built on May 7, 2018, 5:04 p.m.