In UtrechtUniversity/SCCA: Spectral Clustering Correspondence Analysis in R

knitr::opts_chunk$set(collapse = T, comment = "#>")
options(tibble.print_min = 4, tibble.print_max = 4)

Introduction

In this vignette we walk you through a useful application of the package SCCA in the field of ecology. Specifically, it does so using a case study for global carnivores, allowing to replicate the results in van Dam, et al.

We will show how to perform Correspondence Analysis to an incidence (presence-absence) matrix representing the distribution of the global Carnivora, including both marine and terrestrial species. We demonstrate the most common functions to obtain bioregions and/or chorotypes and to visualize analytic results.

Installing and loading the package

If not already done so, you can install the package from GitHub. Then load the package.

## install and load packages
#library(devtools)
#install_github("UtrechtUniversity/SCCA", build_vignettes = TRUE)
library(SCCA)
set.seed(0)

SCCA makes use of K-means clustering. To make results of one run to another comparable, we allow the user to set the seed for the K-means algorithm. For more information click here.

Data on global Carnivora

The package has a pre-loaded incidence matrix of presences-absences of global carnivores (Mammalia) from both land and oceans. The data set is downloaded from DRYAD. A quick glance to the data:

dim(carnivora)

head(carnivora[14000:14006, 150:160])

The data contains a binary matrix of sites x species (rows x columns) with values of 0 indicating absence and values of 1 indicating presence of a given species j in site i. Specifically, there are 41580 sites representing ca. 100 x 100 km grid cells distributed globally and with at least one presence. There are 288 species including marine and terrestrial carnivores.

We can have a look at the geographic distribution of those sites by loading the pre-loaded coordinates of the carnivora dataset, which are named carnivora_sites. Also, we can list which species are represented in each column with teh pre-loaded carnivora_species.

# retrieve coordinates
sites <- carnivora_sites

# inspect 
head(sites)

# retrieve species
carnivores <- carnivora_species

# inspect 
head(carnivores)

# plot geographic coordinates of sites
plot(sites$lon, sites$lat,
  col = adjustcolor('grey',0.4),
  pch = 19,
  cex=0.3
)

Using SCCA

Once familiar with the data we can now review the basic functions in the package. In this example, we apply Correspondence Analyses aiming to analyze sites according to their species composition (but we could have as well aimed at analyzing species based on their geographic distributions by applying the same procedure to columns instead of rows). The analysis yields a number of clusters in which data can be classified, and which are interpreted as bioregions.

`scca_compute()`: apply CA to a matrix m

The function scca_compute performs a hierarchical, Spectral Clustering Correspondence Analysis on a matrix M representing a bi-partite network. The process consists of the following steps:

Computation of eigenvalues and eigenvectors of the similarity matrix derived from M.
Determine K, being how many relevant eigenvectors should be found.
Apply K-means to find a clustering of the elements of M into K clusters.

This is an iterative process that leads to an hierarchical clustering. The function to calculate a number of clusters is a heuristic eigengap_heuristic, which inspects the eigenspectrum and finds the number of axis before the largest eigengap or distance between each axis and the next. Multiple eigenvalues of 1 indicates fully disconnected components and are prioritised in being clustered separately. The heuristic also provides us with the stopping condition: if it results in 1 cluster being optimal, the particular branch stops being clustered further.

We can apply the function to the carnivora dataset, generate an object that we call scca and inspect the results using the function scca_print().

# application of this function to a large dataset sucha as the carnivora dataset here
# can take up to a minute.
scca <- scca_compute(m = carnivora, 
  iter.max = 10,
  nstart   = 100,
  disconnect.rm   = TRUE,
  max_eigenvalues = 25,
  decomp    = "svd",
  max_depth = Inf,
  heuristic = eigengap_heuristic)

scca_print(scca , 'k', 'n_labs', 'depth', 'node_type')

`scca_print()`: print hierarchical clustering

This function shows the hierarchical structure of fitted clusters.

## inspect the structure of object scca
scca_print(scca , 'k', 'n_labs', 'depth', 'node_type')

There are r length(unique(scca_get_clusters(scca)$cluster)) leaf nodes or clusters identified by the heuristic. The process of K-means clustering has a stochastic component in the initialisation, that is, the choice of the starting centroids, which are random. The subsequent algorithm of optimising the clustering is deterministic, but still, the random initialisation might lead to stochasticity. We advise the user to try out different seeds to investigate the sensitivity of the process to this stochastic effect in K-means, which differs per system.

The attributes asked to be printed in scca_print() have the following meaning: n_labs is the number of observations (labels), k is the number of relevant Eigenvalues (i.e. before the eigengap) at each branch node, node_type is the type of branch (internal) or leaf (terminal) nodes. If no attribute is given, then only the tree hierarchy is printed.

`scca_get_clusters()`: retrieve and plot/map clusters

The function scca_get_clusters saves the observations and the cluster it has been assigned to, in an object. which allows to map obtained clusters or bioregions.

# save the clusters in object scca
(clusts<-scca_get_clusters(scca))

## how many clusters?
length(unique(clusts$cluster))


# plotting clusters
plot(sites$lon,sites$lat,
     col = adjustcolor('grey',0.4),
     pch = 19,
     cex = 0.3)

for(i in sort(unique(clusts$cluster))) {#i=2
  points(
    sites$lon[which(clusts$cluster==i)],
    sites$lat[which(clusts$cluster==i)],
    pch = 19,
    cex = 0.5,
    col = adjustcolor(hcl.colors(22)[i],0.6))
}

Regions shaded in different colors represent each cluster or bioregion. The loop above plots all clusters sequentially, but it would be easy to plot any single cluster by adjusting the range of values for i.

`scca_get_node()`: retrieve any given node (axis) and plot its eigen-spectrum

This function retrieves the values for any given node in object scca, which allows to, for example plot eigenspectrums. Type ?scca_get_node at the R console to see which information is stored in a node. Not only for leaf nodes but also for bracnch nodes.

node1 <- scca_get_node(scca, node = 1)

## check eigenvalues:

## plot eigenspectrum
plot(
  x = 1:length(node1$spectrum),
  y = node1$spectrum,
  ylab = "Eigenvalues",
  xlab = "Index")

There is however, a more direct function to plot the eigen spectrum for any given node: scca_plot_spectrum().

## the same can be done directly with function scca_plot_spectrum()
scca_plot_spectrum(scca,
  node = 1,
  plot = TRUE)

These plots show the sorted spectrum for all data (node 1), showing two eigenvectors with value equal 1, indicating that there are two fully disconnected clusters. Function scc_plot_spectrum() also prints the sorted eigenvalues as a table.