Installing sgdm

In order to install sgdm from GitHub, you need to install the devtools package first. Using the install_git function in the devtools package, the sgdm package can be easily installed:

library(devtools)
devtools::install_github("sparsegdm/sgdm_package")
library(sgdm)
library(sgdm)

Note 1: On a Linux (Ubuntu) machine things might be a little more complicated and you first need to install openssl for linux:

Before you can install the devtools package.

Note 2: The PMA package, which is required by SGDDM and installed automatically, needs the impute package from the bioclite repository. However, it seems that PMA doe not automatically load impute. If you haven't installed either PMA or impute before, you need to first install the impute package:

source("https://bioconductor.org/biocLite.R")
biocLite("impute")

Exemplary data

The package includes nine functions, and three exemplary datasets. The exemplary datasets include one biological dataset, one predictor dataset and one predictor map. The trees biological dataset is composed of 30 observations with abundance values for 48 different tree families in an area of natural vegetation in the Brazilian Cerrado. The spectra predictor dataset is composed of the same 30 observations, with reflectance values for 83 narrow spectral bands; extracted from spaceborne hyperspectral Hyperion imagery after pre-processing and band quality screening. Both datasets include an ID column and the later also include two geographical coordinate (X and Y) columns. The spectral.image predictor map is a raster object constituting of a subset (100 times 100 pixels) of the respective Hyperion image.

Running a sgdm model

Running a full sgdm model requires five steps:

In the following we will go through each of the four steps in order to examplify how the sgdm package works.

Parameterize and train the SGDM model

For parameterizing a sgdm model, it is necessary to identify the best penalization parameter pair using a heuristic grid search implemented in the function sgdm.param:

sgdm.gs <- sgdm.param(predData = spectra, bioData = trees, k = 30, 
                      predPenalization = seq(0.6, 1, 0.1), bioPenalization = seq(0.6, 1, 0.1), geo = F)

The parameter k sets the number of components to be used and the parameters predPenalization and bioPenalization set the penalization values to be tested (ranging from 0 (strong penalization) to 1 (weak penalization). The standard values implemented in this function range from 0.6 to 1 in 0.1 steps, but these values can be manually configured to better match the used datasets. Finally, the user can specify if geographical distance as a variable in the GDM model should be used. Note that this function implementation only allows biological data in format 1 specification as described in the gdm package.

The function sgdm.param returns a performance matrix, with the performance values (RMSE) of each parameter pair estimated from 5-fold cross-validation:

print(sgdm.gs, digits = 4)

Using the performance matrix and the sgdm.best function, we can retreive the best model (output = "m"):

sgdm.model <- sgdm.best(perf.matrix = sgdm.gs, predData = spectra, bioData = trees, output = "m", k = 30)

Summary of the model:

summary(sgdm.model)

Alternatively, the sgdm.best function also allows the user to retrieve the resulting sparse canonical components (output = "c") or the respective canonical vectors (output = "v").

sgdm.sccbest <- sgdm.best(perf.matrix = sgdm.gs, predData = spectra, bioData = trees, output = "c", k = 30)
sgdm.vbest <- sgdm.best(perf.matrix = sgdm.gs, predData = spectra, bioData = trees, output = "v", k = 30)

Reduce the SGDM model by removing non-significant predictors

Some of the sparse canonical components might be less important than others. In order to reduce the model complexity, we can utilize the gdm.varsig function with the sparse canonical components and the biological data:

sigtest.sgdm <- gdm.varsig(predData = sgdm.sccbest, bioData = trees)

and use this significance test result to reduce the predictor data for re-training the model:

sgdm.sccbest.red <- data.reduce(data = sgdm.sccbest, datatype = "pred", sigtest = sigtest.sgdm)

In order to train the final model, the significant sparse canonical components must be combined with the biological dataset in a site pair dataset using the function formatsitepair in the gdm package:

spData.sccabest.red <- gdm::formatsitepair(bioData = trees, bioFormat = 1, dist = "bray",
                                           abundance = TRUE, siteColumn = "Plot_ID", 
                                           XColumn = "X",YColumn = "Y", predData = sgdm.sccbest.red)

sgdm.model.red <- gdm::gdm(data = spData.sccabest.red)

The final model can now be inspected and the predicted dissimilarities plotted:

summary(sgdm.model.red)
plot(sgdm.model.red$predicted, sgdm.model.red$observed, xlim = c(0, 1), ylim = c(0, 1))
abline(0, 1)

Validate the SGDM model

For independent validation, there is a function in the sgdm package called gdm.cv, which performes n-fold cross-validation on a (S)GDM model:

gdm.cv(spData = spData.sccabest.red, nfolds = 10)

The functions returns the cross-validated RMSE value (default). Alternatively, the cross-validated coefficient of determination (r$^2$) can be returned:

gdm.cv(spData = spData.sccabest.red, nfolds = 10, performance = "r2")

Map community composition patterns

The community composition patterns can be plotted along the main axes of variation following an NMDS transformation of the predicted dissimilarities between sample pairs:

community.samples <- gdm.map(spData = spData.sccabest.red, model = sgdm.model.red, k = 0, t = 0.1)

The number of NMDS axes to be extracted is by default automatically determined by the resulting mean NMDS stress values out of 20 iterations following:

The number of NMDS axes (k) can also be set by the user.

These patterns can also be mapped in space using a predictor map. As it would be unfeasible to run a NMDS on the dissimilarities between all possible pairs of image pixels, it is possible to assign the pixels to the NMDS axes from the sample pairs, through knn-imputation.

The predictor map used here is called spectral.image and is a RasterStack object, which can be plotted in false color:

raster::plotRGB(spectral.image, r = 43, g = 22, b = 12, stretch = "hist")

In order to map the community composition patterns predicted by sgdm, it is necessary to apply the canonical transformation onto the prediction map. The resulting component map must also to be reduced according to the variable significance test:

component.image <- predData.transform(predData = spectral.image, v = sgdm.vbest)
component.image.red <- data.reduce(component.image, datatype = "pred", sigtest = sigtest.sgdm)

With the resulting reduced component image, it is possible to map the community composition patterns in space. For example, in order to visualize them in RGB space, it is possible to extract 3 NMDS axes:

map.sgdm.red <- gdm.map(spData = spData.sccabest.red, predMap = component.image.red, model = sgdm.model.red, k = 3)
raster::plotRGB(map.sgdm.red, r = 3, g = 2, b = 1, stretch = "hist")


sparsegdm/sgdm_package documentation built on May 30, 2019, 6:35 a.m.