knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

Working with machine learning methods and spatial data can be computationally expensive. Especially for larger scale applications like global maps (Ludwig et. al 2024) or high resolution data (e.g. from drones) computation times of CAST methods can be quite long. This vignette goes over various options for efficient spatial modelling workflows and parallelization options in order to speed up computation times of CAST methods.

Forward feature selection

As the forward feature selection basically is a brute force grid search of the best predictor variable combination, during ffs a large number of machine learning models are trained to compare their performance. These model trainings can be run in parallel. On Unix machines, this is implemented using mclapply from the parallel-package. Simply set the cores argument to a value > 1 to use multiple cores for the model training. On windows machines, you will get the warning Parallel computations of ffs only implemented on unix systems. cores is set to 1.

data("splotdata")
spatial_cv = CreateSpacetimeFolds(splotdata, spacevar = "Biome", k = 5)
ctrl <- trainControl(method="cv",index = spatial_cv$index)

ffsmodel <- ffs(predictors = splotdata[,6:16],
                response = splotdata$Species_richness,
                tuneLength = 1,
                method = "rf",
                trControl = ctrl,
                ntree = 20,
                seed = 1,
                cores = 4)

Regardless of the system, you can speed up the model training in ffs, as caret::train has a parallelization option that trains models on multiple cores. The code below is adapated from https://topepo.github.io/caret/parallel-processing

library(doParallel)

data("splotdata")
spatial_cv = CreateSpacetimeFolds(splotdata, spacevar = "Biome", k = 4)
ctrl <- trainControl(method="cv",index = spatial_cv$index)

cl <- makeCluster(4)
registerDoParallel(cl)
ffsmodel <- ffs(predictors = splotdata[,6:16],
                response = splotdata$Species_richness,
                tuneLength = 4,
                method = "rf",
                trControl = ctrl,
                ntree = 20,
                seed = 1,
                cores = 1)
stopCluster(cl)

Another option is the use of a different model algorithm. For example, the ranger package is a fast and multicore implementation of the random forest model.

ffsmodel <- ffs(predictors = splotdata[,6:16],
                response = splotdata$Species_richness,
                method = "ranger")

Area of applicability

Estimating the Area of Applicability (AOA) can be computationally expansive as the method is based on finding minimum distances and nearest neighbors. We can divide the computation into three major chunks:

  1. The feature space distance between all training data.
  2. The nearest feature space distance between training data in different CV folds.
  3. The distance between new locations (pixel) and the nearest training data in feature space.

(1.) and (2.) are needed to derive the AOA threshold and solely based on the training data and cross validation configuration. They are computed together in the trainDI function. (3.) is needed to calculate the dissimilarity index of the predictor stack in the aoa function. Under the hood, the aoa function calls trainDI as it's first step and then calculates the DI of the predictor stack.

You can speed up computation times of aoa by splitting these two processes and calculate the DI on multiple raster tiles at once.

trainDI

library(CAST)
library(caret)
library(terra)
library(sf)

data("splotdata")
predictors <- rast(system.file("extdata","predictors_chile.tif",package="CAST"))
splotdata <- st_drop_geometry(splotdata)
set.seed(10)
model <- train(splotdata[,names(predictors)],
               splotdata$Species_richness,
               method="rf",
               tuneLength = 1,
               importance=TRUE,
               ntrees = 20,
               trControl = trainControl(method="cv"), number = 5)
prediction <- predict(predictors, model, na.rm=TRUE)

Calling trainDI just needs the model object, as this also contains the training data, cross-validation information and variable importance. The result of trainDI then contains all the necessary information from the training data and model that the aoa function needs to compute the DI for new locations.

tdi = trainDI(model, verbose = FALSE)
print(tdi)
class(tdi)
str(tdi)
# you can save the trainDI object for later application
saveRDS(tdi, "path/to/file")

AOA for multiple rasters

If you have a very large area of interest for which you want to derive the Area of Applicability, computing the AOA for the entire area at once might not be possible or takes a very long time. You can use a tile based approach to apply the trainDI object from the previous step to multiple rasters at once.

r1 = crop(predictors, c(-75.66667, -67, -30, -17.58333))
r2 = crop(predictors, c(-75.66667, -67, -45, -30))
r3 = crop(predictors, c(-75.66667, -67, -55.58333, -45))


plot(r1[[1]],main = "Tile 1")
plot(r2[[1]],main = "Tile 2")
plot(r3[[1]],main = "Tile 3")

Use the trainDI argument in the aoa function to specify, that you want to use a previously computed trainDI object.

aoa_r1 = aoa(newdata = r1, trainDI = tdi)

plot(r1[[1]], main = "Tile 1: Predictors")
plot(aoa_r1$DI, main = "Tile 1: DI")
plot(aoa_r1$AOA, main = "Tile 1: AOA")

You can now run the aoa function in parallel on the different tiles! Of course you can use for favorite parallel backend for this task, here we use mclapply from the parallel package.

library(parallel)

tiles_aoa = mclapply(list(r1, r2, r3), function(tile){
  aoa(newdata = tile, trainDI = tdi)

}, mc.cores = 3)
tiles_aoa = lapply(list(r1, r2, r3), function(tile){
  aoa(newdata = tile, trainDI = tdi)
})
plot(tiles_aoa[[1]]$AOA, main = "Tile 1")
plot(tiles_aoa[[2]]$AOA, main = "Tile 2")
plot(tiles_aoa[[3]]$AOA, main = "Tile 3")

For even larger tasks it might be useful to save the tiles to you hard-drive and load them one by one to avoid filling up your RAM.

# Simple Example Code for raster tiles on the hard drive

tiles = list.files("path/to/tiles", full.names = TRUE)

tiles_aoa = mclapply(tiles, function(tile){
  current = terra::rast(tile)
  aoa(newdata = current, trainDI = model_random_trainDI)

}, mc.cores = 3)


environmentalinformatics-marburg/CAST documentation built on April 27, 2024, 2:07 p.m.