knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
Working with machine learning methods and spatial data can be computationally expensive. Especially for larger scale applications like global maps (Ludwig et. al 2024) or high resolution data (e.g. from drones) computation times of CAST methods can be quite long. This vignette goes over various options for efficient spatial modelling workflows and parallelization options in order to speed up computation times of CAST methods.
As the forward feature selection basically is a brute force grid search of the best predictor variable combination, during ffs
a large number of machine learning models are trained to compare their performance. These model trainings can be run in parallel. On Unix machines, this is implemented using mclapply
from the parallel-package
. Simply set the cores
argument to a value > 1
to use multiple cores for the model training. On windows machines, you will get the warning Parallel computations of ffs only implemented on unix systems. cores is set to 1
.
data("splotdata") spatial_cv = CreateSpacetimeFolds(splotdata, spacevar = "Biome", k = 5) ctrl <- trainControl(method="cv",index = spatial_cv$index) ffsmodel <- ffs(predictors = splotdata[,6:16], response = splotdata$Species_richness, tuneLength = 1, method = "rf", trControl = ctrl, ntree = 20, seed = 1, cores = 4)
Regardless of the system, you can speed up the model training in ffs
, as caret::train
has a parallelization option that trains models on multiple cores. The code below is adapated from https://topepo.github.io/caret/parallel-processing
library(doParallel) data("splotdata") spatial_cv = CreateSpacetimeFolds(splotdata, spacevar = "Biome", k = 4) ctrl <- trainControl(method="cv",index = spatial_cv$index) cl <- makeCluster(4) registerDoParallel(cl) ffsmodel <- ffs(predictors = splotdata[,6:16], response = splotdata$Species_richness, tuneLength = 4, method = "rf", trControl = ctrl, ntree = 20, seed = 1, cores = 1) stopCluster(cl)
Another option is the use of a different model algorithm. For example, the ranger
package is a fast and multicore implementation of the random forest model.
ffsmodel <- ffs(predictors = splotdata[,6:16], response = splotdata$Species_richness, method = "ranger")
Estimating the Area of Applicability (AOA) can be computationally expansive as the method is based on finding minimum distances and nearest neighbors. We can divide the computation into three major chunks:
(1.) and (2.) are needed to derive the AOA threshold and solely based on the training data and cross validation configuration. They are computed together in the trainDI
function. (3.) is needed to calculate the dissimilarity index of the predictor stack in the aoa
function. Under the hood, the aoa
function calls trainDI
as it's first step and then calculates the DI of the predictor stack.
You can speed up computation times of aoa
by splitting these two processes and calculate the DI on multiple raster tiles at once.
library(CAST) library(caret) library(terra) library(sf) data("splotdata") predictors <- rast(system.file("extdata","predictors_chile.tif",package="CAST")) splotdata <- st_drop_geometry(splotdata)
set.seed(10) model <- train(splotdata[,names(predictors)], splotdata$Species_richness, method="rf", tuneLength = 1, importance=TRUE, ntrees = 20, trControl = trainControl(method="cv"), number = 5) prediction <- predict(predictors, model, na.rm=TRUE)
Calling trainDI
just needs the model object, as this also contains the training data, cross-validation information and variable importance. The result of trainDI
then contains all the necessary information from the training data and model that the aoa
function needs to compute the DI for new locations.
tdi = trainDI(model, verbose = FALSE) print(tdi) class(tdi) str(tdi)
# you can save the trainDI object for later application saveRDS(tdi, "path/to/file")
If you have a very large area of interest for which you want to derive the Area of Applicability, computing the AOA for the entire area at once might not be possible or takes a very long time. You can use a tile based approach to apply the trainDI
object from the previous step to multiple rasters at once.
r1 = crop(predictors, c(-75.66667, -67, -30, -17.58333)) r2 = crop(predictors, c(-75.66667, -67, -45, -30)) r3 = crop(predictors, c(-75.66667, -67, -55.58333, -45)) plot(r1[[1]],main = "Tile 1") plot(r2[[1]],main = "Tile 2") plot(r3[[1]],main = "Tile 3")
Use the trainDI
argument in the aoa
function to specify, that you want to use a previously computed trainDI object.
aoa_r1 = aoa(newdata = r1, trainDI = tdi) plot(r1[[1]], main = "Tile 1: Predictors") plot(aoa_r1$DI, main = "Tile 1: DI") plot(aoa_r1$AOA, main = "Tile 1: AOA")
You can now run the aoa function in parallel on the different tiles! Of course you can use for favorite parallel backend for this task, here we use mclapply from the parallel
package.
library(parallel) tiles_aoa = mclapply(list(r1, r2, r3), function(tile){ aoa(newdata = tile, trainDI = tdi) }, mc.cores = 3)
tiles_aoa = lapply(list(r1, r2, r3), function(tile){ aoa(newdata = tile, trainDI = tdi) })
plot(tiles_aoa[[1]]$AOA, main = "Tile 1") plot(tiles_aoa[[2]]$AOA, main = "Tile 2") plot(tiles_aoa[[3]]$AOA, main = "Tile 3")
For even larger tasks it might be useful to save the tiles to you hard-drive and load them one by one to avoid filling up your RAM.
# Simple Example Code for raster tiles on the hard drive tiles = list.files("path/to/tiles", full.names = TRUE) tiles_aoa = mclapply(tiles, function(tile){ current = terra::rast(tile) aoa(newdata = current, trainDI = model_random_trainDI) }, mc.cores = 3)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.