Simple random selection of testing and training data in structured environment leads to underestimation of error in the evaluation of spatial predictions and may result in inappropriate model selection (Telford and Birks, 2009; Roberts et al., 2017). The use of spatial and environmental blocks to separate test and train sets has been suggested as a good strategy for realistic error estimation in datasets with dependence structures, and more generally as a robust method for estimating predictive performance of models used to predict mapped distributions (Roberts et al., 2017). Package blockCV provides functions to separate train and test sets using buffers, spatial and environmental blocks. It provides several options for how those blocks are constructed. It also has a function that applies geostatistical techniques to investigate the existing level of spatial autocorrelation in the covariates to inform the choice of a suitable distance band by which to separate the data sets. In addition, some visualization tools are provided to help the user choose the block size and explore generated folds. The package has been written with Species Distribution Modelling (SDM) in mind, and the functions allow for a number of common scenarios (including presence-absence and presence-background species data, rare and common species, raster data for predictor variables). However, it can be used for any kind of spatial modelling.
You can find more information about blocking strategies of blockCV package and in general block cross-validation technique in the package associated paper here.
This document presents the main functions of the package and illustrates its usage with three examples: modelling using randomForest, maxnet (new implementation of Maxent software in R) and biomod2 packages.
The blockCV
is available in GitHub and it will be available in CRAN soon. It is recommended to install the dependencies of the package. To install the package from GitHub use:
devtools::install_github("rvalavi/blockCV")
# loading the package library(blockCV)
The package contains the raw format of the following data:
These data are used to illustrate how the package is used. The raster data include several bioclimatic and topographic variables from Australian Wet Tropic region aggregated to 800 m resolution. The species data contains records of a species, simulated based on the above environmental variables for the region. There are two .csv files with presence-absence and presence-background data.
To load the package raster data use:
# loading raster library library(raster) # import raster data awt <- raster::brick(system.file("extdata", "awt.grd", package = "blockCV"))
The presence absence species data include 116 presence points and 138 absence points. The appropriate format of species data for the blockCV package is SpatialPointsDataFrame (or sf). We convert the data.frame to SpatialPointsDataFrame as follows:
# import presence-absence species data PA <- read.csv(system.file("extdata", "PA.csv", package = "blockCV")) # make a SpatialPointsDataFrame object from data.frame pa_data <- SpatialPointsDataFrame(PA[,c("x", "y")], PA, proj4string=crs(awt)) # see the first few rows head(pa_data) # plot species data on the map plot(awt[[1]]) # plot raster data points(pa_data[which(pa_data$Species==1), ], col="red") # add presence points points(pa_data[which(pa_data$Species==0), ], col="blue") # add absence points legend(x=500000, y=8250000, legend=c("Presence","Absence"), col=c(2, 4), pch=c(1,1), bty="n")
The presence background data include 116 presence points and 10,000 random background points (0s here).
# import presence-background species data PB <- read.csv(system.file("extdata", "PB.csv", package = "blockCV")) # make a SpatialPointsDataFrame object from data.frame pb_data <- SpatialPointsDataFrame(PB[,c("x", "y")], PB, proj4string=crs(awt)) # number of presence and background records table(pb_data$Species)
The function spatialBlock creates spatially separated folds based on a pre-specified distance (cell size of the blocks). It then assigns blocks to the training and testing folds with random, checkerboard pattern or in a systematic manner. Another blocking strategy provided by this function is to divide the study area into vertical or horizontal bins of a given number of rows/colmuns, as used by Bahn & McGill (2013) and Wenger & Olden (2012) respectively.
To keep the consistency with other functions, the distance (theRange
) should be in metres, regardless of the unit of the reference system of the input data. When the input map has geographic coordinate system (decimal degrees), the block size is calculated based on dividing theRange by 111325 (the standard distance of a degree in metres, on the Equator) to change metre to degree. This value can be changed by the user via the degMetre
argument.
The xOffset
and yOffset
can be used to shift the spatial position of the blocks in horizontal and vertical axes, respectively. This only works when the block have been built based on theRange
. The blocks
argument allows users to define an external spatial polygon as blocking layer. The polygon layer must cover all the species points. In addition, blocks can be masked by species spatial data. This option keeps the blocks that cover species data and remove the rest.
# spatial blocking by specified range with random assignment sb <- spatialBlock(speciesData = pa_data, species = "Species", rasterLayer = awt, theRange = 68000, # size of the blocks k = 5, selection = "random", iteration = 250, # find evenly dispersed folds biomod2Format = TRUE, xOffset = 0, # shift the blocks horizontally yOffset = 0)
# spatial blocking by rows and columns with checkerboard assignment sb3 <- spatialBlock(speciesData = pa_data, species = "Species", rasterLayer = awt, rows = 5, cols = 6, k = 5, selection = "checkerboard", maskBySpecies = TRUE, biomod2Format = TRUE)
The assignment of folds to each block can also be done in a systematic manner using selection = "systematic"
argument.
# spatial blocking by rows with systematic assignment sb2 <- spatialBlock(speciesData = pa_data, species = "Species", rasterLayer = awt, rows = 6, k = 3, selection = "systematic", biomod2Format = TRUE)
For visualising the species data on top of the spatial blocks, one can use geom_point function of the ggplot2 package. However, a more sophisticated way of plotting each fold separately is presented in the visualisation tools section.
# adding points on saptialBlock plot sb$plots + geom_point(data = as.data.frame(coordinates(pa_data)), aes(x=x, y=y), alpha=0.6)
The function Buffering generates spatially separated train and test folds by considering buffers of specified distance around each observation point. This approach is a form of leave-one-out cross-validation. Each fold is generated by excluding nearby observations around each testing point within the specified distance (ideally the range of spatial autocorrelation). In this method the test set never directly abuts a training presence or absence.
When working with presence-background (presence and pseudo-absence) data (specified by spDataType
argument), only presence records are used for specifying the folds. Consider a target presence point. The buffer is defined around this target point, using the specified range (theRange
). The testing fold comprises the target presence point and all background points within the buffer. Any non-target presence points inside the buffer are excluded. All points (presence and background) outside of buffer are used for training set. The method cycles through all the presence data, so the number of folds is equal to the number of presence points in the dataset.
For presence-absence data, folds are created based on all records, both presences and absences. As above, a target observation (presence or absence) forms a test point. All presence and absence points other than the target point within the buffer are ignored, and the training set comprises all presences and absences outside the buffer. Apart from the folds, the number of training-presence, training-absence, testing-presence and testing-absence points is stored and returned in the records
table. If species = NULL
, the procedure is like presence-absence data. The species
argument is the name of the column with 0s (absences or backgrounds) and 1s (presences) in the species SpatialPointsDataFrame or sf object.
# buffering with presence-absence data bf1 <- buffering(speciesData = pa_data, species = "Species", # to count the number of presences and absences theRange = 68000, spDataType = "PA", progress = T)
In the following buffering example, presence-background data are used. As explained above, by default the background data within any target point will remain in the testing fold. This can be changed by setting addBG = FALSE
(this option only works when spDataType = "PB"
; note the default value is "PA"
).
# buffering with presence-background data bf2 <- buffering(speciesData = pb_data, # presence-background data species = "Species", theRange = 68000, spDataType = "PB", addBG = TRUE, # add background data to testing folds progress = T)
The function envBlock uses clustering methods to specify sets of similar environmental conditions based on the input covariates. Species data corresponding to any of these groups or clusters are assigned to a fold.
As k-means algorithms use Euclidean distance to estimate clusters, the input covariates should be quantitative variables. Since variables with wider ranges of values might dominate the clusters and bias the environmental clustering (Hastie et al., 2009), all the input rasters are first standardized within the function. This is done either by normalizing based on subtracting the mean and dividing by the standard deviation of each raster (the default) or optionally by standardizing using linear scaling to constrain all raster values between 0 and 1. By default, the clustering is done in the raster space. In this approach, the clusters will be consistent throughout the region and across species (in the same region). However, this may result in cluster(s) that cover none of the species records especially when species data is not dispersed throughout the region or the number of clusters (k or folds) is high. In this case, the number of folds is less than the specified k
. If rasterBlock = FALSE
, the clustering will be done based only on the values of the predictors at the species presence and absence/background points. In this case, and the number of the folds will be the same as k
.
Note that the input raster layer should cover all the species points, otherwise an error will rise. The records with no raster value should be deleted prior to the analysis.
# environmental clustering eb <- envBlock(rasterLayer = awt, speciesData = pa_data, species = "Species", k = 5, standardization = "standard", # rescale variables between 0 and 1 rasterBlock = FALSE, numLimit = 50)
To support a first choice of block size, prior to any model fitting, package blockCV includes the option for the user to look at the existing autocorrelation in the predictors, as an indication of landscape spatial structure in their study area. The tool does not suggest any absolute solution to the problem, but serves as a guide to the user. The function works by automatically fitting variograms to each continuous raster and finding the effective range of spatial autocorrelation. Variogram is a fundamental geostatistical tool for measuring spatial autocorrelation. It does so by assessing variability between all pairs of points (O’Sullivan and Unwin, 2010). It provides information about the effective range of spatial autocorrelation which is the range over which observations are independent.
sac <- spatialAutoRange(rasterLayer = awt, sampleNumber = 5000, border = NULL, showPlots = TRUE, plotVariograms = FALSE, doParallel = FALSE)
The plotted block size is based on the median of the spatial autocorrelation ranges. This could be as the minimum block size for creating spatially separated folds. Variograms are computed taking a number of random points (5000 as default) from each input raster file, using parallel processing to speed up the computation (optional). The variogram fitting procedure is done using automap package (Hiemstra et al., 2009), using the isotropic variogram and assuming the data meets the geostatistical criteria e.g. stationarity.
The output object of this function is an SpatialAutoRange object, an object of class S3.
# class of the output result class(sac)
To see the details of the fitted variograms:
# summary statistics of the output summary(sac)
To visualise them:
plot(sac$variograms[[1]])
Package blockCV provides two major visualisation tools for graphical exploration of the generated folds and assisting in block size selection. These tools have been developed as local web applications using R-package shiny. With rangeExplorer, the user can choose among block sizes in a specified range, visualise the resulting blocks interactively, viewing the impact of block size on number and arrangement of blocks in the landscape (and optionally on the distribution of species data in those blocks). The foldExplorer tool displays folds and the number of records in each fold; it works for all three blocking methods.
# explore generated folds foldExplorer(blocks = sb, rasterLayer = awt, speciesData = pa_data)
# explore the block size rangeExplorer(rasterLayer = awt) # the only mandatory input # add species data to add them on the map rangeExplorer(rasterLayer = awt, speciesData = pa_data, species = "Species", rangeTable = NULL, minRange = 30000, # limit the search domain maxRange = 100000)
Note that the interactive plots cannot be shown here, as they require opening an external window or web browser. When using rangeExplorer, slide to the selected block size, and click Apply Changes to change the block size.
In this section, we show how to use the folds generated by blockCV in the previous sections for the evaluation of species distribution models constructed on the species data available in the package. The blockCV stores training and testing folds in three different formats. The common format for all three blocking strategies is a list of the id of observations in each fold. For spatialBlock and envBlock (not buffering) the folds are also stored in a matrix format suitable for the biomod2 package and a vector of fold's number for each observation. This is equal to the number of observation in species spatial data. These three formats are stored in the blocking objects as folds, biomodTable and foldID respectively. We show three modelling examples which cover both the use of presence-absence and presence-background methods.
The code below shows how to evaluate a presence-background model, where maxnet package is used for model fitting. Maxnet is a newly developed package by Phillips et. al., (2017) to model species distributions from occurrences and environmental variables, using glmnet for model fitting. The maxnet package is the implementation of Maxent software in R programming language.
# loading the libraries library(maxnet) library(plotROC) # extract the raster values for the species points as a dataframe mydata <- extract(awt, pb_data, df=TRUE) # the extract function creates an ID column that should be excluded for the modelling # remove the extra column mydata <- mydata[,2:ncol(mydata)] # create a vector of 1 (for presence) and 0 (for background) pb <- pb_data$Species # extract the folds in buffering object created in the previous section (with presence-background data) folds <- bf2$folds # create an empty vector to store the AUC of each fold AUCs <- vector() for(k in 1:length(folds)){ trainSet <- unlist(folds[[k]][1]) # extract the training set indices testSet <- unlist(folds[[k]][2]) # extract the testing set indices # fitting a maxent model using linear, quadratic and hinge features mx <- maxnet(pb[trainSet], mydata[trainSet, ], maxnet.formula(pb[trainSet], mydata[trainSet, ], classes="lqh")) testTable <- pb_data@data[testSet, ] # a table for testing predictions and reference data testTable$pred <- predict(mx, mydata[testSet, ], type="cloglog") # predict the test set # calculate AUC using calc_auc function in plotROC package auc <- calc_auc(ggplot(testTable, aes(m=pred, d=Species)) + geom_roc(n.cuts = 0))[3] AUCs[k] <- as.numeric(auc) } # print the mean and standard deviation of AUCs print(mean(AUCs)) print(sd(AUCs))
In the second example, we use blocking for evaluating a presence-absence model created using the Random Forest algorithm. Folds generated by buffering are used here (a training and testing fold for each record).
# loading the libraries library(randomForest) library(plotROC) # extract the raster values for the species points as a dataframe mydata <- extract(awt, pa_data, df=TRUE) # adding species column to the dataframe mydata$Species <- as.factor(pa_data$Species) # remove extra column (ID) mydata <- mydata[,2:ncol(mydata)] # extract the folds in BufferedBlock object created in the previous section folds <- bf1$folds # create a data.frame to store the prediction of each fold (record) testTable <- pa_data@data testTable$pred <- NA for(k in 1:length(folds)){ trainSet <- unlist(folds[[k]][1]) # extract the training set indices testSet <- unlist(folds[[k]][2]) # extract the testing set indices rf <- randomForest(Species~., mydata[trainSet, ], ntree=250) # model fitting on training set testTable[testSet,"pred"] <- predict(rf, mydata[testSet, ], type="prob")[,2] # predict the test set } # calculate Area Under the ROC curve and plot the result using plotROC package ggROC <- ggplot(testTable, aes(m=pred, d=Species)) + geom_roc(n.cuts=0, color='red') + coord_equal() + geom_abline(intercept = 0, slope = 1) + theme_bw() + labs(x="False positive rate (1 - specificity)", y="True positive rate (sensitivity)") auc <- calc_auc(ggROC)[3] plot(ggROC + ggtitle('', subtitle=paste("AUC for testing dataset:", signif(auc, 4))))
Note that in this form of folds (buffering using presence-absence data or with species = NULL
) there is only one point in each testing fold, and therefore AUC cannot be calculated for each fold. Instead, the value of each point is first predicted, and then a unique AUC is calculated for the full set of predictions.
Package biomod2 (Thuiller et al., 2017) is a commonly used platform for modelling species distributions in an ensemble framework. In this example, we show how to use blockCV folds in biomod2. In this example, the DataSplitTable generated by spatialBlock is used to evaluate three modelling methods implemented in biomod2. The DataSplitTable can be generated by both spatialBlock()
and envBlock()
functions and it is stored as biomodTable in their output objects.
# loading the library library(biomod2) # species occurrences DataSpecies <- read.csv(system.file("extdata", "PA.csv", package = "blockCV")) # the name of studied species myRespName <- "Species" # the presence/absences data for our species myResp <- as.numeric(DataSpecies[,myRespName]) # the XY coordinates of species data myRespXY <- DataSpecies[,c("x","y")] # change the RasterBrick to RasterStack awt <- stack(awt) # 1. Formatting Data myBiomodData <- BIOMOD_FormatingData(resp.var = myResp, expl.var = awt, # explanatory raster data resp.xy = myRespXY, resp.name = myRespName, na.rm = TRUE) # 2. Defining the folds for DataSplitTable # note that biomodTable should be used here not folds DataSplitTable <- sb$biomodTable # use generated folds from spatialBlock in previous section # 3. Defining Models Options using default options. myBiomodOption <- BIOMOD_ModelingOptions() # 4. Model fitting myBiomodModelOut <- BIOMOD_Modeling( myBiomodData, models = c('GLM','MARS','GBM'), models.options = myBiomodOption, DataSplitTable = DataSplitTable, # blocking folds VarImport = 0, models.eval.meth = c('ROC'), do.full.models=FALSE, modeling.id="test")
# 5. Model evaluation # get all models evaluation myBiomodModelEval <- get_evaluations(myBiomodModelOut) myBiomodModelEval["ROC","Testing.data",,,]
Note that the result of this section (biomod model evaluation) is not shown.
Bahn, V., & McGill, B. J. (2012). Testing the predictive performance of distribution models. Oikos, 122(3), 321–331.
Hiemstra, P. H., Pebesma, E. J., Twenhöfel, C. J., & Heuvelink, G. B. (2009). Real-time automatic interpolation of ambient gamma dose rates from the Dutch radioactivity monitoring network. Computers & Geosciences, 35(8), 1711–1721.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data Mining, Inference, and Prediction (2nd ed., Vol. 1). Springer series in statistics New York.
O’Sullivan, D., & Unwin, D. J. (2010). Geographic Information Analysis (2nd ed.). John Wiley & Sons.
Phillips, S. J., Anderson, R. P., Dudík, M., Schapire, R. E., & Blair, M. E. (2017). Opening the black box: an open-source release of Maxent. Ecography.
Roberts, D.R., Bahn, V., Ciuti, S., Boyce, M.S., Elith, J., Guillera-Arroita, G., Hauenstein, S., Lahoz-Monfort, J.J., Schröder, B., Thuiller, W., others, 2017. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography. 40: 913-929.
Telford, R. J., & Birks, H. J. B. (2009). Evaluation of transfer functions in spatially structured environments. Quaternary Science Reviews, 28(13), 1309–1316.
Thuiller, W., Georges, D., Engler, R., & Breiner, F. (2017). biomod2: Ensemble Platform for Species Distribution Modeling. R package version 3.3-7. https://CRAN.R-project.org/package=biomod2.
Valavi, R., Elith, J., Lahoz-Monfort, J. J., & Guillera-Arroita, G. (2018). blockCV: an R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. Methods in Ecology and Evolution, doi: 10.1111/2041‐210X.13107.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.