knitr::opts_chunk$set(echo = TRUE)
require(devtools)
require(tidyr)
require(dplyr)
require(sp)
require(stringr)
require(ggplot2)
devtools::load_all()
spcv, short for spatial cross validation, is a minimalist R package for anticipating prediction errors in spatially interpolated data. For now, it is based on hold-one-out cross validation, and it is meant to work with the R spatial statistics package gstat.
spcv can currently be used only to examine the magnitude and distribution of errors generated by inverse distance weighted (IDW) interpolations. With minor modification it could be adapted to cross-validate other gstat interpolators (i.e., simple, ordinary, block, and universal kriging).
Inverse distance weighting (or IDW for short) is a simple but flexible interpolation method. An estimate at an unknown point is calculated as a weighted average of the values available at the known points, where the relative contribution of each point is an inverse function of distance (usually 1 / dist(x[i], x[j])^p). The power parameter (p) can be increased to lend greater weight to near-field points.
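To make the weighting concrete, here is a minimal sketch of the scheme described above; the function name idw_estimate and its arguments are illustrative only and are not part of spcv or gstat.
idw_estimate <- function(x0, xy, z, p = 2) {
  # x0: coordinates of the unknown point; xy: matrix of known coordinates; z: known values
  d <- sqrt(rowSums((xy - matrix(x0, nrow(xy), ncol(xy), byrow = TRUE))^2))
  w <- 1 / d^p           # inverse-distance weights
  sum(w * z) / sum(w)    # weighted average of the known values
}
idw_estimate(c(0, 0), xy = cbind(c(1, 2, 3), c(0, 1, 1)), z = c(10, 20, 15), p = 2)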
IDW differs from kriging methods in that it does not rely on distributional assumptions about the underlying data. IDW is easy to tune and implement, but it offers no upfront prediction of the bias of its estimates. With IDW, accuracy is highly site- and dataset-specific. It would therefore be nice to have an easy-to-use tool for scrutinizing IDW interpolation accuracy.
spcv::spcv() is the main function of the spcv package. It has three inputs:

- df.sp: a SpatialPointsDataFrame containing the observed points
- my_idp: the inverse distance weighting power parameter (p)
- var_name: the name of the variable in df.sp to interpolate (e.g., "conc")

The examples below also pass an optional maxdist argument, which restricts interpolation to a local neighborhood.
Calling spcv() on a SpatialPointsDataFrame will implement IDW interpolation with hold-one-out cross validation.
It will return a list object containing, among other things, the components used in the examples below:

- cv.input: the input SpatialPointsDataFrame of observed values
- cv.error: the hold-one-out prediction error for each point
- cv.rmse: the root mean squared error (RMSE) across all hold-outs
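As a quick orientation, a minimal call might look like the following, where m.sample is the example SpatialPointsDataFrame introduced below (argument and component names are taken from the examples later in this vignette):
cv <- spcv(df.sp = m.sample, my_idp = 2, var_name = "conc")
names(cv)     # e.g. cv.input, cv.error, cv.rmse
cv$cv.rmse    # overall cross-validation RMSE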
spcv::convert_spdf_for_ggplot2() is a useful helper for converting a SpatialPointsDataFrame into a data frame for plotting with ggplot2.
Consider m.sample, a SpatialPointsDataFrame object consisting of 100 points in {x, y} coordinate space, containing chemical results at various concentrations ("conc"). It is easily plotted using spcv::convert_spdf_for_ggplot2() (also see spcv::theme_basic() for reproducing the aesthetic below).
m.sample <- m3_sample  # m3_sample is part of spcv
head(m.sample, 3)
ggplot(spcv::convert_spdf_for_ggplot2(m.sample, "conc"), aes(x, y, col = z)) +
  geom_point() +
  spcv::theme_basic()
m.sample <- readRDS("../data-raw/spdf_sample1.rds")
grid <- readRDS("../data-raw/hr_grid.rds")

idw.1 <- gstat::idw(conc ~ 1, m.sample, grid, idp = 1)
idw.2 <- gstat::idw(conc ~ 1, m.sample, grid, idp = 2)
idw.4 <- gstat::idw(conc ~ 1, m.sample, grid, idp = 4)
idw.8 <- gstat::idw(conc ~ 1, m.sample, grid, idp = 8)

gg.1 <- ggplot(convert_spdf_for_ggplot2(idw.1), aes(x, y, fill = z)) + geom_tile() + theme_basic() + ggtitle("bandwidth = 1")
gg.2 <- ggplot(convert_spdf_for_ggplot2(idw.2), aes(x, y, fill = z)) + geom_tile() + theme_basic() + ggtitle("bandwidth = 2")
gg.4 <- ggplot(convert_spdf_for_ggplot2(idw.4), aes(x, y, fill = z)) + geom_tile() + theme_basic() + ggtitle("bandwidth = 4")
gg.8 <- ggplot(convert_spdf_for_ggplot2(idw.8), aes(x, y, fill = z)) + geom_tile() + theme_basic() + ggtitle("bandwidth = 8")

require(gridExtra)
grid.arrange(gg.1, gg.2, gg.4, gg.8, ncol = 4)
cv.1  <- spcv(df.sp = m.sample, my_idp = 1,  var_name = "conc")
cv.2  <- spcv(df.sp = m.sample, my_idp = 2,  var_name = "conc")
cv.4  <- spcv(df.sp = m.sample, my_idp = 4,  var_name = "conc")
cv.8  <- spcv(df.sp = m.sample, my_idp = 8,  var_name = "conc")
cv.16 <- spcv(df.sp = m.sample, my_idp = 16, var_name = "conc")
par(mfrow = c(1, 5), mar = c(4, 4, 3, 3))
ymx <- 8; ymn <- -8
plot(cv.1$cv.input$conc, cv.1$cv.error, main = paste("p =", 1), ylim = c(ymn, ymx),
     xlab = "hold-out value", ylab = "prediction error"); abline(0, 0, col = "gray")
plot(cv.2$cv.input$conc, cv.2$cv.error, main = paste("p =", 2), ylim = c(ymn, ymx),
     xlab = "hold-out value", ylab = "prediction error"); abline(0, 0, col = "gray")
plot(cv.4$cv.input$conc, cv.4$cv.error, main = paste("p =", 4), ylim = c(ymn, ymx),
     xlab = "hold-out value", ylab = "prediction error"); abline(0, 0, col = "gray")
plot(cv.8$cv.input$conc, cv.8$cv.error, main = paste("p =", 8), ylim = c(ymn, ymx),
     xlab = "hold-out value", ylab = "prediction error"); abline(0, 0, col = "gray")
plot(cv.16$cv.input$conc, cv.16$cv.error, main = paste("p =", 16), ylim = c(ymn, ymx),
     xlab = "hold-out value", ylab = "prediction error"); abline(0, 0, col = "gray")
Low values of the bandwidth power parameter (p) give distant points more influence when interpolating. By contrast, increasing p assigns greater influence to the values closest to the interpolated point. For this data, the errors associated with estimates from low-bandwidth interpolation (i.e., p < 2) are similar to those from using the global mean as the estimator for each unknown point; medium- and high-concentration samples are systematically underpredicted.
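To see the global-mean comparison directly, here is a small sketch (not part of spcv) that computes hold-one-out errors for a global-mean estimator; the sign convention (prediction minus hold-out value) is an assumption chosen to match the plots above.
gm.error <- sapply(seq_along(m.sample$conc), function(i) {
  mean(m.sample$conc[-i]) - m.sample$conc[i]  # predicted (global mean of the rest) minus hold-out value
})
plot(m.sample$conc, gm.error, xlab = "hold-out value", ylab = "prediction error",
     main = "global-mean estimator", ylim = c(-8, 8))
abline(0, 0, col = "gray")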
For this data, spatial cross validation illustrates how increasing the inverse distance weighting bandwidth parameter (p) reduces the magnitude of interpolation errors, but only up to a point. Which value of p is best? That probably depends on the goal of the study. For instance, from the standpoint of human or ecological health, underpredicting high concentrations of an acute toxin may pose a larger risk than overestimating low concentrations.
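If such asymmetric costs matter, one option is to score the cross-validation errors with an asymmetric loss rather than RMSE. The sketch below is illustrative only: the weight of 2 on underpredictions is arbitrary, and it assumes cv.error is prediction minus hold-out value (so negative errors are underpredictions).
asym_loss <- function(err, under_weight = 2) {
  # penalize underpredictions (negative errors, under the assumed sign convention) more heavily
  mean(ifelse(err < 0, under_weight * abs(err), abs(err)), na.rm = TRUE)
}
sapply(list(p1 = cv.1, p2 = cv.2, p4 = cv.4, p8 = cv.8, p16 = cv.16),
       function(x) asym_loss(x$cv.error))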
Rather than considering the entire dataset in each IDW estimate, interpolation can be restricted to points within some local neighborhood (in gstat, this is specified with the maxdist argument).
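For example, a gstat IDW call restricted to a local neighborhood might look like this (the maxdist value of 2 units is arbitrary and simply matches the spcv example below):
idw.2.local <- gstat::idw(conc ~ 1, m.sample, grid, idp = 2, maxdist = 2)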
For this data set, spcv shows that neighborhood size and inverse distance weighting bandwidth (p) are related but not entirely redundant. Shrinking the neighborhood size appears to reduce the magnitude of interpolation errors, although the returns diminish at higher bandwidth values. There are also clear reasons to consider local over global interpolation: in particular, a smaller local interpolation neighborhood appears to ameliorate overprediction error for lower-concentration samples.
my_maxdist <- 2
cv.1.md  <- spcv(df.sp = m.sample, my_idp = 1,  var_name = "conc", maxdist = my_maxdist)
cv.2.md  <- spcv(df.sp = m.sample, my_idp = 2,  var_name = "conc", maxdist = my_maxdist)
cv.4.md  <- spcv(df.sp = m.sample, my_idp = 4,  var_name = "conc", maxdist = my_maxdist)
cv.8.md  <- spcv(df.sp = m.sample, my_idp = 8,  var_name = "conc", maxdist = my_maxdist)
cv.16.md <- spcv(df.sp = m.sample, my_idp = 16, var_name = "conc", maxdist = my_maxdist)
par(mfrow = c(1, 5), mar = c(4, 4, 3, 3))
ymx <- 8; ymn <- -8

plot(cv.1$cv.input$conc, cv.1$cv.error, main = paste("p =", 1, "\nmaxdist = ", my_maxdist),
     ylim = c(ymn, ymx), xlab = "hold-out value", ylab = "prediction error")
abline(0, 0, col = "gray")
points(cv.1.md$cv.input$conc, cv.1.md$cv.error, col = "#0000FF50")

plot(cv.2$cv.input$conc, cv.2$cv.error, main = paste("p =", 2, "\nmaxdist = ", my_maxdist),
     ylim = c(ymn, ymx), xlab = "hold-out value", ylab = "prediction error"); abline(0, 0, col = "gray")
points(cv.2.md$cv.input$conc, cv.2.md$cv.error, col = "#0000FF50")

plot(cv.4$cv.input$conc, cv.4$cv.error, main = paste("p =", 4, "\nmaxdist = ", my_maxdist),
     ylim = c(ymn, ymx), xlab = "hold-out value", ylab = "prediction error"); abline(0, 0, col = "gray")
points(cv.4.md$cv.input$conc, cv.4.md$cv.error, col = "#0000FF50")

plot(cv.8$cv.input$conc, cv.8$cv.error, main = paste("p =", 8, "\nmaxdist = ", my_maxdist),
     ylim = c(ymn, ymx), xlab = "hold-out value", ylab = "prediction error"); abline(0, 0, col = "gray")
points(cv.8.md$cv.input$conc, cv.8.md$cv.error, col = "#0000FF50")

plot(cv.16$cv.input$conc, cv.16$cv.error, main = paste("p =", 16, "\nmaxdist = ", my_maxdist),
     ylim = c(ymn, ymx), xlab = "hold-out value", ylab = "prediction error"); abline(0, 0, col = "gray")
points(cv.16.md$cv.input$conc, cv.16.md$cv.error, col = "#0000FF50")
If things closer together in geographical space tend to be more alike, it's tempting to push the gas on bandwidth and pump the brakes on neighborhood size (i.e., shrink the neighborhood while allowing all points within that neighborhood to contribute more strongly to the interpolation estimate). But eventually, I would expect the size of the data set to constrain the efficacy of this strategy. If we shrink the neighborhood too much, we will be making each estimate from an extremely small sample; accuracy in an average sense may or may not suffer, but I would expect the variance to be high.
spcv::spcv() helps us consider this trade-off and predict, for a given data set, when an interpolation neighborhood might be too constrained.
my_maxdist <- 1
cv.1.md1  <- spcv(df.sp = m.sample, my_idp = 1,  var_name = "conc", maxdist = my_maxdist)
cv.2.md1  <- spcv(df.sp = m.sample, my_idp = 2,  var_name = "conc", maxdist = my_maxdist)
cv.4.md1  <- spcv(df.sp = m.sample, my_idp = 4,  var_name = "conc", maxdist = my_maxdist)
cv.8.md1  <- spcv(df.sp = m.sample, my_idp = 8,  var_name = "conc", maxdist = my_maxdist)
cv.16.md1 <- spcv(df.sp = m.sample, my_idp = 16, var_name = "conc", maxdist = my_maxdist)
par(mfrow = c(1, 3), mar = c(4, 4, 3, 3))
ymx <- 8; ymn <- -8

plot(cv.1$cv.input$conc, cv.1$cv.error, main = paste("p =", 1, "\nmaxdist = ", my_maxdist),
     ylim = c(ymn, ymx), xlab = "hold-out value", ylab = "prediction error")
abline(0, 0, col = "gray")
points(cv.1.md$cv.input$conc, cv.1.md$cv.error, col = "#0000FF50")
points(cv.1.md1$cv.input$conc, cv.1.md1$cv.error, col = "#FF000050")

plot(cv.2$cv.input$conc, cv.2$cv.error, main = paste("p =", 2, "\nmaxdist = ", my_maxdist),
     ylim = c(ymn, ymx), xlab = "hold-out value", ylab = "prediction error"); abline(0, 0, col = "gray")
points(cv.2.md$cv.input$conc, cv.2.md$cv.error, col = "#0000FF50")
points(cv.2.md1$cv.input$conc, cv.2.md1$cv.error, col = "#FF000050")
suppressWarnings(arrows(cv.2.md$cv.input$conc, cv.2.md$cv.error,
                        cv.2.md1$cv.input$conc, cv.2.md1$cv.error,
                        col = "#FF000050", length = .1))

plot(cv.4$cv.input$conc, cv.4$cv.error, main = paste("p =", 4, "\nmaxdist = ", my_maxdist),
     ylim = c(ymn, ymx), xlab = "hold-out value", ylab = "prediction error"); abline(0, 0, col = "gray")
points(cv.4.md$cv.input$conc, cv.4.md$cv.error, col = "#0000FF50")
points(cv.4.md1$cv.input$conc, cv.4.md1$cv.error, col = "#FF000050")
suppressWarnings(arrows(cv.4.md$cv.input$conc, cv.4.md$cv.error,
                        cv.4.md1$cv.input$conc, cv.4.md1$cv.error,
                        col = "#FF000050", length = .1))
For this data set, notice how estimates in red (made with a small neighborhood size of 1 unit) seem to reduce the overall interpolation bias but also tend to lead to some larger errors when compared to points in blue (estimated with a larger neighborhood size).
Why consider only a few possible parameterizations, when spcv allows one to exploit R's lapply function to quickly visualize cross-validation errors (i.e., a proxy for out-of-sample error) for a range of parameter values?
cv_range <- seq(1, 8, by = .25)
cv.t <- lapply(cv_range, function(i) spcv(df.sp = m.sample, my_idp = i, var_name = "conc"))

par(mfrow = c(1, 1), mar = c(4, 4, 2, 2))
plot(cv_range, sapply(cv.t, function(x) x$cv.rmse), col = "black", type = "p", pch = 20,
     ylab = "RMSE", xlab = expression("P"[IDW]))
par(mfrow = c(1, 3), mar = c(5, 5, 5, 5))

plot(cv_range, sapply(cv.t, function(x) x$cv.rmse), col = "black", type = "p", pch = 20,
     ylab = "RMSE", xlab = expression("P"[IDW]))

plot(cv_range, sapply(cv.t, function(x) median(x$cv.error, na.rm = TRUE)), ylim = c(-4, 4), type = "l",
     ylab = "Median Prediction Error\n90% Prediction Interval", xlab = expression("P"[IDW]))
points(cv_range, sapply(cv.t, function(x) median(x$cv.error, na.rm = TRUE)), col = "black", type = "p", pch = 20)
points(cv_range, sapply(cv.t, function(x) quantile(x$cv.error, .95, na.rm = TRUE)), col = "gray", type = "l")
points(cv_range, sapply(cv.t, function(x) quantile(x$cv.error, .05, na.rm = TRUE)), col = "gray", type = "l")

plot(cv_range, sapply(cv.t, function(x) sum(x$cv.error, na.rm = TRUE)), col = "black", type = "p", pch = 20,
     ylab = expression("Overall Bias (" ~ Sigma ~ CV[error] ~ ")"), xlab = expression("P"[IDW]))
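As a simple way to read a candidate value of p off the sweep above, one could pull out the RMSE-minimizing element of cv.t, keeping in mind the earlier caveat that the "best" p depends on the goals of the study:
cv.rmse.all <- sapply(cv.t, function(x) x$cv.rmse)
cv_range[which.min(cv.rmse.all)]   # p with the lowest hold-one-out RMSE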