    knitr::opts_chunk$set(echo = TRUE)
    require(devtools)
    require(tidyr)
    require(dplyr)
    require(sp)
    require(stringr)
    require(ggplot2)
    devtools::load_all()

What is spcv?

spcv, short for spatial cross validation, is a minimalist R package for anticipating prediction errors in spatially interpolated data. For now, it is based on hold-one-out cross validation, and it is meant to work with the R spatial statistics package gstat.

spcv currently can be used only to examine the magnitude and distribution of errors generated by inverse distance weighted (IDW) interpolation. With minor modification it could be adapted to cross-validate other gstat interpolators (e.g., simple, ordinary, block, and universal kriging).
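
In sketch form, hold-one-out spatial cross validation drops each observation in turn, re-interpolates at its location from the remaining points, and records the prediction error. A minimal illustration of the idea using gstat directly (pts stands in for any SpatialPointsDataFrame with a conc column; spcv::spcv() handles this bookkeeping for you):

    # Hold-one-out cross validation by hand (illustrative sketch only)
    errors <- sapply(seq_len(nrow(pts)), function(i) {
      fit <- gstat::idw(conc ~ 1, pts[-i, ], pts[i, ], idp = 2)
      pts$conc[i] - fit$var1.pred  # held-out value minus interpolated prediction
    })
    rmse <- sqrt(mean(errors^2, na.rm = TRUE))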

What is inverse distance weighting?

Inverse distance weighting (or IDW for short) is a simple but flexible interpolation method. An estimate at an unknown point is calculated as a weighted average of the values at the known points, where the relative contribution of each known point is an inverse function of its distance to the unknown point (typically w[i] = 1 / d(x0, x[i])^p). The power parameter (p) can be increased to lend greater weight to near-field points.
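
For concreteness, here is a minimal, self-contained version of that calculation (an illustrative sketch only, not the gstat implementation; idw_estimate, x0, and xy are hypothetical names):

    # Sketch: IDW estimate at location x0 from known coordinates xy (an n x 2 matrix)
    # with observed values z; coincident points (d = 0) would need special handling
    idw_estimate <- function(x0, xy, z, p = 2) {
      d <- sqrt(rowSums(sweep(xy, 2, x0)^2))  # distances from x0 to each known point
      w <- 1 / d^p                            # inverse-distance weights
      sum(w * z) / sum(w)                     # weighted average of known values
    }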

IDW differs from kriging methods in that it does not rely on distributional assumptions about the underlying data. IDW is easy to tune and implement, but it offers no upfront prediction about the bias of its estimates. With IDW, accuracy is highly site and dataset specific. It would therefore be nice to have an easy-to-use tool for scrutinizing IDW interpolation accuracy.

How to use spcv

spcv::spcv() is the main function of the spcv package. It has three inputs:

  • df.sp - a SpatialPointsDataFrame object, which is a data class created by the sp package
  • my_idp - a numeric bandwidth (inverse distance power) parameter
  • var_name - a string specifying the column of data to be interpolated

Calling spcv() on a SpatialPointsDataFrame will implement IDW interpolation with hold-one-out cross validation.

It will return a list object containing the following (a quick usage sketch follows the list):

  • cv.input - the input SpatialPointsDataFrame object
  • cv.pred - the predictions at each held-out point
  • cv.error - differences between the held-out test points and the interpolated estimate at that location
  • cv.rmse - overall root mean square error between test points and predictions
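
For example (using m.sample from the toy example below), a call and a quick inspection of the result might look like this:

    cv <- spcv(df.sp = m.sample, my_idp = 2, var_name = "conc")
    cv$cv.rmse            # overall root mean square error
    summary(cv$cv.error)  # distribution of hold-out prediction errors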

Convenience functions in spcv

spcv::convert_spdf_for_ggplot2() is a useful function for converting a SpatialPointsDataFrame for plotting in ggplot2.

A toy example

Consider m.sample, a SpatialPointsDataFrame object consisting of 100 points in {x,y} coordinate space, containing chemical results at various concentrations (“conc”). It is easily plotted using spcv::convert_spdf_for_ggplot2() (also see spcv::theme_basic() for reproducing the aesthetic below).

    m.sample <- m3_sample  # m3_sample is part of spcv
    head(m.sample, 3)

    ggplot(spcv::convert_spdf_for_ggplot2(m.sample, "conc"),
           aes(x, y, col = z)) +
      geom_point() +
      spcv::theme_basic()
    

IDW estimates depend dramatically on the user's choice of interpolation bandwidth

    m.sample <- readRDS("../data-raw/spdf_sample1.rds")
    grid <- readRDS("../data-raw/hr_grid.rds")
    idw.1 <- gstat::idw(conc ~ 1, m.sample, grid, idp = 1)
    idw.2 <- gstat::idw(conc ~ 1, m.sample, grid, idp = 2)
    idw.4 <- gstat::idw(conc ~ 1, m.sample, grid, idp = 4)
    idw.8 <- gstat::idw(conc ~ 1, m.sample, grid, idp = 8)
    gg.1 <- ggplot(convert_spdf_for_ggplot2(idw.1), aes(x, y, fill = z)) + geom_tile() + theme_basic() + ggtitle("bandwidth = 1")
    gg.2 <- ggplot(convert_spdf_for_ggplot2(idw.2), aes(x, y, fill = z)) + geom_tile() + theme_basic() + ggtitle("bandwidth = 2")
    gg.4 <- ggplot(convert_spdf_for_ggplot2(idw.4), aes(x, y, fill = z)) + geom_tile() + theme_basic() + ggtitle("bandwidth = 4")
    gg.8 <- ggplot(convert_spdf_for_ggplot2(idw.8), aes(x, y, fill = z)) + geom_tile() + theme_basic() + ggtitle("bandwidth = 8")

    require(gridExtra)
    grid.arrange(gg.1, gg.2, gg.4, gg.8, ncol = 4)
    

Use spcv to examine interpolation errors with different bandwidths

    cv.1 <- spcv(df.sp = m.sample, my_idp = 1, var_name = "conc")
    cv.2 <- spcv(df.sp = m.sample, my_idp = 2, var_name = "conc")
    cv.4 <- spcv(df.sp = m.sample, my_idp = 4, var_name = "conc")
    cv.8 <- spcv(df.sp = m.sample, my_idp = 8, var_name = "conc")
    cv.16 <- spcv(df.sp = m.sample, my_idp = 16, var_name = "conc")
    
    par(mfrow = c(1,5), mar = c(4,4,3,3))
    ymx  = 8; ymn = -8
    plot(cv.1$cv.input$conc, cv.1$cv.error, main = paste("p =", 1), ylim = c(ymn,ymx),
         xlab = "hold-out value",
         ylab = "prediction error"); 
    abline(0,0, col = "gray")
    
    plot(cv.2$cv.input$conc, cv.2$cv.error, main = paste("p =", 2), ylim = c(ymn,ymx),
         xlab = "hold-out value",
         ylab = "prediction error"); 
    abline(0,0, col = "gray")
    
    plot(cv.4$cv.input$conc, cv.4$cv.error, main = paste("p =", 4), ylim = c(ymn,ymx),    
         xlab = "hold-out value",
         ylab = "prediction error"); 
    abline(0,0, col = "gray")
    
    plot(cv.8$cv.input$conc, cv.8$cv.error, main = paste("p =", 8), ylim = c(ymn,ymx),
         xlab = "hold-out value",
         ylab = "prediction error"); 
    abline(0,0, col = "gray")
    
    plot(cv.16$cv.input$conc, cv.16$cv.error, main = paste("p =", 16), ylim = c(ymn,ymx),
         xlab = "hold-out value",
         ylab = "prediction error"); 
    abline(0,0, col = "gray")
    

Low bandwidth (p) values give distant points more oomph when interpolating. By contrast, increasing the bandwidth power parameter assigns greater influence to the values closest to the interpolated point. For this data, the errors associated with estimates from low-bandwidth interpolation (i.e., p < 2) are similar to using the global mean as the estimator for each unknown point. Here, medium and high concentration samples are systematically underpredicted.

For this data, spatial cross validation illustrates how increasing the inverse distance weighting bandwidth parameter (p) reduces the magnitude of interpolation errors, but only up to a point. Which value of p is best? That probably depends on the goal of the study. For instance, for acute toxins, underpredicting high concentrations of contaminants may pose a larger risk than overestimating low concentrations from the standpoint of human or ecological health.

Won’t you be my neighborhood?

Rather than considering the entire dataset in each IDW estimate, interpolation can be restricted to only those points within some local neighborhood (in gstat, this is specified with the maxdist argument).
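
For instance, a neighborhood-restricted call to gstat might look like this (reusing m.sample and grid from the bandwidth example above; idw.local is just an illustrative name). Prediction locations with no points within maxdist may come back as NA:

    # Only points within 2 distance units of a prediction location contribute
    idw.local <- gstat::idw(conc ~ 1, m.sample, grid, idp = 2, maxdist = 2)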

spcv::spcv() can be used to observe the effect of tuning the neighborhood size.

For this data set, spcv shows that neighborhood size and inverse distance weighting bandwidth (p) are related but not entirely redundant. Shrinking the neighborhood size appears to reduce the magnitude of interpolation errors, although diminishing returns kick in at higher bandwidth parameter values. There are also clear reasons to consider local over global interpolation: in particular, a smaller local interpolation neighborhood appears to ameliorate overprediction error for lower concentration samples.

    my_maxdist = 2
    cv.1.md  <- spcv(df.sp = m.sample, my_idp = 1, var_name = "conc", maxdist = my_maxdist)
    cv.2.md  <- spcv(df.sp = m.sample, my_idp = 2, var_name = "conc", maxdist = my_maxdist)
    cv.4.md  <- spcv(df.sp = m.sample, my_idp = 4, var_name = "conc", maxdist = my_maxdist)
    cv.8.md  <- spcv(df.sp = m.sample, my_idp = 8, var_name = "conc", maxdist = my_maxdist)
    cv.16.md <- spcv(df.sp = m.sample, my_idp = 16, var_name = "conc", maxdist = my_maxdist)
    
    par(mfrow = c(1,5), mar = c(4,4,3,3))
    ymx  = 8; ymn = -8
    plot(cv.1$cv.input$conc, 
         cv.1$cv.error, 
         main = paste("p =", 1, "\nmaxdist = ", my_maxdist), 
         ylim = c(ymn,ymx),
         xlab = "hold-out value",
         ylab = "prediction error")
    abline(0,0, col = "gray")
    points(cv.1.md$cv.input$conc, 
           cv.1.md$cv.error, 
           col = "#0000FF50")
    
    plot(cv.2$cv.input$conc, 
         cv.2$cv.error, 
         main = paste("p =", 2, "\nmaxdist = ", my_maxdist),  
         ylim = c(ymn,ymx),
         xlab = "hold-out value",
         ylab = "prediction error"); 
    abline(0,0, col = "gray")
    points(cv.2.md$cv.input$conc, 
           cv.2.md$cv.error,
           col = "#0000FF50")
    
    plot(cv.4$cv.input$conc, 
         cv.4$cv.error, 
         main = paste("p =", 4, "\nmaxdist = ", my_maxdist), 
         ylim = c(ymn,ymx),
         xlab = "hold-out value",
         ylab = "prediction error"); 
    abline(0,0, col = "gray")
    points(cv.4.md$cv.input$conc, 
           cv.4.md$cv.error, 
           col = "#0000FF50")
    
    plot(cv.8$cv.input$conc, 
         cv.8$cv.error,
         main = paste("p =", 8, "\nmaxdist = ", my_maxdist),  
         ylim = c(ymn,ymx),
         xlab = "hold-out value",
         ylab = "prediction error"); 
    abline(0,0, col = "gray")
    points(cv.8.md$cv.input$conc, 
           cv.8.md$cv.error, 
           col = "#0000FF50")
    
    plot(cv.16$cv.input$conc, 
         cv.16$cv.error, 
         main = paste("p =", 16, "\nmaxdist = ", my_maxdist),  
         ylim = c(ymn,ymx),
         xlab = "hold-out value",
         ylab = "prediction error"); abline(0,0, col = "gray")
    points(cv.16.md$cv.input$conc, 
           cv.16.md$cv.error, 
           col = "#0000FF50")
    

Too much of a good thing?

If things closer together in geographical space tend to be more alike, it’s tempting to push the gas on bandwidth and pump the brakes on neighborhood size (i.e., shrink the neighborhood size while allowing all points within that neighborhood to contribute more strongly to the interpolation estimate). But eventually, I would expect the size of the data set to constrain the efficacy of this strategy for interpolation accuracy. If we shrink the neighborhood size too much, we will be making an estimate from an extremely small sample – accuracy in an average sense may or may not suffer, but I would expect variance to be high.

spcv::spcv() helps us consider this trade-off and predict, for a given data set, when an interpolation neighborhood might be too constrained.

    my_maxdist = 1
    cv.1.md1  <- spcv(df.sp = m.sample, my_idp = 1, var_name = "conc", maxdist = my_maxdist)
    cv.2.md1  <- spcv(df.sp = m.sample, my_idp = 2, var_name = "conc", maxdist = my_maxdist)
    cv.4.md1  <- spcv(df.sp = m.sample, my_idp = 4, var_name = "conc", maxdist = my_maxdist)
    cv.8.md1  <- spcv(df.sp = m.sample, my_idp = 8, var_name = "conc", maxdist = my_maxdist)
    cv.16.md1 <- spcv(df.sp = m.sample, my_idp = 16, var_name = "conc", maxdist = my_maxdist)
    
    par(mfrow = c(1,3), mar = c(4,4,3,3))
    ymx  = 8; ymn = -8
    plot(cv.1$cv.input$conc, 
         cv.1$cv.error, 
         main = paste("p =", 1, "\nmaxdist = ", my_maxdist), 
         ylim = c(ymn,ymx),
         xlab = "hold-out value",
         ylab = "prediction error")
    abline(0,0, col = "gray")
    points(cv.1.md$cv.input$conc, 
           cv.1.md$cv.error, 
           col = "#0000FF50")
    points(cv.1.md1$cv.input$conc, 
           cv.1.md1$cv.error, 
           col = "#FF000050")
    
    plot(cv.2$cv.input$conc, 
         cv.2$cv.error, 
         main = paste("p =", 2, "\nmaxdist = ", my_maxdist),  
         ylim = c(ymn,ymx),
         xlab = "hold-out value",
         ylab = "prediction error"); 
    abline(0,0, col = "gray")
    points(cv.2.md$cv.input$conc, 
           cv.2.md$cv.error, 
           col = "#0000FF50")
    points(cv.2.md1$cv.input$conc, 
           cv.2.md1$cv.error,
           col = "#FF000050")
    suppressWarnings(arrows(cv.2.md$cv.input$conc, 
           cv.2.md$cv.error, 
           cv.2.md1$cv.input$conc, 
           cv.2.md1$cv.error, col ="#FF000050", 
           length = .1) )
    
    plot(cv.4$cv.input$conc, 
         cv.4$cv.error, 
         main = paste("p =", 4, "\nmaxdist = ", my_maxdist), 
         ylim = c(ymn,ymx),
         xlab = "hold-out value",
         ylab = "prediction error"); 
    abline(0,0, col = "gray")
    points(cv.4.md$cv.input$conc, 
           cv.4.md$cv.error, 
           col = "#0000FF50")
    points(cv.4.md1$cv.input$conc, 
           cv.4.md1$cv.error, 
           col = "#FF000050")
    suppressWarnings(arrows(cv.4.md$cv.input$conc, 
           cv.4.md$cv.error, 
           cv.4.md1$cv.input$conc, 
           cv.4.md1$cv.error, col ="#FF000050", 
           length = .1))
    

For this data set, notice how estimates in red (made with a small neighborhood size of 1 unit) seem to reduce the overall interpolation bias but also tend to produce some larger errors compared to the points in blue (estimated with a larger neighborhood size).
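
One quick way to put numbers on this trade-off is to compare the overall cross-validation RMSE across the three neighborhood settings (shown here for p = 2):

    # Overall RMSE: global neighborhood vs. maxdist = 2 vs. maxdist = 1
    c(global = cv.2$cv.rmse, maxdist.2 = cv.2.md$cv.rmse, maxdist.1 = cv.2.md1$cv.rmse)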

spcv as a tuning fork

Why consider only a few possible parameterizations, when spcv allows one to exploit R’s lapply function to quickly visualize cross-validation errors (i.e., a proxy for out-of-sample error) for a whole range of parameter values?

    cv_range = seq(1,8, by=.25)
    cv.t <- lapply(cv_range , function(i) spcv(df.sp = m.sample, 
                                               my_idp = i, 
                                               var_name = "conc"))
    
    par(mfrow = c(1,1), mar = c(4,4,2,2))
    plot(cv_range, sapply(cv.t, function(x) x$cv.rmse), 
         col = "black", 
         type = "p", 
         pch =20, 
         ylab = "RMSE",
         xlab = expression("P"[IDW]))
    
    par(mfrow = c(1,3), mar = c(5,5,5,5))
    plot(cv_range, sapply(cv.t, function(x) x$cv.rmse), 
         col = "black", 
         type = "p", 
         pch =20, 
         ylab = "RMSE",
         xlab = expression("P"[IDW]))
    plot(cv_range, 
         sapply(cv.t, function(x) median(x$cv.error, na.rm = T)),
         ylim = c(-4,4),
         type = "l", 
         ylab = "Median Prediction Error\n90% Prediction Interval",
         xlab = expression("P"[IDW]))
    points(cv_range, sapply(cv.t, function(x) median(x$cv.error, na.rm = T)), col = "black", type = "p", pch =20)
    points(cv_range, sapply(cv.t, function(x) quantile(x$cv.error, .95, na.rm = T)), col = "gray", type = "l")
    points(cv_range, sapply(cv.t, function(x) quantile(x$cv.error, .05, na.rm = T)), col = "gray", type = "l")
    
    plot(cv_range, sapply(cv.t, function(x) sum(x$cv.error,  na.rm = T)),
         col = "black", 
         type = "p", 
         pch =20, 
         ylab = expression("Overall Bias (" ~ Sigma~ CV[error]~")"),
         xlab = expression("P"[IDW]))
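
The same sweep makes it easy to pull out the bandwidth value that minimizes hold-one-out RMSE for this data set:

    # Bandwidth (p) value with the lowest cross-validation RMSE
    cv_range[which.min(sapply(cv.t, function(x) x$cv.rmse))]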
    



