Intro to spind"
In spind: Spatial Methods and Indices

spind is a package dedicated to removing the spectre of spatial autocorrelation in your spatial models. It contains many of the tools you need to make predictions, assess model performance, and conduct multimodel inference for 2-D gridded data sets using methods that are robust to spatial autocorrelation.

The theory underlying the use of GEEs, WRMs, and many of the other tools in this package is covered elsewhere in the literature, and for the purposes of this vignette, we assume that you have already read those papers. If you haven't, citations are included in the footnotes of this vignette as well as in the documentation of each function. We also assume that you have a working knowledge of how to use R. This vignette will focus on demonstrating how to utilize this package to create a robust model from spatially referenced data and subsequently assess its accuracy. Along the way, we will use a couple different data sets to examine how these functions work and investigate how one might use them to create a spatially robust model. This particular demonstration will focus on species distribution models (hereafter referred to as SDMs), but this general framework can be applied to any data set that is spatially structured (e.g. economic, sociological).

Generalized Estimating Equations (GEEs) for species distribution modeling [^1]

This package utilizes the functions already written for GEEs from the packages gee[^2] and geepack[^3] and adapts them for easy use in the context of an SDM. Let's start with a fairly simple GEE using the simulated musdata data set included in the package. Before we get started though, note that GEE requires that predictor variables be continuous variables.

[^1]: Carl G & Kuehn I, 2007. Analyzing Spatial Autocorrelation in Species Distributions using Gaussian and Logit Models, Ecol. Model. 207, 159 - 170]

[^2]: Carey, V. J., 2006. Ported to R by Thomas Lumley (versions 3.13, 4.4, version 4.13)., B. R. gee: Generalized Estimation Equation solver. R package version 4.13-11.

[^3]: Yan, J., 2004. geepack: Generalized Estimating Equation Package. R package version 0.2.10.

library(spind)
library(ggplot2)

knitr::opts_chunk$set(collapse = TRUE, comment = "#>")

data(musdata)
data(carlinadata)

# Examine the structure to familiarize yourself with the data
?musdata
head(musdata)

?carlinadata
head(carlinadata)

# Next, fit a simple GEE and view the output
coords <- musdata[ ,4:5]

mgee <- GEE(musculus ~ pollution + exposure, family = "poisson", data = musdata,
            coord = coords, corstr = "fixed", scale.fix = FALSE)

summary(mgee, printAutoCorPars = TRUE)

plot(mgee)

predictions <- predict(mgee, newdata = musdata)

# you can modify the plot itself by extracting it from the plot object and
# treating it as any other ggplot object.

library(ggplot2)
my_plot <- mgee$plot

# more of a base-R graphic feel
my_plot + 
  theme(plot.background = element_rect(fill = NA,
                                       color = 'black',
                                       size = 1.25))

As you can see, this package includes methods for plot, summary and predict. These are useful in evaluating model fit and autocorrelation of residuals compared to a non-spatial model (in this case, a GLM with the same family as the GEE). The plot.GEE method generates a plot of the autocorrelation of the residuals from the fitted GEE and a GLM of the same family for comparison. Note that a QIC (Quasi-information criterion) score is reported as opposed to AIC. This is calculated based on the method described in Hardin & Hilbe[^4]^,^[^5] and is implemented using the function qic.calc.

Also note that trying to fit GEEs with corstr = "fixed" to large data sets (i.e. number of observations is approximately sqrt(.Machine$integer.max)) will produce errors as the resulting variance-covariance matrices will be too large to be handled in R (you may well run into problems before this point depending on how much RAM you have available). This is where fitting clustered models can come in handy, as they work with smaller, more manageable matrices. These can be specified by changing the corstr to either "quadratic" or "exchangeable".

[^4]: Hardin, J.W. & Hilbe, J.M. (2003) Generalized Estimating Equations. Chapman and Hall, New York.

[^5]: Barnett et al. Methods in Ecology & Evolution 2010, 1, 15-24.

Wavelet Revised Models (WRMs) [^6]

Next, we'll examine the other main model that is introduced in this package - the Wavelet Revised Model. These are implemented using wavelet transforms from the waveslim package.[^7] Let's start with a fairly simple WRM using the same musdata data set as above. As with GEE, WRM also requires that predictor variables are continuous.

[^6]: Carl, G., Kuehn, I. (2010): A wavelet-based extension of generalized linear models to remove the effect of spatial autocorrelation. Geographical Analysis 42 (3), 323 - 337

[^7]: Whitcher, B. (2005) Waveslim: basic wavelet routines for one-, two- and three-dimensional signal processing. R package version 1.5.

mwrm <- WRM(musculus ~ pollution + exposure, family = "poisson",
            data = musdata, coord = coords, level = 1)

plot(mwrm)

summary(mwrm)

predictions <- predict(mwrm, newdata = musdata)

WRM has many of the same features as GEE. The plot.WRM allows you to visually examine the autocorrelation of residuals from a GLM of the same error family as your WRM. Methods for predict and summary allow you to examine outputs from the model using the same code as you might use for a GLM. However, note that this reports an AIC score, rather than a QIC score as in the GEE. You can extract and modify the ggplot output from WRM as well using the plot slot in the WRM class.

Other features specific to WRMs

WRM has a number of other model-specific functions that you may find useful in diagnosing model fit and understanding your results. For example, you might want to plot the variance and/or covariance of each of your variables as a function of level. The covar.plot function allows you to visually examine the wavelet relationships from your model. You can select whether to plot the variance or covariance using plot argument which accepts "var" or "covar" as inputs. Note that we are going to switch to the carlinadata data set now.

coords <- carlinadata[ ,4:5]

wave_covariance <- covar.plot(carlina.horrida ~ aridity + land.use - 1,
                              data = carlinadata, coord = coords, wavelet = "d4",
                              wtrafo = 'modwt', plot = 'covar')

wave_variance <- covar.plot(carlina.horrida ~ aridity + land.use - 1,
                            data = carlinadata, coord = coords, wavelet = "d4",
                            wtrafo = 'modwt', plot = 'var')

wave_variance$result
wave_covariance$result

# view plots side by side

library(gridExtra)
grid.arrange(wave_variance$plot, wave_covariance$plot)

You may also want to view the smooth components of your wavelets at different scales. For this, we offer the upscale function. upscale allows you to visually examine your data at a number of different levels of scale which controls the resolution of the grid cells in your data set.[^8] It also offers the option to adjust padding settings so you can see how that influences your smooth components. The default padding level is the mean value of your input vector, but it can be easily switched using the pad argument, which works the same way as in the other WRM functions. A quick example below using carlinadata data set.

NOTE: The color.maps argument is soft-deprecated from spind >= v2.2.

[^8]: Carl G, Doktor D, Schweiger O, Kuehn I (2016) Assessing relative variable importance across different spatial scales: a two-dimensional wavelet analysis. Journal of Biogeography 43: 2502-2512.

upscale(carlinadata$land.use, coord = coords,
        pad = mean(carlinadata$land.use))

Multi-model inference with GEEs and WRMs

spind provides a couple of frameworks for conducting multi-model inference analyses and some helper functions to assist you when examining the results. The first that we'll examine here is the step.spind function, which implements step-wise model selection. The process is loosely based on MASS::stepAIC and stats::step, but is specific to classes GEE and WRM. For GEEs, step.spind uses models with the lowest QIC scores to determine what the next step will be. For WRMs, you have the option of using AIC or AICc (AIC corrected for small sample sizes) using the logical AICc argument.

Currently, the function only supports backwards model selection. In other words, you have to start with all of the variables in your model formula and remove them in a stepwise fashion. We hope to add forward model selection methods shortly. Additionally, step.spind is written to always respect the hierarchy of variables in the model and currently the user cannot directly override this. For example, step.spind would not remove race while retaining I(race^2). We may change that in the future, but it will remain like this at least until the next major release. Currently, it recognizes polynomial variables by matching variable names located inside of I(var^some_power) and interaction terms by searching for var1:var2 in the model terms. If you want to use a higher order polynomial variable and are not worried about the variable hierarchy, you can create a separate variable (i.e. race_2) and use that in the model.

We'll go through an example of step.spind using a GEE on the birthwt data set in the MASS package below. The data in birthwt aren't at all related to SDMs and are not spatially structured, but we hope that in using this data set, we will demonstrate how this function can work with many types of data sets.

# For demonstration only. We are artificially imposing a grid structure
# on data that is not actually spatial data

library(MASS)
data(birthwt)


x <- rep(1:14, 14)
y <- as.integer(gl(14, 14))
coords <- cbind(x[-(190:196)], y[-(190:196)])

formula <- formula(low ~ age + lwt + race + smoke + ftv +  bwt + I(race^2))

mgee <- GEE(formula, family = "gaussian", data = birthwt,
            coord = coords, corstr = "fixed",scale.fix = TRUE)

mwrm <- WRM(formula, family = "gaussian", data = birthwt,
          coord = coords, level = 1)

ssgee <- step.spind(mgee, birthwt)
sswrm <- step.spind(mwrm, birthwt, AICc = TRUE)

best.mgee <- GEE(ssgee$model, family = "gaussian", data = birthwt,
                 coord = coords, corstr = "fixed",scale.fix = TRUE)

best.wrm <- WRM(sswrm$model, family = "gaussian", data = birthwt,
                coord = coords, level = 1)

summary(best.mgee, printAutoCorPars = FALSE)
summary(best.wrm)

Additionally, we offer multimodel inference tools for GEEs and WRMs which are loosely based on the MuMIn package. These are implemented in mmiWMRR and mmiGEE. They enable you to examine the effect that the grid resolution and variable selection have on the resulting regressions, and then select the appropriate model for subsequent analyses. Note that mmiWMRR has two more arguments than mmiGEE that must be specified.

# Example for WRMs
data(carlinadata)
coords <- carlinadata[ ,4:5]

wrm <- WRM(carlina.horrida ~ aridity + land.use, family = "poisson",
           data = carlinadata, coord = coords, level = 1, wavelet = "d4")

ms1 <- scaleWMRR(carlina.horrida ~ aridity + land.use, family = "poisson",
                 data = carlinadata, coord = coords, scale = 1,
                 wavelet = 'd4', trace = FALSE)

mmi <- mmiWMRR(wrm, data = carlinadata, scale = 1, detail = TRUE)

# Example for GEEs
library(MASS)
data(birthwt)

# impose an artificial (not fully appropriate) grid structure
x <- rep(1:14, 14)
y <- as.integer(gl(14, 14))
coords <- cbind(x[-(190:196)], y[-(190:196)])

formula <- formula(low ~ race + smoke +  bwt)

mgee <- GEE(formula, family = "gaussian", data = birthwt,
            coord = coords, corstr = "fixed", scale.fix = TRUE)

mmi <- mmiGEE(mgee, birthwt)

Finally, we offer one further model selection procedure specific to WRMs. rvi.plot uses mmiWMRR and creates a plot of the relative importance of each explanatory variable as a function of the resolution of the grid (in other words, as a function of the scale argument in mmiWMRR). It will also print the resulting model selection tables to the console.

data(carlinadata)
coords <- carlinadata[ ,4:5]

rvi <- rvi.plot(carlina.horrida ~ aridity + land.use, family = "poisson",
                data = carlinadata, coord = coords, maxlevel = 4, 
                detail = TRUE, wavelet = "d4")

rvi$rvi
rvi$plot

Goodness of fit and model performance

You may also find that a model not implemented by this package works best for your data. We've implemented some spatially corrected accuracy measures that you can use to assess goodness of model fit. The first two of these are categorized according to whether or not their outputs are dependent on the chosen threshold and first appeared in spind v1.0 [^9]. th.dep (threshold dependent) and th.indep (threshold independent) are designed to work on any number of model types, all you need is a set of actual values, predictions, and their associated coordinates. We'll use the hook data set to see how these work.

[^9]: Carl G, Kuehn I (2017) Spind: a package for computing spatially corrected accuracy measures. Ecography 40: 675-682. doi: 10.1111/ecog.02593

data(hook)

# Familiarize yourself with the data
?hook
head(hook)

df <- hook[ ,1:2]
coords <- hook[ ,3:4]

# Threshold dependent metrics
th.dep.indices <- th.dep(data = df, coord = coords, spatial = TRUE)

# Confusion Matrix
th.dep.indices$cm

# Kappa statistic
th.dep.indices$kappa

# Threshold independent metrics
th.indep.indices <- th.indep(data = df, coord = coords, 
                             spatial = TRUE)

# AUC
th.indep.indices$AUC

# TSS
th.indep.indices$TSS

# AUC plot
th.indep.indices$plot

Finally, many analyses require a calculation of spatial autocorrelation. To that end, we include the function acfft (AutoCorrelation Fast Fourier Transform) to calculate spatial autocorrelation using Moran's I statistic. While many other packages include functions to perform this analysis, ours provides improved efficiency by harnessing the power of fast Fourier transforms which reduce the time needed to compute the statistic. A quick example below using a GLM and the musdata data set.

coords <- musdata[ ,4:5]
mglm <- glm(musculus ~ pollution + exposure, family = "poisson",
            data = musdata)

ac <- acfft(coords, resid(mglm, type = "pearson"),
            lim1 = 0, lim2 = 1, dmax = 10)
ac

Note that you can adjust the number of distance bins to examine in acfft using the dmax argument. The default is 10. Finally, you can adjust the size of the distance bins by adjusting the values of lim1 and lim2 to meet your needs.

Wrapping up

Hopefully, you are now ready to utilize GEEs and WRMs to conquer the world of spatial modeling. However, if this vignette has not served its purpose and you still have questions about how to use these tools, please let us know. Of course, no package is complete without bugs and we are always trying to improve our code. If you find any bugs that need squashing, have suggestions for additional functionality or improvements to existing functionality (or this vignette), please don't hesitate to contact us[^10].

[^10]: Contact email - levisc8@gmail.com or visit the Github repo and create an issue at https://github.com/levisc8/spind/issues.