In ecological or environmental surveys, it is often desired to predict the mean or total of a variable in some finite region. However, because of time and money constraints, sampling the entire region is often unfeasible. The purpose of the sptotal
R
package is to provide software that gives a prediction for a quantity of interest, such as a total, and an associated standard error for the prediction. The predictor, referred to as the Finite-Population-Block-Kriging (FPBK) predictor in the literature [@ver_hoef_spatial_2008], incorporates possible spatial correlation in the data and also incorporates an appropriate variance reduction for sampling from a finite population.
In the remainder of the paper, we give an overview of both the background of the method and of the sptotal
package.
sptotal
provides an implementation of the Finite Population Block Kriging (FPBK) methods developed in @ver2002sampling and @ver_hoef_spatial_2008. Next we provide a short overview of FPBK.
Suppose that we have a response variable $Y(\mathbf{s}{i})$, $i = 1, 2, \ldots, N$, where the vector $\mathbf{s}_i$ contains the coordinates for the $i^{th}$ spatial location and $N$ is a finite number of spatial locations. Then $\mathbf{y}$, an $N$-length column vector of the $Y(\mathbf{s}{i})$, can be modeled with a spatial linear model \mbox{} \begin{equation} \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}, \end{equation} \noindent where $\mathbf{X}$ is a design matrix for the fixed effects and $\boldsymbol{\beta}$ is a parameter vector of fixed effects. The vector of random errors follows a multivariate normal distribution with a mean vector of $\mathbf{0}$ and a covariance of \mbox{} \begin{equation} \text{var}(\boldsymbol{\epsilon}) = \sigma^2 \mathbf{R} + \tau^2 \mathbf{I}, \end{equation} \noindent where $\sigma^2$ is the spatial dependent error variance (commonly called the partial sill), $\tau^2$ is the spatial independent error variance (commonly called the nugget), and $\mathbf{I}$ is the identity matrix. The $i^{th}$ row and $j^{th}$ column of the $N \times N$ spatial correlation matrix $\mathbf{R}$ contains the correlation between the random error of the $i^{th}$ spatial location, $\epsilon_i$, and the random error of the $j^{th}$ spatial location, $\epsilon_j$. A common model used to generate $\mathbf{R}$ is the exponential correlation function [@cressie2015statistics].
FPBK predicts some linear function of the response, $f(\mathbf{y}) = \mathbf{b}^\prime \mathbf{y}$, where $\mathbf{b}$ is an $N$-length column vector of weights. A common vector of weights is a vector of 1's so that the resulting prediction is for the total abundance across all sites. If only some of the values in $\mathbf{y}$ are observed, then the sptotal
package can be used to find the the Best Linear Unbiased Predictor (BLUP) for $\mathbf{b}^\prime \mathbf{y}$, referred to as the FPBK predictor, along with its prediction variance.
The primary functions in the sptotal
package are described in the following section. In short, the FPBK method is implemented in sptotal
's predict()
generic function, which is used on a spatial model that is fit with sptotal::slmfit()
.
Before discussing comparable methods and R
packages, we show how the main functions in sptotal
can be used on a real data set to predict total abundance of moose in a region of Alaska. We use the AKmoose_df
data in the sptotal
package, provided by the Alaska Department of Fish and Game.
library(sptotal) data("AKmoose_df")
The data contains a response variable total
, x-coordinate centroid variable x
, y-coordinate centroid variable y
, and covariates elev_mean
(the elevation) and strat
(a stratification variable). There are a total of 860 rows of unique spatial locations. Locations that were not surveyed have an NA
value for total
.
The two primary functions in sptotal
are slmfit()
, which fits a spatial linear model, and predict.slmfit()
, which uses FPBK to predict a quantity of interest (such as a mean or total) using a fitted slmfit
object. slmfit()
has required arguments formula
, data
, xcoordcol
, and ycoordcol
. If data
is a simple features object from the sf
[@pebesma2018simple] package, then xcoordcol
and ycoordcol
are not required. The CorModel
argument is the correlation model used for the errors.
moose_mod <- slmfit(total ~ elev_mean + strat, data = AKmoose_df, xcoordcol = "x", ycoordcol = "y", CorModel = "Exponential") summary(moose_mod)
With the summary()
generic, we obtain output similar to the summary output of a linear model fit with lm()
, as well as a table of fitted covariance parameter estimates. Next, we use predict()
to implement FPBK and obtain a prediction for the total abundance across all spatial locations, along with a standard error for the prediction. By default, predict()
gives a prediction for total abundance, though the default can be modified by specifying a column of prediction weights for the vector $\mathbf{b}$ with the wtscol
argument.
predict(moose_mod)
The output of printed predict()
gives a table of prediction information, including the Prediction
(a total abundance of 1610 moose, in this example), the SE
(Standard Error) of the prediction, and bounds for a prediction interval (with a nominal level of 90% by default). Additionally, some summary information about the data set used is given.
AKmoose_df$wts <- c(rep(0, nrow(AKmoose_df) - 300), rep(1, 300)) moose_mod_wts <- slmfit(total ~ elev_mean + strat, data = AKmoose_df, xcoordcol = "x", ycoordcol = "y", CorModel = "Exponential") predict(moose_mod_wts, wtscol = "wts")
sptotal
also provides many helper generic functions for spatial linear models. The structure of the arguments and of the output of these generics often mirrors that of the generics used for base R
linear models fit with lm()
. Examples (applied to the moose_mod
object) include AIC(moose_mod)
, coef(moose_mod)
, fitted(moose_mod)
, plot(moose_mod)
, and residuals(moose_mod)
.
Design-based analysis and k-nearest neighbors [@fix1985discriminatory] are two approaches that can be used to compute a mean or total in a finite population. @dumelle2022comparison provide an overview of design-based spatial analysis and FPBK, showing that FPBK often outperforms the design-based analysis. @ver2013comparison show that FPBK often outperforms k-nearest-neighbors and highlight that quantifying uncertainty is much more challenging with k-nearest-neighbors.
Note that there are many spatial packages in R
that can be used to predict values at unobserved locations, including gstat
[@pebesma2004multivariable], geoR
[@ribeiro2007geor], and spmodel
[@dumelle2023spmodel], among others. What sptotal
contributes is the ability to obtain the appropriate variance of a linear combination of predicted values that incorporates a variance reduction when sampling from a finite number of sampling units.
@dumelle2022comparison used the sptotal
package to compare model-based and design-based approaches for analysis of spatial data. Currently, a Shiny
app is in development at the Alaska Department of Fish and Game that uses sptotal
to predict abundance from moose surveys conducted in Alaska.
The views expressed in this article are those of the author(s) and do not necessarily represent the views or policies of the U.S. Environmental Protection Agency. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.