A Tutorial for the `sandwichr` R Package"

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE, fig.cap = " ", fig.path='figs/')

The sandwichr package performs spatial prediction^[Wang, J. F., Haining, R., Liu, T. J., Li, L. F., & Jiang, C. S. (2013). Sandwich estimation for multi-unit reporting on a stratified heterogeneous surface. Environment and Planning A, 45(10), 2515-2534.] based on the spatial stratified heterogeneity (SSH) theory^[Wang, J. F., Haining, R., Liu, T. J., Li, L. F., & Jiang, C. S. (2013). Sandwich estimation for multi-unit reporting on a stratified heterogeneous surface. Environment and Planning A, 45(10), 2515-2534.]. Three inputs are required for this model: sampling, SSH, and reporting layers. By utilizing the known distribution of the population in the SSH layer, this model seeks to estimate the mean value of the sampling attribute and its standard error for each reporting unit. The Sandwich model is still applicable when the spatial dependence is weak because it does not rely on the spatial dependence, provided that the geospatial surface is properly stratified.

First, we need to install the sandwichr package from CRAN (only do this once).

# Install the sandwichr package
# install.packages("sandwichr")

Then use library() to load the installed package into your environment. Remember to load the package every time you use it.

# Import the sandwichr package and other packages
library("sandwichr")
library(ggplot2)
library(ggpubr)
library(dplyr)
library(ape)

Here,we present two case studies to demonstrate the functionality of this package.

Simulated data

In the first case study, prediction is performed on a simulated geospatial surface. The population data is organized as a 20*20 grid. This population is divided into four continuous strata. We generate 41 sampling units at random within the grid. Based on the sample, we would really like to infer the values of seven reporting units that make up the surface. The sampling, SSH, and reporting layers are all in the shapefile format.

Loading data

To successfully run the model, the load.data.shp function is used to convert the shapefiles of sampling, stratification, and reporting layers into a list of sf (simple feature; see Simple Features for R for references) objects for model input. You can specify the directory and names for your input files. It should be noted that these input files must be located in the same directory.

# Input data from shapefiles
sim.sampling.name <- system.file("extdata", "sim.sampling.shp", 
                                package="sandwichr")
sim.ssh.name <- system.file("extdata", "sim.ssh.shp", 
                           package="sandwichr")
sim.reporting.name <- system.file("extdata", "sim.reporting.shp", 
                                 package="sandwichr")

sim.data <- load.data.shp(sampling.file=sim.sampling.name, 
                      ssh.file=sim.ssh.name,
                      reporting.file=sim.reporting.name)

# Sampling
head(sim.data[[1]])
class(sim.data[[1]])
attributes(sim.data[[1]])

# Stratification
head(sim.data[[2]])
class(sim.data[[2]])
attributes(sim.data[[2]])

# Reporting
head(sim.data[[3]])
class(sim.data[[3]])
attributes(sim.data[[3]])

Calculating Moran's I

We use the Moran's I to evaluate the spatial dependence of the simulated data.

sim.dists <- as.matrix(dist(cbind(sim.data[[1]]$Lon, sim.data[[1]]$Lat)))
sim.dists.inv <- 1/sim.dists
diag(sim.dists.inv) <- 0
Moran.I(sim.data[[1]]$Value, sim.dists.inv)

Selection of the stratification layer(s)

The accuracy of SSH-based spatial prediction is determined by the stratification layer. In an ideal stratification layer, values of the target attribute are expected to be homogeneous within each stratum and to differ between the strata. To determine a proper stratification layer for the Sandwich model, the geographical detector model^[Wang, J. F., Li, X. H., Christakos, G., Liao, Y. L., Zhang, T., Gu, X., & Zheng, X. Y. (2010). Geographical detectors-based health risk assessment and its application in the neural tube defects study of the Heshun Region, China. International Journal of Geographical Information Science, 24(1), 107-127.] is applied to quantify the SSH of the target attribute with regard to the candidate stratification(s).

First, we combine the sampling and candidate stratification layers into a single data frame.

# Prepare the stratification layer(s) for evaluation
sim.join <- ssh.data.shp(object=sim.data[[1]], ssh.lyr=sim.data[[2]], ssh.id="X")
head(sim.join)

The factor detector q-statistic in the geographical detector model is applied through the ssh.test function to measure the SSH of the sampling data in terms of different stratifications. This function takes a data frame generated by the ssh.data.shp (or ssh.data.txt, which will be introduced later) function (argument object), where the sampling attribute (argument y) and the strata ID(s) (argument x) must be specified. In this example, the output of ssh.test (q = .81) implies that the candidate stratification layer has a high determinant power over the attribute. Therefore, it will be reasonable to select the stratification layer for subsequent prediction.

# Calculate the geographical detector q-statistic
ssh.test(object=sim.join, y="Value", x=c("X"), test="factor")

We can visualize the mean and standard deviation of the sample in each stratum.

p = ggerrorplot(sim.join, x = "X", y = "Value", 
            desc_stat = "mean_sd", color = "black",
            add = "violin", add.params = list(color = "darkgray")
            )

p + theme(axis.title.x = element_blank())

sim.join %>%                                        
  group_by(X) %>%                         
  summarise_at(vars(Value),              
               list(name = mean))

sim.join %>%
  group_by(X) %>%
  summarise(mean=mean(Value), sd=sd(Value), n=n())

Running the Sandwich mapping model

The Sandwich mapping model is performed using the sandwich.model function, which returns the mean value of the sampling attribute and its standard error for each reporting unit.

# Perform the SSH based spatial prediction
sim.sw <- sandwich.model(object=sim.data, sampling.attr="Value", type="shp")
head(sim.sw$object)
summary(sim.sw)

The interpolated values and standard errors can be visualized using plot.mean and plot.se().

# Plot the estimated mean values and standard errors
ggplot2::autoplot(object=sim.sw)

Model validation

To evaluate the overall accuracy of the Sandwich mapping model, a stratified k-fold cross validation is performed using the sandwich.cv function. A diagnostic statistic called cross-validation estimate (CVE) will be calculated. Here, we present the result of a stratified 5-fold cross validation.

# Perform k-fold cross validation
set.seed(0)
sim.cv <- sandwich.cv(object=sim.data, sampling.attr="Value", k=5, type="shp", ssh.id.col="X")
sim.cv

Breast cancer incidence in mainland China

The second case study interpolates the breast cancer incidence across mainland China based on 242 samples retrieved from the Chinese Cancer Registry Annual Report. The urban-rural classification (either urban or rural areas are typically not connected in space) is used as the stratification layer. County-level divisions are used as the reporting units. In this example, all the input data are in csv format.

Loading data

The load.data.txt function is used to prepare text files into a list of data frames for model input. Two files are needed for this function: one linking sampling and stratification layers, and the other linking reporting and stratification layers.

# Input data from text files
bc.sampling_ssh.name <- system.file("extdata", "bc_sampling_ssh.csv", 
                                package="sandwichr")
bc.reporting_ssh.name <- system.file("extdata", "bc_reporting_ssh.csv", 
                                 package="sandwichr")

bc.data <- load.data.txt(sampling_ssh.file=bc.sampling_ssh.name, 
                         reporting_ssh.file=bc.reporting_ssh.name)

# Sampling-stratification
head(bc.data[[1]])    
class(bc.data[[1]])

# Reporting-stratification
head(bc.data[[2]])    
class(bc.data[[2]])

Calculating Moran's I

We use the Moran's I to evaluate the SAC of the breast cancer incidence.

bc.dists <- as.matrix(dist(cbind(bc.data[[1]]$X, bc.data[[1]]$Y)))
bc.dists.inv <- 1/bc.dists
diag(bc.dists.inv) <- 0
Moran.I(bc.data[[1]]$Incidence, bc.dists.inv)

Selection of the stratification layer(s)

First, we use the ssh.data.txt function to convert the input data for evaluation.

# Prepare the stratification layer for evaluation
bc.join <- ssh.data.txt(object=bc.data)
head(bc.join)

The factor detector q-statistic in the geographical detector model is applied through the ssh.test function to measure the SSH of the sample regarding the urban-rural classification. In this example, the output of ssh.test implies a relatively high level of SSH in the distribution of breast cancer incidence concerning the given stratification (q = .52). Therefore, it is reasonable to select the urban-rural classification as the stratification layer for subsequent prediction.

# Calculate the geographical detector q-statistic
ssh.test(object=bc.join, y="Incidence", x="SSHID", test="factor", type="txt")

We can visualize the mean and standard deviation of the sampled breast cancer incidence in each stratum.

p = ggerrorplot(bc.data[[1]], x = "SSHID", y = "Incidence", 
            desc_stat = "mean_sd", color = "black",
            add = "violin", add.params = list(color = "darkgray")
            )

p + scale_x_discrete(labels=c("1" = "Urban", "2" = "Rural")) + 
  theme(axis.title.x = element_blank()) + labs(y="Breast Cancer Incidence (Rate per 100,000)")

bc.data[[1]] %>%                                        
  group_by(SSHID) %>%                         
  summarise_at(vars(Incidence),              
               list(name = mean))

bc.data[[1]] %>%
  group_by(SSHID) %>%
  summarise(mean=mean(Incidence), sd=sd(Incidence), n=n())

Running the Sandwich mapping model

The Sandwich mapping model is performed using the sandwich.model function, which outputs the mean value of the sampling attribute and its standard error for each reporting unit.

# Perform the SSH based spatial prediction
bc.sw <- sandwich.model(object=bc.data, sampling.attr="Incidence", type="txt", 
                        ssh.id.col="SSHID", ssh.weights=list(c(1,2), c("W1","W2")))
head(bc.sw$object)
summary(bc.sw)

Model validation

To evaluate the overall accuracy of SSH-based spatial prediction, a k-fold cross validation can be performed using the sandwich.cv function. A diagnostic statistic called root mean square error (RMSE) will be calculated. Here, we present the result of a 5-fold cross validation.

# Perform k-fold cross validation
set.seed(0)
bc.cv <- sandwich.cv(object=bc.data, sampling.attr="Incidence", k=5, type="txt", 
                     ssh.id.col="SSHID", reporting.id.col="GBCODE", 
                     ssh.weights=list(c(1,2), c("W1","W2")))
bc.cv


Try the sandwichr package in your browser

Any scripts or data that you put into this service are public.

sandwichr documentation built on April 27, 2023, 1:10 a.m.