knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE, fig.cap = " ", fig.path='figs/')
The sandwichr
package performs spatial prediction^[Wang, J. F., Haining, R., Liu, T. J., Li, L. F., & Jiang, C. S. (2013). Sandwich estimation for multi-unit reporting on a stratified heterogeneous surface. Environment and Planning A, 45(10), 2515-2534.] based on the spatial stratified heterogeneity (SSH) theory^[Wang, J. F., Haining, R., Liu, T. J., Li, L. F., & Jiang, C. S. (2013). Sandwich estimation for multi-unit reporting on a stratified heterogeneous surface. Environment and Planning A, 45(10), 2515-2534.]. Three inputs are required for this model: sampling, SSH, and reporting layers. By utilizing the known distribution of the population in the SSH layer, this model seeks to estimate the mean value of the sampling attribute and its standard error for each reporting unit. The Sandwich model is still applicable when the spatial dependence is weak because it does not rely on the spatial dependence, provided that the geospatial surface is properly stratified.
First, we need to install the sandwichr
package from CRAN (only do this once).
# Install the sandwichr package # install.packages("sandwichr")
Then use library()
to load the installed package into your environment. Remember to load the package every time you use it.
# Import the sandwichr package and other packages library("sandwichr") library(ggplot2) library(ggpubr) library(dplyr) library(ape)
Here,we present two case studies to demonstrate the functionality of this package.
In the first case study, prediction is performed on a simulated geospatial surface. The population data is organized as a 20*20 grid. This population is divided into four continuous strata. We generate 41 sampling units at random within the grid. Based on the sample, we would really like to infer the values of seven reporting units that make up the surface. The sampling, SSH, and reporting layers are all in the shapefile format.
To successfully run the model, the load.data.shp
function is used to convert the shapefiles of sampling, stratification, and reporting layers into a list of sf
(simple feature; see Simple Features for R for references) objects for model input. You can specify the directory and names for your input files. It should be noted that these input files must be located in the same directory.
# Input data from shapefiles sim.sampling.name <- system.file("extdata", "sim.sampling.shp", package="sandwichr") sim.ssh.name <- system.file("extdata", "sim.ssh.shp", package="sandwichr") sim.reporting.name <- system.file("extdata", "sim.reporting.shp", package="sandwichr") sim.data <- load.data.shp(sampling.file=sim.sampling.name, ssh.file=sim.ssh.name, reporting.file=sim.reporting.name) # Sampling head(sim.data[[1]]) class(sim.data[[1]]) attributes(sim.data[[1]]) # Stratification head(sim.data[[2]]) class(sim.data[[2]]) attributes(sim.data[[2]]) # Reporting head(sim.data[[3]]) class(sim.data[[3]]) attributes(sim.data[[3]])
We use the Moran's I to evaluate the spatial dependence of the simulated data.
sim.dists <- as.matrix(dist(cbind(sim.data[[1]]$Lon, sim.data[[1]]$Lat))) sim.dists.inv <- 1/sim.dists diag(sim.dists.inv) <- 0 Moran.I(sim.data[[1]]$Value, sim.dists.inv)
The accuracy of SSH-based spatial prediction is determined by the stratification layer. In an ideal stratification layer, values of the target attribute are expected to be homogeneous within each stratum and to differ between the strata. To determine a proper stratification layer for the Sandwich model, the geographical detector model^[Wang, J. F., Li, X. H., Christakos, G., Liao, Y. L., Zhang, T., Gu, X., & Zheng, X. Y. (2010). Geographical detectors-based health risk assessment and its application in the neural tube defects study of the Heshun Region, China. International Journal of Geographical Information Science, 24(1), 107-127.] is applied to quantify the SSH of the target attribute with regard to the candidate stratification(s).
First, we combine the sampling and candidate stratification layers into a single data frame.
# Prepare the stratification layer(s) for evaluation sim.join <- ssh.data.shp(object=sim.data[[1]], ssh.lyr=sim.data[[2]], ssh.id="X") head(sim.join)
The factor detector q-statistic in the geographical detector model is applied through the ssh.test
function to measure the SSH of the sampling data in terms of different stratifications. This function takes a data frame generated by the ssh.data.shp
(or ssh.data.txt
, which will be introduced later) function (argument object
), where the sampling attribute (argument y
) and the strata ID(s) (argument x
) must be specified. In this example, the output of ssh.test
(q = .81) implies that the candidate stratification layer has a high determinant power over the attribute. Therefore, it will be reasonable to select the stratification layer for subsequent prediction.
# Calculate the geographical detector q-statistic ssh.test(object=sim.join, y="Value", x=c("X"), test="factor")
We can visualize the mean and standard deviation of the sample in each stratum.
p = ggerrorplot(sim.join, x = "X", y = "Value", desc_stat = "mean_sd", color = "black", add = "violin", add.params = list(color = "darkgray") ) p + theme(axis.title.x = element_blank()) sim.join %>% group_by(X) %>% summarise_at(vars(Value), list(name = mean)) sim.join %>% group_by(X) %>% summarise(mean=mean(Value), sd=sd(Value), n=n())
The Sandwich mapping model is performed using the sandwich.model
function, which returns the mean value of the sampling attribute and its standard error for each reporting unit.
# Perform the SSH based spatial prediction sim.sw <- sandwich.model(object=sim.data, sampling.attr="Value", type="shp") head(sim.sw$object) summary(sim.sw)
The interpolated values and standard errors can be visualized using plot.mean
and plot.se()
.
# Plot the estimated mean values and standard errors ggplot2::autoplot(object=sim.sw)
To evaluate the overall accuracy of the Sandwich mapping model, a stratified k-fold cross validation is performed using the sandwich.cv
function. A diagnostic statistic called cross-validation estimate (CVE) will be calculated. Here, we present the result of a stratified 5-fold cross validation.
# Perform k-fold cross validation set.seed(0) sim.cv <- sandwich.cv(object=sim.data, sampling.attr="Value", k=5, type="shp", ssh.id.col="X") sim.cv
The second case study interpolates the breast cancer incidence across mainland China based on 242 samples retrieved from the Chinese Cancer Registry Annual Report. The urban-rural classification (either urban or rural areas are typically not connected in space) is used as the stratification layer. County-level divisions are used as the reporting units. In this example, all the input data are in csv format.
The load.data.txt
function is used to prepare text files into a list of data frames for model input. Two files are needed for this function: one linking sampling and stratification layers, and the other linking reporting and stratification layers.
# Input data from text files bc.sampling_ssh.name <- system.file("extdata", "bc_sampling_ssh.csv", package="sandwichr") bc.reporting_ssh.name <- system.file("extdata", "bc_reporting_ssh.csv", package="sandwichr") bc.data <- load.data.txt(sampling_ssh.file=bc.sampling_ssh.name, reporting_ssh.file=bc.reporting_ssh.name) # Sampling-stratification head(bc.data[[1]]) class(bc.data[[1]]) # Reporting-stratification head(bc.data[[2]]) class(bc.data[[2]])
We use the Moran's I to evaluate the SAC of the breast cancer incidence.
bc.dists <- as.matrix(dist(cbind(bc.data[[1]]$X, bc.data[[1]]$Y))) bc.dists.inv <- 1/bc.dists diag(bc.dists.inv) <- 0 Moran.I(bc.data[[1]]$Incidence, bc.dists.inv)
First, we use the ssh.data.txt
function to convert the input data for evaluation.
# Prepare the stratification layer for evaluation bc.join <- ssh.data.txt(object=bc.data) head(bc.join)
The factor detector q-statistic in the geographical detector model is applied through the ssh.test
function to measure the SSH of the sample regarding the urban-rural classification. In this example, the output of ssh.test
implies a relatively high level of SSH in the distribution of breast cancer incidence concerning the given stratification (q = .52). Therefore, it is reasonable to select the urban-rural classification as the stratification layer for subsequent prediction.
# Calculate the geographical detector q-statistic ssh.test(object=bc.join, y="Incidence", x="SSHID", test="factor", type="txt")
We can visualize the mean and standard deviation of the sampled breast cancer incidence in each stratum.
p = ggerrorplot(bc.data[[1]], x = "SSHID", y = "Incidence", desc_stat = "mean_sd", color = "black", add = "violin", add.params = list(color = "darkgray") ) p + scale_x_discrete(labels=c("1" = "Urban", "2" = "Rural")) + theme(axis.title.x = element_blank()) + labs(y="Breast Cancer Incidence (Rate per 100,000)") bc.data[[1]] %>% group_by(SSHID) %>% summarise_at(vars(Incidence), list(name = mean)) bc.data[[1]] %>% group_by(SSHID) %>% summarise(mean=mean(Incidence), sd=sd(Incidence), n=n())
The Sandwich mapping model is performed using the sandwich.model
function, which outputs the mean value of the sampling attribute and its standard error for each reporting unit.
# Perform the SSH based spatial prediction bc.sw <- sandwich.model(object=bc.data, sampling.attr="Incidence", type="txt", ssh.id.col="SSHID", ssh.weights=list(c(1,2), c("W1","W2"))) head(bc.sw$object) summary(bc.sw)
To evaluate the overall accuracy of SSH-based spatial prediction, a k-fold cross validation can be performed using the sandwich.cv
function. A diagnostic statistic called root mean square error (RMSE) will be calculated. Here, we present the result of a 5-fold cross validation.
# Perform k-fold cross validation set.seed(0) bc.cv <- sandwich.cv(object=bc.data, sampling.attr="Incidence", k=5, type="txt", ssh.id.col="SSHID", reporting.id.col="GBCODE", ssh.weights=list(c(1,2), c("W1","W2"))) bc.cv
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.