In lcgodoy/smile: Spatial Misalignment: Interpolation, Linkage, and Estimation

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(smile)
library(ggplot2)
library(sf)

Spatial data often needs to be analyzed at different scales or resolutions. This package helps you align disparate spatial data using two approaches:

Model-based: This approach is based on the assumption that areal data (e.g., variables measured at census tracts) are composed by averages of a continuous underlying process [@gelfand2001change]. In particular, we assume this underlying process to be a Gaussian Process [@moraga2017geostatistical, @johnson2020dealing; @wilson2020pointless].
Non-parametric spatial interpolation: This simpler method estimates values at a new spatial support based on the area of intersection between areal units. If you are interested in the non-parametric approach, visit our vignette on the topic.

This vignette focuses on the model-based approach for areal data (like, for instance, census tract data). This method involves converting sf[@pebesma2018simple] objects (a common format for spatial data in R) to the spm format used by this package. We'll use the ggplot2[@wickham2011ggplot2] package for visualization.

To demonstrate this conversion, we'll use life expectancy at birth (LEB) and the index of multiple deprivation (IMD) data for Liverpool from @johnson2020dealing. This data is available at different spatial resolutions (MSOA and LSOA). See Figure \ref{fig:leb-msoa} for a visualization of life expectancy at the MSOA level.

data(liv_lsoa) # loading the LSOA data
data(liv_msoa) # loading the MSOA data

## workaround for compatibility with different PROJ versions
st_crs(liv_msoa) <-
    st_crs(liv_msoa)$input
st_crs(liv_lsoa) <-
    st_crs(liv_lsoa)$input

```{R read-data, fig.cap="LEB in Liverpool at the MSOA."} ggplot(data = liv_msoa, aes(fill = leb_est)) + geom_sf(color = "black", lwd = .1) + scale_fill_viridis_b(option = "H") + theme_minimal()

<!-- We consider, in a first moment, that this random variable is normally -->
<!-- distributed with a constant mean and a certain (spatial) covariance -->
<!-- matrix. Now, let $A_1$ and $A_2$ be two distinct MSOA's. Under the --> <!--
before mentioned assumption that the observed random variable is driven --> <!--
by an underlying Gaussian random field with isotropic covariance --> <!--
function $C(\lVert x - y \rVert \, ; \, \theta)$, where $\lVert x - y --> <!--
\rVert$ is the distance between two points in the coordinates $x$ and --> <!--
$y$, respectively. Then, the covariance between the (averaged) variable --> <!--
observed at $A_1$ and $A_2$ is define as \[ \Sigma_{12} = Cov(A_1, A_2) --> <!--
= \frac{1}{\lvert A_1 \rvert \lvert A_2 \rvert} \int_{A_1} \int_{A_2} --> <!--
C(\lVert x - y \rVert; \theta) \, dy \, dx.  \] In order to approximate --> <!--
the continuous surface on which these covariances have to be computed, --> <!--
we generate a fine grid over the whole study region (or a fine grid --> <!--
within each MSOA) and approximate such covariance by numerical integration. -->

To analyze the relationship between life expectancy (LEB) and deprivation at the
LSOA level, we need to estimate LEB at this higher resolution. We assume LEB
follows a continuous spatial pattern represented by a Gaussian process.


Mathematically, the observed LEB in an areal resolution (e.g., an MSOA) as
averages of a continuous underlying process across that area. If we knew the
exact parameters of this process, we could calculate the LEB for any
location. However, these calculations involve complex integrals that are
difficult to solve analytically. For more details on the theory behind this, see
this [vignette](https://lcgodoy.me/smile/articles/theory.html).


The `sf_to_spm` function controls how the model-based approach will approximate
these integrals. It supports either numerical or Monte Carlo integration. Here's
how different parameters of this function change the integration method:

* `type`: Controls the grid used for approximating the integrals. Options
  include "regular", "hexagonal" (both for numerical integration), and "random"
  (for Monte Carlo integration).

* `npts`: Specifies the number of grid points used to approximate the
  integral. This can be a single value or a vector with different values for
  each area. More points generally improve accuracy but increase computation
  time.

* `by_polygon`: Determines if the precision of the approximation should vary by
  polygon size.
  * `TRUE`: All integrals are estimated with the same precision.
  * `FALSE`: A grid of points is generated over the entire region, and points
    are assigned to the areas they fall within. This results in more points and
    higher accuracy for larger polygons.

This approach allows us to estimate LEB at the LSOA level while accounting for
the underlying spatial structure of the data.


The code below converts the `liv_msoa` object (in `sf` format) to an `spm`
object. We generate a grid of 1000 points across Liverpool and assign each
point to its corresponding area.
```{R sf-to-spm1}
msoa_spm <-
    sf_to_spm(sf_obj = liv_msoa, n_pts = 1000,
              type = "regular", by_polygon = FALSE,
              poly_ids = "msoa11cd", var_ids = "leb_est")

Here's what the additional arguments of the sf_to_spm function do:

poly_ids: Specifies the column in liv_msoa that contains the unique identifier for each area (e.g., an area ID).
var_ids: Indicates the column containing the "response variable" – the variable we want to model (e.g., life expectancy).

This conversion prepares the data for spatial analysis using the smile package.

For the sake of comparison, the Figure below displays the grids associated with each of the possible combinations of the parameters type and by_polygon when calling the sf_to_spm function. ```{R sf-to-spm2, echo = FALSE, warning = FALSE, message = FALSE, fig.cap="Panel illustrating the grids generated for different approaches to approximate numerical integrals in the model-based approach."} set.seed(123)

msoa_spm1 <- sf_to_spm(sf_obj = liv_msoa, n_pts = 305, type = "random", by_polygon = FALSE, poly_ids = "msoa11cd", var_ids = "leb_est")

msoa_spm2 <- sf_to_spm(sf_obj = liv_msoa, n_pts = 305, type = "regular", by_polygon = FALSE, poly_ids = "msoa11cd", var_ids = "leb_est")

msoa_spm3 <- sf_to_spm(sf_obj = liv_msoa, n_pts = 305, type = "hexagonal", by_polygon = FALSE, poly_ids = "msoa11cd", var_ids = "leb_est")

msoa_spm4 <- sf_to_spm(sf_obj = liv_msoa, n_pts = 305/61, type = "random", by_polygon = TRUE, poly_ids = "msoa11cd", var_ids = "leb_est")

msoa_spm5 <- sf_to_spm(sf_obj = liv_msoa, n_pts = 305/61, type = "regular", by_polygon = TRUE, poly_ids = "msoa11cd", var_ids = "leb_est")

msoa_spm6 <- sf_to_spm(sf_obj = liv_msoa, n_pts = 305/61, type = "hexagonal", by_polygon = TRUE, poly_ids = "msoa11cd", var_ids = "leb_est")

to_plot <- rbind( transform(msoa_spm1$grid, type = "random", by_polygon = FALSE), transform(msoa_spm2$grid, type = "regular", by_polygon = FALSE), transform(msoa_spm3$grid, type = "hexagonal", by_polygon = FALSE), transform(msoa_spm4$grid, type = "random", by_polygon = TRUE), transform(msoa_spm5$grid, type = "regular", by_polygon = TRUE), transform(msoa_spm6$grid, type = "hexagonal", by_polygon = TRUE), deparse.level = 1 )

rm(list = ls(pattern = "^msoa_spm")); invisible(gc(verbose = FALSE, full = TRUE))

ggplot(data = to_plot, aes(color = msoa11cd)) + guides(color = "none") + geom_sf(pch = 15) + scale_color_viridis_d(option = "H") + facet_grid(by_polygon ~ type, labeller = label_both) + theme_bw() + theme(axis.text = element_blank()) ```

For details on fitting models and making predictions, see this vignette.