In artaylor85/pixelate: Pixelate spatial predictions as per their average uncertainty

knitr::opts_chunk$set(
  message = FALSE, # Switch of all messages
  collapse = TRUE,
  comment = "#> ", # To match console 
  fig.align = "center", 
  fig.height = 5, fig.width = 7, 
  out.width = "100%",
  dev = 'png', dpi = 300
)

Introduction

The R package pixelate centres around a single function, pixelate, designed to pixelate spatial predictions as per their average uncertainty.

In brief, the function pixelate groups predictions into a specified number of large pixels; computes the average uncertainty within each large pixel; then, for each large pixel, depending on its average uncertainty, either averages the predictions across it or across smaller pixels nested within it. These averaged predictions are then plotted.

The resulting plot of averaged predictions is selectively pixelated, similar to a photo that is deliberately pixelated to disguise a person's identity. Areas of high average uncertainty in the pixelated plot are unresolved, while areas with high average certainty are resolved, similar to information poor versus rich regions of a satellite map.

For full details of the function pixelate see its documentation accessed by ?pixelate or help(pixelate), its code accessed by pixelate and the code for the functions on which it depends, which can be found in pixelate_internal_functions.R under the R/ directory of the source package, available online github.com/artaylor85/pixelate/R/.

The function pixelate supports spatially continuous predictions (e.g. those generated by a geostatistical @diggle1998 or geospatial model @hengl2018) on any topic providing that they have associated measures of uncertainty. Its output could be plotted using any of the various geographic information systems available. We illustrate its utility using predicted Plasmodium falciparum incidence @weiss2019 and plot its output in R using functions from the ggplot2 package @ggplot2.

As the warning generated by pixelate in the quick example below states, there are issues around averaging uncertainty generated by a so-called 'per-pixel' simulation @gething:10, where 'per-pixel' is synonymous with 'per-prediction' (whereas, in pixelate, a pixel refers to a collection of one or more predictions). This is a problem with the input to pixelate not the basis of pixelate. Predicted P. falciparum incidence is based on a posterior predictive simulation that approximates the joint posterior predictive distribution thus accounting for spatial covariance and enabling averaging of uncertainty [@gething:10; @weiss2019].

While developing this R package, we became aware of an identically named function provided by the Vizumap package @vizumap. Vizumap::pixelate is designed to pixelate choropleth maps @lucchesi2017, whereas pixelate::pixelate is designed to pixelate spatially continuous predictions [@diggle1998; @hengl2018]. Within each region of the cloropleth map, Vizumap::pixelate works by randomly assigning each pixel a value sampled from the margin of error around the corresponding regional estimate (pixels are equally-sized across the entire map); see the vignette of Vizumap and @lucchesi2017 for more details. Another interesting and related article that discusses communicating uncertainty in general highlights the utility of exceedance probabilities (e.g. the probability that some prediction exceeds a threshold) @kuhnert2018. Exceedance probabilities (EPs) provide a very simple alternative to pixelation: 'EPs from a threshold provide one piece of information to the decision maker: a probability.' and so can convey useful information to decision makers in a single map, whether it be a cloropleth map or a map of spatially continuous predictions.

Setup

# Load and attach required packages
library(pixelate) 

if (!require("ggplot2")) {
  stop("Package ggplot2 is needed to build this vignette. Please install it.")
}

Quick example

To use pixelate, simply input some spatial predictions and plot the output as follows. Please follow the detailed example in the next section to better understand the versatility of pixelate and to generate prettier plots.

# Pixelate 
px <- pixelate(SubSaharanAfrica_Pf_incidence) 

# Plot
ggplot(px$pix_df) + geom_raster(mapping = aes(x = x, y = y, fill = pix_z))

Detailed example

Define a wrapper function for plotting

Before we begin, to avoid duplicating code over multiple plots, let's define a function, plot_sp_pred, that wraps various calls to functions from the package ggplot2 @ggplot2.

The plot_sp_pred function generates a raster plot with country borders if possible. Country borders enhance plots but are inessential. We attempt to add them using shape file data and functions ggplot2::geom_polygon and ggplot2::coord_sf. The latter requires the sf package whose installation can be problematic because it may require some external packages. To ensure the code runs for all, including those who have problems with the installation of sf, we check for the availability of sf and generate plots accordingly, with and without country borders.

Shape file data (see e.g. ?SubSaharanAfrica_shp) were obtained using the function malariaAtlas::getShp from the package malariaAtlas @malariaAtlas; see get_shape_files.R in the data-raw/ directory of the source package, available online github.com/artaylor85/pixelate/data-raw/.

# Function to wrap many functions from ggplot2
plot_sp_pred = function(sp_pred, shp_file, legend_position = c(0.02, 0.01)){

  p <- ggplot() +

    # Raster surface
    geom_raster(data = sp_pred, mapping = aes(x = x, y = y, fill = z)) +
    ylab('Latitude (degrees)') + xlab('Longitude (degrees)') + 

    # Modify the theme:
    theme(legend.justification = c(0, 0),
          legend.position = legend_position,
          legend.background = element_rect(fill = NA),
          legend.title = element_text(size = 8),
          legend.text = element_text(size = 8),
          text = element_text(family = "serif"))  

  # Check the availability of package sf (needed for ggplot2::coord_sf) 
  # and proceed accordingly. Note that requireNamespace("sf") is FALSE
  # if sf is available but one of its dependencies (e.g. units) is not
  if (requireNamespace("sf")) { 

    # Set the limits of the plot using coord_sf 
    p <- p + coord_sf(xlim = range(sp_pred$x),    
                      ylim = range(sp_pred$y),
                      expand = FALSE) +

      # Add country borders based on shape file data
      geom_polygon(data = shp_file, aes(x = long, y = lat, group = group), 
                   fill = NA, colour = "gray30", size = 0.1)  
  } else {
    p <- p + coord_fixed(expand = FALSE) 
  }
  return(p)
}

Get and format spatial predictions

The function pixelate supports spatially continuous predictions on any topic providing that they have associated measures of uncertainty. To use pixelate, simply format your spatial predictions, uncertainty measures and coordinates as per the example data sets described in more detail below. A minimum of eight predictions (two by four or four by two) are necessary for pixelation by two or more large pixels.

We illustrate the utility of pixelate using spatial predictions of P. falciparum malaria incidence in 2017 @weiss2019, which are available at the Malaria Atlas Project (MAP) website map.ox.ac.uk. Specifically, we downloaded posterior predictive summaries of P. falciparum incidence by selecting "ANNUAL MEAN OF PF INCIDENCE" at map.ox.ac.uk/malaria-burden-data-download/. We formatted the downloads using the script format_pf_incidence.R in the data-raw/ directory of the source package, available online github.com/artaylor85/pixelate/data-raw/.

(As an aside, the R package malariaAtlas @malariaAtlas provides a direct R interface to parasite predictions and MAP rasters via malariaAtlas::getPR and malariaAtlas::getRaster. However, at the time of writing, uncertainty measures were not available via the package malariaAtlas.)

The formatted incidence predictions are included in the installed package pixelate. They are stored in an observation data frame with an observation per row. Each observation has four variables: a longitude, x; a latitude, y; a prediction, z; and an uncertainty measure, u; for more details see ?SubSaharanAfrica_Pf_incidence and ?CentralAfrica_Pf_incidence, where the latter is simply a cropped version of the former. A selection of observations with predictions that are neither zero (e.g. because they are on a mountain top) or NA (e.g. because they are in the ocean) is printed below.

example_count = 5
knitr::kable(CentralAfrica_Pf_incidence[1:example_count,], row.names = FALSE, 
             caption = sprintf("A selection of %s observations.", example_count))

In our example z is the median predicted incidence and u is the width of the 0.95\% predicted incidence credible interval. (Instead of taking the median and credible interval width as z and u respectively, we could have taken other measures of location and uncertainty, e.g. the mean and standard deviation as in @van2019.) Both z and u are plotted in turn below. But first let's create a box to delineate the central Africa region that features in the example data set CentralAfrica_Pf_incidence:

CA_box = CentralAfrica_Pf_incidence
CA_box = data.frame(x = c(min(CA_box$x), min(CA_box$x), 
                          max(CA_box$x), max(CA_box$x), min(CA_box$x)),
                    y = c(min(CA_box$y), max(CA_box$y), 
                          max(CA_box$y), min(CA_box$y), min(CA_box$y)),
                    id = rep(1,5))

Now let's plot z (median predicted incidence) and u (uncertainty of predicted incidence), adding the box to delineate the central Africa region:

# Set up the plot using plot_sp_pred
unpix <- plot_sp_pred(sp_pred = SubSaharanAfrica_Pf_incidence, 
                      shp_file = SubSaharanAfrica_shp) 

# Add gradient 
unpix <- unpix + scale_fill_gradientn(name = "Median \nincidence", 
                                      colors = c("seashell", "tomato", "darkred"), 
                                      na.value = 'lightblue2') + 
  # Add box
  geom_polygon(data = CA_box, aes(x = x, y = y, group = id), fill = NA,
               colour = "black", size = 0.5)

# Plot
unpix

#  Create a temporary observation data frame and replace z by u 
temp_obs <- SubSaharanAfrica_Pf_incidence
temp_obs$z <- SubSaharanAfrica_Pf_incidence$u

# Set up the plot using plot_sp_pred
uncrt <- plot_sp_pred(sp_pred = temp_obs, shp_file = SubSaharanAfrica_shp)

# Add gradient 
uncrt <- uncrt + scale_fill_gradientn(name = "95% credible \ninterval width", 
                                      colors = c("seashell", "darkgray", "black"), 
                                      na.value = 'lightblue2') + 

  # Add box
  geom_polygon(data = CA_box, aes(x = x, y = y, group = id), fill = NA,
               colour = "black", size = 0.5)

# Plot
uncrt

Uncertainty in predicted P. falciparum incidence (figure above) is low in regions were there are epidemiological data and in regions where there are strong predictors of incidence.

Pixelate spatial predictions using `pixelate`

Spatial predictions are input into pixelate via an argument called obs_df. An observation (obs) refers to a set containing a single prediction throughout the code and package documentation, whereas pixels refer to squares or rectangles comprised of one or more observations and thus predictions. Note that the warning regards averaging uncertainty is silenced in the following examples.

px <- pixelate(obs_df = SubSaharanAfrica_Pf_incidence)

The above call to pixelate implicitly relies on default values for a set of pixelation arguments; see ?pixelate to see what they are. The default values will fail if there are fewer than 480 by 480 predictions in the obs_df. To pixelate a obs_df with fewer predictions, reduce the number of large pixels (argument num_bigk_pix) and/or the number of different pixel sizes (argument bigk):

px_CA <- pixelate(obs_df = CentralAfrica_Pf_incidence,
                  num_bigk_pix = c(12,12)) # Change from default, which is 15

If the obs_df has more than 480 by 480 predictions, you can increase the number of large pixels and/or the number of different pixel sizes:

px_SSA <- pixelate(obs_df = SubSaharanAfrica_Pf_incidence,
                   num_bigk_pix = c(20,20)) # Change from default, which is 15

It is also possible to explore pixels that scale in size (observations per pixel) more rapidly by increasing the scale_factor argument and changing the scale argument from imult to iexpn. However, because the observations per pixel can scale extremely rapidly when default pixelation arguments are changed, the prediction count required for seemingly modest argument combinations may exceed those available and thus generate an error, e.g.

px <- pixelate(obs_df = SubSaharanAfrica_Pf_incidence,
               scale = "iexpn", # Change from default, which is "imult"
               scale_factor = 2) # Change from default, which is 1

Plot the `pixelate`-ed output

To better appreciate the different pixel sizes, we first generate a pixel size legend layer as follows:

gap_y <- 20 # Number of observations between example pixels and left edge of the plot
gap_x <- 350 # Number of observations from the bottom edge of the plot 
pix_legend <- px_SSA$pix_df # Copy coordinates
pix_legend$z <- NA # Set z to NA 
bigk <- nrow(px_SSA$opp) # Extract the number of pixel sizes 

csy <- c(0,cumsum(px_SSA$opp$y)) # Cumulatively add pixel sizes 
xs <- sort(unique(pix_legend$x)) # Extract unique longitude (degrees)
ys <- sort(unique(pix_legend$y)) # Extract unique latitude (degrees)

for(k in 1:bigk){ # For each of the pixel sizes
  ind_y <- gap_y*k + ((csy[k] + 1):csy[k+1]) # Indices of latitude of example pixels
  ind_x <- (gap_x + 1):(gap_x + px_SSA$opp$x[k]) # Indices of longitude of example pixels

  # Indices in the observation data frame 
  inds <- which(pix_legend$x %in% xs[ind_x] & pix_legend$y %in% ys[ind_y]) 
  pix_legend$z[inds] <- 0 # Set location measure of example pixel predictions to zero
}

# Remove all NAs
pix_legend <- pix_legend[!is.na(pix_legend$z), ]

We then add the pixels size legend layer to a plot of the pixelated spatial predictions as follows.

# Create a temporary observation data frame and set z to pixelated pix_z
temp_obs <- px_SSA$pix_df
temp_obs$z <- temp_obs$pix_z

# Generate a pixelated plot
pix_SSA <- plot_sp_pred(sp_pred = temp_obs, shp_file = SubSaharanAfrica_shp) 

# Add gradient 
pix_SSA <- pix_SSA + scale_fill_gradientn(name = "Average median \nincidence", 
                                          colors =  c("seashell", "tomato", "darkred"), 
                                          na.value = 'lightblue2') + 

  # Add pixel size legend layer with title
  geom_tile(data = pix_legend, mapping = aes(x = x, y = y, fill = z)) +
  annotate("text", x = min(pix_legend$x), y = max(pix_legend$y) + 1.5,
           label = "Pixel sizes", family = "serif", 
           size = 2.5, hjust = 0) 

# Plot 
pix_SSA

Following the example above, we can plot the bespoke pixelation of the central Africa region (i.e. pixelation of the predictions within the boxed region of the unpixelated plot as per the uncertainties within the boxed region of the uncertainty plot):

# Pixel size legend
gap_y <- 5 # Number of observations between example pixels and left edge of the plot
gap_x <- 5 # Number of observations from the bottom edge of the plot
pix_legend <- px_CA$pix_df # Copy coordinates
pix_legend$z <- NA # Set z to NA
bigk <- nrow(px_CA$opp) # Extract the number of pixel sizes

csy <- c(0,cumsum(px_CA$opp$y)) # Cumulatively add pixel sizes 
xs <- sort(unique(pix_legend$x)) # Extract unique longitude (degrees)
ys <- sort(unique(pix_legend$y)) # Extract unique latitude (degrees)

for(k in 1:bigk){ # For each of the pixel sizes
  ind_y <- gap_y*k + ((csy[k] + 1):csy[k+1]) # Indices of latitude of example pixels
  ind_x <- (gap_x + 1):(gap_x + px_CA$opp$x[k]) # Indices of longitude of example pixels

  # Indices in the observation data frame
  inds <- which(pix_legend$x %in% xs[ind_x] & pix_legend$y %in% ys[ind_y]) 
  pix_legend$z[inds] <- 0 # Set location measure of examples pixel predictions to zero
}

# Remove all NAs
pix_legend <- pix_legend[!is.na(pix_legend$z), ]

# Create a temporary data frame and set z to pixelated pix_z
temp_obs <- px_CA$pix_df
temp_obs$z <- temp_obs$pix_z

# Generate a pixelated plot
pix_CA <- plot_sp_pred(sp_pred = temp_obs, shp_file = CentralAfrica_shp, 
                       legend_position = "right")


# Add gradient
pix_CA <- pix_CA + scale_fill_gradientn(name = "Average \nmedian \nincidence",
                                        colors =  c("seashell", "tomato", "darkred"),
                                        na.value = 'lightblue2') +

  # Add pixel size legend layer with title
  geom_tile(data = pix_legend, mapping = aes(x = x, y = y, fill = z)) +
  annotate("text", x = min(pix_legend$x), y = max(pix_legend$y) + 0.8,
           label = "Pixel \nsizes", family = "serif",
           size = 2, hjust = 0)


# Add border 
pix_CA_border <- pix_CA + theme(panel.border = element_rect(colour = "black", fill = NA, size = 1))

# Plot
pix_CA_border

Note that the pixelated plots of both sub-Saharan Africa and central Africa feature perfectly resolved blue and white regions. This is because, importantly, missing predictions (coloured blue) and predictions that are zero with certainty (coloured white) are excluded from the entire pixelation process (i.e. computation and classification of average uncertainty, and computation of average prediction across large and nested pixel sizes). Missing predictions and predictions that are zero with certainty thus appear exactly as they do in the unpixelated plot and the plot of uncertainty.

Understanding how average uncertainty relates to pixel size

As mentioned above pixelate works by grouping predictions into a specified number of large pixels and for each large pixel either averaging predictions across it or across smaller pixels nested within it depending on its average uncertainty. (These averaged predictions are then plotted.) Specifically, each large pixel is allocated a nested pixel size depending on which quantile interval its average uncertainty falls into, where the number of quantile intervals is equal to the number of different nested pixel sizes and the quantiles are based on the empirical distribution of average uncertainty. For example, in the pixelated plot of sub-Saharan Africa the nested pixel sizes correspond to the following quantile intervals.

# Extract bigk
bigk <- nrow(px_SSA$opp)

# Define quantile intervals
qs <- format(as.numeric(gsub('%', '',names(px_SSA$uncertainty_breaks)))/100, 
             digits = 2)
quantile_intervals <- sapply(2:(bigk+1), function(k){
  bounds <- qs[c(k-1,k)]
  paste(bounds, collapse = ' to ')
})

# Define average uncertainty intervals
avus <- format(px_SSA$uncertainty_breaks, digits = 2)
avuncertainty_intervals <- sapply(2:(bigk+1), function(k){
  bounds <- avus[c(k-1,k)]
  paste(bounds, collapse = ' to ')
})

# Group together
pixel_table <- cbind(quantile_intervals, avuncertainty_intervals, px_SSA$opp) 

# Generate table
knitr::kable(pixel_table, row.names = FALSE, 
             col.names = c("Quantile interval", 
                           "Average uncertainty", 
                           "Nested pixel width (obs)", 
                           "Nested pixel height (obs)"), 
             align = rep("c",ncol(pixel_table)))

We can plot the large pixel allocations to see how uncertainty varies across large pixels (i.e. which quantile interval each large pixel falls into):

# Set z equal to the large pixel average uncertainty quantile interval index
temp_obs <- px_SSA$pix_df
temp_obs$z <- as.factor(px_SSA$pix_df$bins)

# Generate and print the plot
plot_sp_pred(sp_pred = temp_obs, shp_file = SubSaharanAfrica_shp) +
  scale_fill_discrete(name = "Average uncertainty \nquantile interval", 
                      na.value = 'lightblue2',
                      h = c(120, 360), 
                      labels = c(quantile_intervals, 'NA'))

From the plot and table of quantile intervals, we can see that the greenest regions have high certainty and thus predictions in these regions of the pixelated plot of sub-Saharan Africa are not averaged (the nested pixels are one by one observation). Meanwhile predictions in the pixelated plot of sub-Saharan Africa that fall with blue regions of the quantile plot are averaged over moderately sized nested pixels (six by six observations). The redest regions (that cover large parts of the Democratic Republic of the Congo) have high uncertainty; predictions in the pixelated plot of sub-Saharan Africa that fall within pink regions of the quantile plot are averaged across the entire large pixel, which is 48 by 48 observations.