In mikejohnson51/zonal: Performant zonal statistics over flexible units

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  echo = TRUE, 
  warning=FALSE, 
  message=FALSE
)
devtools::load_all()

Intro

Here are some basic comparisons between the intersectr and the proposed zonal workflows. zonal was written to support ngen catchment characteristic creation. Where it excels in in the ability to (1) execute on larger data sets with more intersection units in less time using less memory, (2) provider a simpler workflow (3) the ability to handle categorical data and (4) the ability to handel tif data.

Both workflows follow the same 3 steps of:

creating a vector-based representation of gridded domain
determining the percentage of each aggregation unit that falls in each grid cell
computing the weighted average per aggregation unit

Libraries used in one or both approaches

library(dplyr)
library(data.table)
library(RNetCDF)
library(sf)

library(ncmeta)

library(zonal)
library(intersectr)

Base data

The gridded data and aggregate units we are working with can be seen below:

file = '/Users/mjohnson/Downloads/pet_1979.nc'
(meta = nc_dims(file))

The grid we are working with holds daily PET values for CONUS for the year 1979. It has: - r prettyNum(filter(meta, name == 'lon')$length, big.mark = ",") X_cells - r prettyNum(filter(meta, name == 'lat')$length, big.mark = ",") Y_cells - r prettyNum(filter(meta, name == 'day')$length, big.mark = ",") time slices

for a total of r prettyNum(prod(meta$length), big.mark = ",") values.

geom <- read_sf('/Users/mjohnson/github/hydrofabric/workflow/nhd_workflows/cache/ngen_01a-4.gpkg', "catchments") %>% 
  st_make_valid()

glimpse(geom)

In total we have r prettyNum(nrow(geom), big.mark = ",") aggregation units to summarize over the r filter(meta, name == 'day')$length time steps.

Step 1: Creating Cell Geometries

The cell geometry workflow for intersectr is wrapped into a single function and compared to the analogous function build_grid().

wzxhzdk:4

bnch <- bench::mark(
  iterations = 1, check = FALSE,
  intersectr = intersectr_cell_geom(file, geom),
  zonal = build_grid(file, geom)
)

cell_geometry = intersectr_cell_geom(file, geom)
grid          = build_grid(file, geom)

bnch %>%
  dplyr::select(expression, median, mem_alloc) %>%
  mutate(expression = names(expression),
         median_rel = unclass(median/min(median)),
         mem_rel = unclass(mem_alloc/min(mem_alloc))) %>% 
  formattable::formattable()

Step 2: Generating Areal Weights

The areal weights workflow for intersectr is wrapped into a single function and compared to the analogous function in build_weights.

wzxhzdk:7

bnch2 <- bench::mark(
  iterations = 1, check = FALSE,
  intersectr = intersectr_weights(cell_geometry , geom),
  zonal      = build_weights(grid, geom, "comid")
)

area_weights = intersectr_weights(cell_geometry , geom)
w            = build_weights(grid, geom, "comid")

bnch2 %>%
  dplyr::select(expression, median, mem_alloc) %>%
  mutate(expression = names(expression),
         median_rel = unclass(median/min(median)),
         mem_rel = unclass(mem_alloc/min(mem_alloc))) %>% 
  formattable::formattable()

Step 3: Execute

The execution workflow for intersectr is wrapped into a single function and compared to the analogous function in execute.

wzxhzdk:10

bnch3 <- bench::mark(
  iterations = 1, check = FALSE,
  intersectr = intersectr_execute(file, 
                                  "potential_evapotranspiration", 
                                  cell_geometry, 
                                  area_weights),
  zonal = execute(file, w)
)

intersectr = intersectr_execute(file, "potential_evapotranspiration", cell_geometry, area_weights)
zonal = execute(file,  w)

bnch3 %>%
  dplyr::select(expression, median, mem_alloc) %>%
  mutate(expression = names(expression),
         median_rel = unclass(median/min(median)),
         mem_rel = unclass(mem_alloc/min(mem_alloc))) %>% 
  formattable::formattable()

Comparision

intersectr and zonal output data in a different structures, intersectr places time steps as rows, and zonal favors units as rows. The latter is easier to work with when the number of units > then the time steps while the former favor's succinct time series plotting. To make sure we get the same results, we transpose the zonal data to match the structure of the intersectr.

# Intersectr
dim(intersectr)

# Zonal
dim(zonal)

system.time({
  zonal2 = transpose(zonal, keep.names = "time_stamp", make.names = "comid")
})

dim(zonal2)

And test a random catchment...

{plot(intersectr$`857`, zonal2$`857`, pch = 16, cex = .75, 
      xlab = "intersectr (857)",
      ylab = "zonal (857)",
      main = "Comparison")
  abline(0,1, col = "red")
}

Great they match r emo::ji('tada')

Lost Catchments ??

Above we seeintersectr retained 16 catchments that zonal lost To isolate these we can identify those not shared in each set and then map and check the results.

library(leaflet)
(n = names(intersectr)[which(!names(intersectr) %in% names(zonal2))])

leaflet() %>% 
  addTiles() %>% 
  addPolygons(data = st_transform(filter(geom, comid %in%  n ), 4326))

formattable::formattable(intersectr[1:5,n])