R/multi_scale.R
In waywiser: Ergonomic Methods for Assessing Spatial Models

Documented in ww_multi_scale

#' Evaluate metrics at multiple scales of aggregation
#'
#' @section Raster inputs:
#'
#' If `data` is `NULL`, then `truth` and `estimate` should both be `SpatRaster`
#' objects, as created via [terra::rast()]. These rasters will then be
#' aggregated to each grid using [exactextractr::exact_extract()]. If `data`
#' is a `SpatRaster` object, then `truth` and `estimate` should be indices to
#' select the appropriate layers of the raster via [terra::subset()].
#'
#' Grids are calculated using the bounding box of `truth`, under the assumption
#' that you may have extrapolated into regions which do not have matching "true"
#' values. This function does not check that `truth` and `estimate` overlap at
#' all, or that they are at all contained within the grid.
#'
#' @section Creating grid blocks:
#'
#' The grid blocks can be controlled by passing arguments to
#' [sf::st_make_grid()] via `...`. Some particularly useful arguments include:
#'
#' * `cellsize`: Target cellsize, expressed as the "diameter" (shortest
#' straight-line distance between opposing sides; two times the apothem)
#' of each block, in map units.
#' * `n`: The number of grid blocks in the x and y direction (columns, rows).
#' * `square`: A logical value indicating whether to create square (`TRUE`) or
#' hexagonal (`FALSE`) cells.
#'
#' If both `cellsize` and `n` are provided, then the number of blocks requested
#' by `n` of sizes specified by `cellsize` will be returned, likely not
#' lining up with the bounding box of `data`. If only `cellsize`
#' is provided, this function will return as many blocks of size
#' `cellsize` as fit inside the bounding box of `data`. If only `n` is provided,
#' then `cellsize` will be automatically adjusted to create the requested
#' number of cells.
#'
#' Grids are created by mapping over each argument passed via `...`
#' simultaneously, in a similar manner to [mapply()] or [purrr::pmap()]. This
#' means that, for example, passing `n = list(c(1, 2))` will create a single
#' 1x2 grid, while passing `n = c(1, 2)` will create a 1x1 grid _and_ a 2x2
#' grid. It also means that arguments will be recycled using R's standard
#' vector recycling rules, so that passing `n = c(1, 2)` and `square = FALSE`
#' will create two separate grids of hexagons.
#'
#' This function can be used for geographic or projected coordinate reference
#' systems and expects 2D data.
#'
#' @param data Either: a point geometry `sf` object containing the columns
#' specified by the `truth` and `estimate` arguments; a `SpatRaster` from
#' the `terra` package containing layers specified by the `truth` and `estimate`
#' arguments; or `NULL` if `truth` and `estimate` are `SpatRaster` objects.
#' @param truth,estimate If `data` is an `sf` object, the names (optionally
#' unquoted) for the columns in `data` containing the true and predicted values,
#' respectively. If `data` is a `SpatRaster` object, either (quoted) layer names or
#' indices which will select the true and predicted layers, respectively, via
#' [terra::subset()] If `data` is `NULL`, `SpatRaster` objects with a single
#' layer containing the true and predicted values, respectively.
#' @param metrics Either a [yardstick::metric_set()] object, or a list of
#' functions which will be used to construct a [yardstick::metric_set()] object
#' specifying the performance metrics to evaluate at each scale.
#' @param grids Optionally, a list of pre-computed `sf` or `sfc` objects
#' specifying polygon boundaries to use for assessments.
#' @param ... Arguments passed to [sf::st_make_grid()].
#' **You almost certainly should provide these arguments as lists.**
#' For instance, passing `n = list(c(1, 2))` will create a single 1x2 grid;
#' passing `n = c(1, 2)` will create a 1x1 grid _and_ a 2x2 grid.
#' @param aggregation_function The function to use to aggregate predictions and
#' true values at various scales, by default [mean()]. For the `sf` method,
#' you can pass any function which takes a single vector and returns a scalar.
#' For raster methods, any function accepted by
#' [exactextractr::exact_extract()] (note that built-in function names must be
#' quoted). Note that this function does _not_ pay attention to the value of
#' `na_rm`; any NA handling you want to do during aggregation should be handled
#' by this function directly.
#' @param na_rm Boolean: Should polygons with NA values be removed before
#' calculating metrics? Note that this does _not_ impact how values are
#' aggregated to polygons: if you want to remove NA values before aggregating,
#' provide a function to `aggregation_function` which will remove NA values.
#' @param autoexpand_grid Boolean: if `data` is in geographic coordinates and
#' `grids` aren't provided, the grids generated by [sf::st_make_grid()] may not
#' contain all observations. If `TRUE`, this function will automatically expand
#' generated grids by a tiny factor to attempt to capture all observations.
#' @param progress Boolean: if `data` is `NULL`, should aggregation via
#' [exactextractr::exact_extract()] show a progress bar? Separate progress bars
#' will be shown for each time `truth` and `estimate` are aggregated.
#'
#' @return A tibble with six columns: `.metric`, with the name
#' of the metric that the row describes; `.estimator`, with the name of the
#' estimator used, `.estimate`, with the output of the metric function;
#' `.grid_args`, with the arguments passed to [sf::st_make_grid()] via `...`
#' (if any), `.grid`, containing the grids used to aggregate predictions,
#' as well as the aggregated values of `truth` and `estimate` as well as the
#' count of non-NA values for each, and `.notes`, which (if `data` is an `sf`
#' object) will indicate any observations which were not used in a given
#' assessment.
#'
#' @examplesIf rlang::is_installed("modeldata")
#' data(ames, package = "modeldata")
#' ames_sf <- sf::st_as_sf(ames, coords = c("Longitude", "Latitude"), crs = 4326)
#' ames_model <- lm(Sale_Price ~ Lot_Area, data = ames_sf)
#' ames_sf$predictions <- predict(ames_model, ames_sf)
#'
#' ww_multi_scale(
#'   ames_sf,
#'   Sale_Price,
#'   predictions,
#'   n = list(
#'     c(10, 10),
#'     c(1, 1)
#'   ),
#'   square = FALSE
#' )
#'
#' # or, mostly equivalently
#' # (there will be a slight difference due to `autoexpand_grid = TRUE`)
#' grids <- list(
#'   sf::st_make_grid(ames_sf, n = c(10, 10), square = FALSE),
#'   sf::st_make_grid(ames_sf, n = c(1, 1), square = FALSE)
#' )
#' ww_multi_scale(ames_sf, Sale_Price, predictions, grids = grids)
#'
#' @references
#' Riemann, R., Wilson, B. T., Lister, A., and Parks, S. (2010). "An effective
#' assessment protocol for continuous geospatial datasets of forest
#' characteristics using USFS Forest Inventory and Analysis (FIA) data."
#' Remote Sensing of Environment 114(10), pp 2337-2352,
#' doi: 10.1016/j.rse.2010.05.010 .
#'
#' @export
ww_multi_scale <- function(
  data = NULL,
  truth,
  estimate,
  metrics = list(yardstick::rmse, yardstick::mae),
  grids = NULL,
  ...,
  na_rm = TRUE,
  aggregation_function = "mean",
  autoexpand_grid = TRUE,
  progress = TRUE
) {
  if (length(na_rm) != 1 || !is.logical(na_rm)) {
    rlang::abort("Only one logical value can be passed to `na_rm`.")
  }

  if (missing(data) || is.null(data)) {
    ww_multi_scale_raster_args(
      data = data,
      truth = truth,
      estimate = estimate,
      metrics = metrics,
      grids = grids,
      ...,
      na_rm = na_rm,
      aggregation_function = aggregation_function,
      autoexpand_grid = autoexpand_grid,
      progress = progress
    )
  } else {
    UseMethod("ww_multi_scale", data)
  }
}

#' @exportS3Method
ww_multi_scale.SpatRaster <- function(
  data = NULL,
  truth,
  estimate,
  metrics = list(yardstick::rmse, yardstick::mae),
  grids = NULL,
  ...,
  na_rm = TRUE,
  aggregation_function = "mean",
  autoexpand_grid = TRUE,
  progress = TRUE
) {
  rlang::check_installed("terra")
  rlang::check_installed("exactextractr")

  data <- prep_multi_scale_raster(data, truth, estimate)
  metrics <- handle_metrics(metrics)
  grid_list <- handle_grids(data, grids, autoexpand_grid, sf::st_crs(data), ...)

  grid_list$grids <- lapply(
    grid_list$grids,
    spatraster_extract,
    data,
    aggregation_function,
    progress
  )

  .notes <- raster_method_notes(grid_list)

  raster_method_summary(grid_list, .notes, metrics, na_rm)
}

prep_multi_scale_raster <- function(data, truth, estimate) {
  data <- tryCatch(
    terra::subset(data, c(truth, estimate)),
    error = function(e) {
      rlang::abort(
        "Couldn't select either `truth` or `estimate`. Are your indices correct?"
      )
    }
  )

  if (terra::nlyr(data) != 2) {
    rlang::abort(c(
      "`terra::subset(data, c(truth, estimate))` didn't return 2 layers as expected.",
      i = "Make sure `truth` and `estimate` both select exactly one layer."
    ))
  }
  names(data) <- c("truth", "estimate")
  data
}

spatraster_extract <- function(grid, data, aggregation_function, progress) {
  grid <- sf::st_as_sf(grid)
  sf::st_geometry(grid) <- "geometry"
  exactextract_names <- c(".truth", ".estimate")
  if (
    !rlang::is_function(aggregation_function) && aggregation_function != "count"
  ) {
    exactextract_names <- c(
      exactextract_names,
      ".truth_count",
      ".estimate_count"
    )
    aggregation_function <- c(aggregation_function, "count")
  }
  grid_df <- exactextractr::exact_extract(
    data,
    grid,
    fun = aggregation_function,
    progress = progress
  )
  names(grid_df) <- exactextract_names

  if (length(exactextract_names) == 4L) {
    exactextract_names <- exactextract_names[c(1, 3, 2, 4)]
  }

  cbind(grid, grid_df)[c(exactextract_names, "geometry")]
}

ww_multi_scale_raster_args <- function(
  data = NULL,
  truth,
  estimate,
  metrics = list(yardstick::rmse, yardstick::mae),
  grids = NULL,
  ...,
  na_rm = TRUE,
  aggregation_function = "mean",
  autoexpand_grid = TRUE,
  progress = TRUE
) {
  rlang::check_installed("terra")
  rlang::check_installed("exactextractr")

  if (!inherits(truth, "SpatRaster") || terra::nlyr(truth) != 1) {
    rlang::abort("`truth` must be a SpatRaster with only one layer.")
  }

  if (!inherits(estimate, "SpatRaster") || terra::nlyr(estimate) != 1) {
    rlang::abort("`estimate` must be a SpatRaster with only one layer.")
  }

  if (sf::st_crs(truth) != sf::st_crs(estimate)) {
    rlang::abort("`truth` and `estimate` must share a CRS.")
  }

  metrics <- handle_metrics(metrics)
  grid_list <- handle_grids(
    truth,
    grids,
    autoexpand_grid,
    sf::st_crs(data),
    ...
  )

  grid_list$grids <- lapply(
    grid_list$grids,
    function(grid) {
      grid <- sf::st_as_sf(grid)
      sf::st_geometry(grid) <- "geometry"
      if (
        rlang::is_function(aggregation_function) ||
          aggregation_function == "count"
      ) {
        grid$.truth <- exactextractr::exact_extract(
          truth,
          grid,
          fun = aggregation_function,
          progress = progress
        )
        grid$.estimate <- exactextractr::exact_extract(
          estimate,
          grid,
          fun = aggregation_function,
          progress = progress
        )
        grid[c(".truth", ".estimate", "geometry")]
      } else {
        truth_df <- exactextractr::exact_extract(
          truth,
          grid,
          fun = c(aggregation_function, "count"),
          progress = progress
        )
        names(truth_df) <- c(".truth", ".truth_count")

        estimate_df <- exactextractr::exact_extract(
          estimate,
          grid,
          fun = c(aggregation_function, "count"),
          progress = progress
        )
        names(estimate_df) <- c(".estimate", ".estimate_count")

        cbind(grid, truth_df, estimate_df)[c(
          ".truth",
          ".truth_count",
          ".estimate",
          ".estimate_count",
          "geometry"
        )]
      }
    }
  )

  .notes <- raster_method_notes(grid_list)

  raster_method_summary(grid_list, .notes, metrics, na_rm)
}

raster_method_notes <- function(grid_list) {
  lapply(
    seq_along(grid_list$grids),
    function(idx) {
      tibble::tibble(
        note = character(0),
        missing_indices = list()
      )
    }
  )
}

raster_method_summary <- function(grid_list, .notes, metrics, na_rm) {
  class_metrics <- FALSE
  prob_metrics <- FALSE
  if (inherits(metrics, "class_prob_metric_set")) {
    is_class_metric <- tibble::as_tibble(metrics)$class == "class_metric"

    if (any(is_class_metric) && !all(is_class_metric)) {
      rlang::abort(
        c(
          "`ww_multi_scale` can't handle mixed classification and class probability metric sets.",
          i = "Call `ww_multi_scale()` twice: once for classification metrics, and once for class probability metrics."
        ),
        class = "waywiser_mixed_metrics"
      )
    }

    class_metrics <- all(is_class_metric)
    prob_metrics <- !class_metrics
  }

  if (class_metrics || prob_metrics) {
    lvls <- unique(
      unlist(
        lapply(
          grid_list$grids,
          function(grid) {
            out <- levels(factor(grid$.truth))
            if (class_metrics) {
              out <- c(out, levels(factor(grid$.estimate)))
            }
            out
          }
        )
      )
    )

    grid_list$grids <- lapply(
      grid_list$grids,
      function(grid) {
        grid$.truth <- factor(grid$.truth, levels = lvls)
        if (class_metrics) {
          grid$.estimate <- factor(grid$.estimate, levels = lvls)
        }
        grid
      }
    )
  }

  out <- mapply(
    function(grid, grid_arg, .notes) {
      if (prob_metrics) {
        out <- metrics(
          as.data.frame(grid),
          truth = .truth,
          .estimate,
          na_rm = na_rm
        )
      } else {
        out <- metrics(
          grid,
          truth = .truth,
          estimate = .estimate,
          na_rm = na_rm
        )
      }
      out[attr(out, "sf_column")] <- NULL
      out$.grid_args <- list(grid_list$grid_args[grid_arg, ])
      out$.grid <- list(grid)
      out$.notes <- list(.notes)
      out
    },
    grid = grid_list$grids,
    grid_arg = grid_list$grid_arg_idx,
    .notes = .notes,
    SIMPLIFY = FALSE
  )
  do.call(dplyr::bind_rows, out)
}

#' @exportS3Method
ww_multi_scale.sf <- function(
  data,
  truth,
  estimate,
  metrics = list(yardstick::rmse, yardstick::mae),
  grids = NULL,
  ...,
  na_rm = TRUE,
  aggregation_function = "mean",
  autoexpand_grid = TRUE,
  progress = TRUE
) {
  if (nrow(data) == 0) {
    rlang::abort(
      "0 rows were passed to `data`."
    )
  }

  check_multi_scale_data(data)
  metrics <- handle_metrics(metrics)

  truth_var <- tidyselect::eval_select(rlang::expr({{ truth }}), data)
  estimate_var <- tidyselect::eval_select(rlang::expr({{ estimate }}), data)

  if (!is.numeric(data[[truth_var]])) {
    rlang::abort("`truth` must be numeric.")
  }

  if (!is.numeric(data[[estimate_var]])) {
    rlang::abort("`estimate` must be numeric.")
  }

  data_crs <- sf::st_crs(data)
  grid_list <- handle_grids(data, grids, autoexpand_grid, data_crs, ...)

  data$.grid_idx <- seq_len(nrow(data))
  out <- mapply(
    function(grid, grid_args_idx) {
      grid_args <- grid_list[["grid_args"]][grid_args_idx, ]

      grid <- prep_grid(data, grid, data_crs)
      grid$grid_cell_idx <- seq_len(nrow(grid))

      grid_matches <- sf::st_join(
        grid,
        data[".grid_idx"],
        left = FALSE
      )
      grid_matches <- sf::st_drop_geometry(grid_matches)

      notes_tibble <- make_notes_tibble(data, grid_matches)

      matched_data <- match_data(
        data,
        grid_matches,
        matched_data,
        truth_var,
        estimate_var,
        aggregation_function
      )

      out <- metrics(
        matched_data,
        truth = .truth,
        estimate = .estimate,
        na_rm = na_rm
      )
      out["grid_cell_idx"] <- NULL
      out[attr(out, "sf_column")] <- NULL
      out$.grid_args <- list(grid_args)
      .grid <- dplyr::left_join(
        grid,
        matched_data,
        by = dplyr::join_by(grid_cell_idx)
      )
      .grid["grid_cell_idx"] <- NULL
      out$.grid <- list(.grid)
      out$.notes <- list(notes_tibble)
      out
    },
    grid = grid_list$grids,
    grid_args_idx = grid_list$grid_arg_idx,
    SIMPLIFY = FALSE
  )
  out <- dplyr::bind_rows(out)

  if (any(vapply(out[[".notes"]], function(x) nrow(x) > 0, logical(1)))) {
    rlang::warn(
      c(
        "Some observations were not within any grid cell, and as such were not used in any assessments.",
        i = "See the `.notes` column for details."
      )
    )
  }

  out
}

handle_metrics <- function(metrics) {
  if (inherits(metrics, "metric")) metrics <- list(metrics)
  if (!inherits(metrics, "metric_set")) {
    metrics <- do.call(yardstick::metric_set, metrics)
  }
  metrics
}

handle_grids <- function(data, grids, autoexpand_grid, data_crs, ...) {
  if (is.null(grids)) {
    grid_args <- rlang::list2(...)
    if ("crs" %in% names(grid_args)) {
      rlang::warn(
        c(
          "The `crs` argument (passed via `...`) will be ignored.",
          i = "Grids will be created using the same crs as `data`."
        ),
        call = rlang::caller_env()
      )
      grid_args["crs"] <- NULL
    }
    grid_arg_idx <- max(vapply(grid_args, length, integer(1)))
    grid_args <- stats::setNames(
      lapply(
        grid_args,
        function(x) rep(x, length.out = grid_arg_idx)
      ),
      names(grid_args)
    )
    grid_args <- tibble::as_tibble(grid_args)
    grid_arg_idx <- seq_len(nrow(grid_args))

    grid_box <- sf::st_bbox(data)
    if (is_longlat(data) && autoexpand_grid) {
      # cf https://github.com/ropensci/stplanr/pull/467
      # basically: spherical geometry means sometimes the straight line of the
      # grid will exclude points within the bounding box
      #
      # so here we'll expand our boundary by a small bit in order to always contain our
      # points within the grid
      grid_box <- expand_grid(grid_box)
    }
    grids <- lapply(
      grid_arg_idx,
      function(idx) {
        arg <- lapply(
          names(grid_args),
          function(arg) {
            grid_args[[arg]][[idx]]
          }
        )
        names(arg) <- names(grid_args)
        do.call(
          sf::st_make_grid,
          c(x = list(grid_box), crs = list(data_crs), arg)
        )
      }
    )
  } else {
    rlang::check_dots_empty(call = rlang::caller_env())
    grid_args <- tibble::tibble()
    grid_arg_idx <- 0
    if (!is.na(data_crs)) {
      grids <- lapply(grids, sf::st_transform, sf::st_crs(data))
    }
  }

  list(
    grids = grids,
    grid_args = grid_args,
    grid_arg_idx = grid_arg_idx
  )
}

match_data <- function(
  data,
  grid_matches,
  matched_data,
  truth_var,
  estimate_var,
  aggregation_function
) {
  matched_data <- dplyr::left_join(
    data,
    grid_matches,
    by = dplyr::join_by(.grid_idx)
  )
  matched_data <- sf::st_drop_geometry(matched_data)
  matched_data <- matched_data[!is.na(matched_data[["grid_cell_idx"]]), ]
  matched_data <- dplyr::group_by(
    matched_data,
    dplyr::across(dplyr::all_of(c(dplyr::group_vars(data), "grid_cell_idx")))
  )

  matched_data <- dplyr::summarise(
    matched_data,
    .truth = rlang::exec(
      .env[["aggregation_function"]],
      .data[[names(truth_var)]]
    ),
    .truth_count = sum(!is.na(.data[[names(truth_var)]])),
    .estimate = rlang::exec(
      .env[["aggregation_function"]],
      .data[[names(estimate_var)]]
    ),
    .estimate_count = sum(!is.na(.data[[names(estimate_var)]])),
    .groups = "drop"
  )

  if (dplyr::is_grouped_df(data)) {
    matched_data <- dplyr::group_by(matched_data, !!!dplyr::groups(data))
  }
  matched_data
}

prep_grid <- function(data, grid, data_crs) {
  grid <- sf::st_as_sf(grid)
  grid_crs <- sf::st_crs(grid)
  # If both have a CRS, reproject
  if (!is.na(data_crs) && !is.na(grid_crs) && (grid_crs != data_crs)) {
    grid <- sf::st_transform(grid, data_crs)
    # if only data has CRS, assume grid in same
  } else if (!is.na(data_crs)) {
    grid <- sf::st_set_crs(grid, data_crs)
  }
  # if neither has a CRS, ignore (so, implicitly assume grid is in same)
  grid
}

make_notes_tibble <- function(data, grid_matches) {
  missing <- setdiff(data[[".grid_idx"]], grid_matches[[".grid_idx"]])
  note <- character(0)
  if (length(missing) > 0) {
    note <- "Some observations were not within any grid cell, and as such were not used in any assessments. Their row numbers are in the `missing_indices` column."
    missing <- list(missing)
  } else {
    missing <- list()
  }

  tibble::tibble(
    note = note,
    missing_indices = missing
  )
}

check_multi_scale_data <- function(data) {
  if (
    any(
      names(data) %in%
        c(".truth", ".estimate", ".truth_count", ".estimate_count")
    )
  ) {
    rlang::abort(
      c(
        "This function cannot work with data whose columns are named `.truth`, `.estimate`, `.truth_count`, or `estimate_count`.",
        i = "Rename the relevant columns and try again."
      ),
      call = rlang::caller_env()
    )
  }

  geom_type <- unique(sf::st_geometry_type(data))
  if (!(length(geom_type) == 1 && geom_type == "POINT")) {
    rlang::abort(
      c(
        "ww_multi_scale is currently only implemented for point geometries.",
        i = "Consider casting your data to points."
      ),
      call = rlang::caller_env()
    )
  }
}

#' Expand geographic bounding boxes slightly
#'
#' Because we're drawing straight lines on spheres when working with geographic
#' coordinates, it's entirely possible to have points within a bounding box but
#' outside of the straight lines between the corners. As this is almost never
#' expected, this function adds a tiny fudge factor to bounding boxes in order
#' to "catch" points.
#'
#' @param grid_box The output from [sf::st_bbox()]
#' @param expansion The expansion factor: what fraction should each coordinate
#' be adjusted by?
#'
#' @return A very slightly buffered bounding box
#'
#' @references
#' https://github.com/ropensci/stplanr/pull/467
#'
#' @noRd
expand_grid <- function(grid_box, expansion = 0.00001) {
  grid_box[1] <- grid_box[1] - abs(grid_box[1] * expansion)
  grid_box[2] <- grid_box[2] - abs(grid_box[2] * expansion)
  grid_box[3] <- grid_box[3] + abs(grid_box[3] * expansion)
  grid_box[4] <- grid_box[4] + abs(grid_box[4] * expansion)
  grid_box
}

#' Check if an sf object is in geographic coordinates
#'
#' This function adjusts [sf::st_is_longlat()] so that data without a CRS,
#' such as simulated data on arbitrary grids, is treated as non-geographic.
#'
#' @inheritParams sf::st_is_longlat
#'
#' @noRd
is_longlat <- function(x) {
  !(sf::st_crs(x) == sf::NA_crs_) && sf::st_is_longlat(x)
}