In epiforecasts/scoringutils: Utilities for Scoring and Assessing Predictions

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This Vignette provides a small collection of functions that have been deprecated in scoringutils. These functions are no longer, but may still prove useful or illustrative.

`merge_pred_and_obs()`

scoringutils requires that both forecasts and observations are provided in a single data frame. If you have forecasts and observations in two different data frames, merge_pred_and_obs() may help you to merge the two. The function is mostly a wrapper around merge(), but does some additional work to deal with duplicated column names.

#' @title Merge forecast data and observations
#'
#' @description
#'
#' The function more or less provides a wrapper around `merge` that
#' aims to handle the merging well if additional columns are present
#' in one or both data sets. If in doubt, you should probably merge the
#' data sets manually.
#'
#' @param forecasts A data.frame with the forecast data (as can be passed to
#'   [score()]).
#' @param observations A data.frame with the observations.
#' @param join Character, one of `c("left", "full", "right")`. Determines the
#'   type of the join. Usually, a left join is appropriate, but sometimes you
#'   may want to do a full join to keep dates for which there is a forecast, but
#'   no ground truth data.
#' @param by Character vector that denotes the columns by which to merge. Any
#'   value that is not a column in observations will be removed.
#' @return a data.table with forecasts and observations
#' @importFrom checkmate assert_subset
#' @importFrom data.table as.data.table
#' @keywords data-handling
#' @export

merge_pred_and_obs <- function(forecasts, observations,
                               join = c("left", "full", "right"),
                               by = NULL) {
  forecasts <- as.data.table(forecasts)
  observations <- as.data.table(observations)
  join <- match.arg(join)
  assert_subset(by, intersect(names(forecasts), names(observations)))

  if (is.null(by)) {
    protected_columns <- c(
      "predicted", "observed", "sample_id", "quantile_level",
      "interval_range", "boundary"
    )
    by <- setdiff(colnames(forecasts), protected_columns)
  }

  obs_cols <- colnames(observations)
  by <- intersect(by, obs_cols)

  join <- match.arg(join)

  if (join == "left") {
    # do a left_join, where all data in the observations are kept.
    combined <- merge(observations, forecasts, by = by, all.x = TRUE)
  } else if (join == "full") {
    # do a full, where all data is kept.
    combined <- merge(observations, forecasts, by = by, all = TRUE)
  } else {
    combined <- merge(observations, forecasts, by = by, all.y = TRUE)
  }


  # get colnames that are the same for x and y
  colnames <- colnames(combined)
  colnames_x <- colnames[endsWith(colnames, ".x")]
  colnames_y <- colnames[endsWith(colnames, ".y")]

  # extract basenames
  basenames_x <- sub(".x$", "", colnames_x)
  basenames_y <- sub(".y$", "", colnames_y)

  # see whether the column name as well as the content is the same
  content_x <- as.list(combined[, ..colnames_x])
  content_y <- as.list(combined[, ..colnames_y])
  overlapping <- (content_x %in% content_y) & (basenames_x == basenames_y)
  overlap_names <- colnames_x[overlapping]
  basenames_overlap <- sub(".x$", "", overlap_names)

  # delete overlapping columns
  if (length(basenames_overlap) > 0) {
    combined[, paste0(basenames_overlap, ".x") := NULL]
    combined[, paste0(basenames_overlap, ".y") := NULL]
  }

  return(combined[])
}