R/slr_slope.R

#' Calculate Long Term Estimate of SLR
#'
#' Given a series of (typically autocorrelated) observations of sea level or
#' mean sea level, and associated time coordinates (usually Dates), return a
#' named vector containing a summary of a generalized least squares linear trend,
#' optionally scaled to annual values. This function mimics the calculation of
#' long term estimates of sea level rise and associated uncertainties conducted
#' by NOAA. Where we have checked, results of this function, when based on
#' NOAA's seasonally adjusted mean trend data (See Details), agree with NOAA's
#' trend estimates within rounding error.
#'
#' @details
#' Sea level data tends to be highly autocorrelated over multiple time
#' scales. This function uses generalized least squares (GLS) to fit a linear
#' model with a correlation structure defined by fitting an AR1 process to model
#' residuals.
#'
#' Slope estimates from the AR1 models tend (in this context) to be
#' indistinguishable from estimates generated by ordinary least squares, but the
#' uncertainty of the estimates is substantially greater.  As observations are
#' not independent, ordinary least squares (which assumes independence) will
#' overstate model precision.
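#'
#' A quick way to see the effect (a sketch only, reusing the `prov_meantrend`
#' data and the `MSL_mm` column constructed in the examples below):
#'
#' ```
#' fit_ols <- lm(MSL_mm ~ MidDate, data = prov_meantrend)
#' fit_gls <- nlme::gls(MSL_mm ~ MidDate, data = prov_meantrend,
#'                      correlation = nlme::corAR1())
#' summary(fit_ols)$coefficients[2, 'Std. Error']   # OLS standard error (too small)
#' summary(fit_gls)$tTable[2, 'Std.Error']          # GLS standard error (larger)
#' ```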
#'
#' The AR1 process, an "autoregressive process of order 1", is a simple
#' assumption about correlation structures that reduces the risk of overstating
#' the precision of trend estimates. In an AR1 model, each observation is
#' regressed against the prior observation. Informally, an AR1 process
#' generates a correlation structure based on the lag 1 autocorrelation.
#' Modeled autocorrelations between observations are thus equal to the
#' lag 1 autocorrelation raised to the power of the number of timesteps between
#' the two observations.
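#'
#' For illustration (with a hypothetical lag 1 autocorrelation of 0.8), the
#' implied correlation between observations falls off geometrically with lag:
#'
#' ```
#' phi <- 0.8                                        # hypothetical lag 1 autocorrelation
#' outer(1:5, 1:5, function(i, j) phi ^ abs(i - j))  # implied 5 x 5 correlation matrix
#' ```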
#'
#' The AR1 structure, however, may not be ideal for all tidal time series.
#' Actual tidal data series tend to have more complex autocorrelation structure.
#' Where regular seasonal patterns are evident, explicit modeling is probably
#' called for. Thus the function is principally intended for use with data
#' averaged over monthly or longer periods.
#'
#' This function is a convenience wrapper around `nlme::gls(..., correlation =
#' corAR1())`. For more control, or to examine model diagnostics, the user
#' should fit the generalized least squares model directly.
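#'
#' A minimal sketch of such a direct fit (again assuming the `MSL_mm` column
#' constructed in the examples below), which exposes diagnostics that the
#' summary vector returned by `slr_slope()` does not:
#'
#' ```
#' fit <- nlme::gls(MSL_mm ~ MidDate, data = prov_meantrend,
#'                  correlation = nlme::corAR1())
#' nlme::intervals(fit)   # coefficients and AR1 parameter, with confidence intervals
#' plot(nlme::ACF(fit))   # residual autocorrelation by lag
#' ```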
#'
#' The function does not fit periodic terms. Thus it is generally
#' not suitable for use with "raw" sea level observations, especially those
#' collected at high resolution, like NOAA hourly data. Tidal data contains many
#' periodic components (the roughly 24 hour 50 minute tidal day, spring tide /
#' neap tide cycles,
#' annual cycles, etc.). Those periodic structures will affect estimates of
#' autocorrelation, ultimately making estimates of model uncertainty less
#' reliable. If working from high frequency data, short-term structure should be
#' removed before using this function, typically by calculating average sea
#' level over convenient and meaningful time periods, like months or years.
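#'
#' For example, hourly records might first be reduced to monthly means (a
#' sketch only; `hourly`, `datetime`, and `water_level` are hypothetical names,
#' not data supplied by this package):
#'
#' ```
#' hourly$month <- format(hourly$datetime, '%Y-%m')
#' monthly <- aggregate(water_level ~ month, data = hourly, FUN = mean)
#' monthly$MidDate <- as.Date(paste0(monthly$month, '-15'))
#' ```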
#'
#' The function's original use case was based on secondary analysis of NOAA
#' "mean trend" data.  NOAA calculates long-term sea level rise estimates based
#' on a modified monthly water level data series. The source data is not the
#' same as what one gets if one downloads monthly mean water level through the
#' standard NOAA API. NOAA's "mean trend" data contains "monthly mean sea level
#' without the regular seasonal fluctuations due to coastal ocean temperatures,
#' salinities, winds, atmospheric pressures, and ocean currents." The seasonally
#' detrended data is available from NOAA for manual download from water level
#' station web pages, but apparently not through the API.  The `prov_meantrend`
#' data included in `SLRSIM` is an example, for Providence, Rhode Island.
#'
#' `SLRSIM` provides a function, `get_slr_trend()`, to access NOAA's long term
#' trend analysis result directly, including error terms and period of record.
#'
#' If `t_fit = FALSE`, the function fits an autocorrelation structure to
#' the sequential observations, ordered by `.dt`, without regard to how much
#' time has passed between subsequent observations. This approach works well
#' with evenly spaced dates and complete or nearly complete records, but is
#' less successful if there are many periods with missing observations.
#'
#' If `t_fit = TRUE`, the function fits an autocorrelation structure based on
#' the time coordinate (`.dt`). In this case, the time coordinate must be
#' integer valued. R's `Date` class qualifies, as it is stored as a whole
#' number of days under the hood. It is easy to build Dates, either by
#' converting POSIX times with `as.Date()` or by building them up from year,
#' month, and day information with `as.Date(paste(year, month, 15, sep = '-'))`.
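#'
#' For example (with hypothetical `year` and `month` vectors):
#'
#' ```
#' year  <- c(2019, 2019, 2020)
#' month <- c(11, 12, 1)
#' as.Date(paste(year, month, 15, sep = '-'))
#' unclass(as.Date('2020-01-15'))   # stored as a count of days since 1970-01-01
#' ```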
#'
#' By default, analysis is conducted by years (`.mode == 'year'`). The slope
#' estimate and standard error are scaled from days (based on R Dates) or
#' seconds (based on POSIXt objects) to (approximate) annual values by
#' multiplying by 365.2422 (days per year, for Dates) or 31556926 (seconds per
#' year, for POSIXt).
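#'
#' The two multipliers describe the same average year:
#'
#' ```
#' 365.2422 * 24 * 60 * 60   # 31556926.08 seconds, matching the POSIXt multiplier
#' ```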
#'
#' @param .data Source data frame.  Use NULL if no data frame is used and all
#'         data is passed from the enclosing environment.
#' @param .sl  The data variable (usually found in the data frame) showing sea
#'         level or mean sea level.  Must be a named variable, not an
#'         expression.
#' @param .dt   Data variable containing corresponding midpoint dates for the
#'        period of averaging used to calculate .sl. Must be a named variable,
#'        not an expression. Midpoint dates for a given month can be approximated
#'        with `as.Date(paste(year, month, 15, sep = '-'))`.
#' @param .ci Confidence level for the two sided confidence intervals reported
#'        for the estimated slope, based on a (possibly naive) normal
#'        approximation. Defaults to 0.95.
#' @param t_fit Should the underlying generalized least squares model's
#'        correlation structure be fit based on the time coordinate (.dt), or
#'        only on the sequence of observations in the data? Setting this to
#'        TRUE is safer if you are uncertain of the sequence of observations in
#'        the source data, or if there are significant missing values in the
#'        data, but it significantly slows model fitting.  Results tend to be
#'        very similar for complete or nearly complete data.
#' @param .mode one of c('year', 'unscaled') indicating whether the
#'        results should be scaled to years (the default) or left unscaled.
#'        Other sea level rate functions accept other `.mode` values.
#'        Scaling requires the time coordinate to inherit from R `Date` or
#'        `POSIXt` objects. Rates based on `Dates` are converted to annual
#'        values by multiplying by 365.2422 days / year.  Rates based on
#'        `POSIXt` objects are converted to annual values by multiplying slope
#'        estimates by 31556926 seconds / year (with a message about rounding
#'        error).
#' @return
#' A named vector, with the following components and attributes.
#'  __Vector Components__
#'  \describe{
#'    \item{Estimate}{The trend. Units depend on value of the `.mode` argument.
#'    For `.mode == 'year'`, units are per year, otherwise per unit of the .dt
#'    variable.}
#'    \item{Std_Err}{Standard Error of the trend.  Units as above.}
#'    \item{P_Val}{P value of the trend.}
#'    \item{Lower_CI}{Lower confidence interval for the estimate.}
#'    \item{Upper_CI}{Upper confidence interval for the estimate.}
#'    }
#'  __Attributes__
#'  \describe{
#'    \item{CI_P}{Nominal confidence level used to generate the confidence
#'    intervals.}
#'    \item{span}{How many calendar years are represented in the data?}
#'    \item{start}{First year included in the data.}
#'    \item{end}{Last year included in the data.}
#'    \item{cor_struct}{Whether the correlation structure was fit based on
#'    observation order ('Order-based') or on the time coordinate ('Time-based').}
#'}
#'
#' @family sea level rate functions
#' @export
#'
#' @examples
#'
#' prov_meantrend$MSL_mm <- prov_meantrend$MSL * 1000  # convert meters to mm
#'
#' # Basic Usage
#' slr_slope(prov_meantrend, MSL_mm, MidDate)
#'
#' # If the time coordinate is not a Date or POSIXt object, request unscaled results
#' slr_slope(prov_meantrend, MSL_mm, unclass(MidDate), .mode = 'unscaled')
#'
#' # One can also fit the model to the time coordinate explicitly
#' slr_slope(prov_meantrend, MSL_mm, MidDate, t_fit = TRUE)
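#'
#' # The result is a named vector with period-of-record details stored as
#' # attributes (a usage sketch):
#' res <- slr_slope(prov_meantrend, MSL_mm, MidDate)
#' res['Estimate']        # trend, in mm per year here
#' attr(res, 'span')      # number of calendar years represented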
#'
slr_slope <- function(.data, .sl, .dt,
                      .ci = 0.95, t_fit = FALSE,
                      .mode = c('year', 'unscaled')) {

  # Ugly argument checks, since `stopifnot()` doesn't provide nice error messages.
  stopifnot(is.data.frame(.data) | is.null(.data))
  stopifnot(inherits(t_fit, 'logical'))

  # We want to be able to accept arguments as unquoted names or quoted names.
  # `enquo()` captures the supplied expressions, which are then evaluated in
  # the context of `.data` (if provided) with `eval_tidy()`.
  sl_sym <- rlang::enquo(.sl)
  date_sym <- rlang::enquo(.dt)

  sl <- rlang::eval_tidy(sl_sym, .data)
  the_date <- rlang::eval_tidy(date_sym, .data)

  .mode = match.arg(.mode)

  have_dates <- inherits(the_date, c('Date','POSIXt'))

  if (.mode == 'year' && ! have_dates) {
    stop('.mode == "year" requires .dt to be a Date or POSIXt time.\n',
         'You can convert integer years to dates with ',
         '`as.Date(paste(year, "06", "15", sep = "-"))`')
  }

  # If we got dates or times, and `.mode == 'year'`, we want to scale to years,
  # otherwise we apply no scaling, and leave that to the user.
  if (.mode == 'year' && inherits(the_date, 'Date')) {
    multiplier <- 365.2422
  }
  else if (.mode == 'year' && inherits(the_date, 'POSIXt')) {
    # 365 days, 5 hours, 48 minutes, and 46 seconds
    multiplier <- 31556926  # seconds per year, on average
    message("Annual trends based on POSIXt times may be affected by ",
            "rounding. Consider re-expressing time coordinates as R Dates.")
  }
  else {
    multiplier <- 1
  }

  # Reorder the data by the time stamp. Only essential if t_fit is FALSE.
  ord <- order(the_date)
  sl <- sl[ord]
  the_date <- the_date[ord]

  year <- as.numeric(format(the_date, format = '%Y'))
  yearspan <- max(year) - min(year) + 1

  # Two-sided normal quantile for the requested confidence level
  # (e.g. ~1.96 for .ci = 0.95).
  q_val <- qnorm(1 - (1 - .ci) / 2)

  # We use generalized least squares so we can account for
  # autocorrelation in estimating the error of the slope.
  # The simpler model is for a complete or nearly complete time series. The
  # model based on the time coordinates is slower, and sometimes
  # crashes on large data sets.
  if (t_fit) {
    the_gls <- nlme::gls(sl ~ the_date,
                         correlation = nlme::corAR1(0.75, form = ~the_date))
  }
  else {
    the_gls <- nlme::gls(sl ~ the_date,
                         correlation = nlme::corAR1())
  }

  # TODO: Use direct computation rather than relying on `summary()$tTable`
  # everything except the p value can be accessed via `coef()` and `vcov()`.
  mod_sum <- as.data.frame(summary(the_gls)$tTable)
  EST <- mod_sum$Value[2] * multiplier   # scaled (e.g. per day -> per year)
  SE <- mod_sum$Std.Error[2] * multiplier
  P <- mod_sum$`p-value`[2]
  low_CI <- EST - q_val * SE
  high_CI <- EST + q_val * SE
  cor_struct <- c('Order-based', 'Time-based')[t_fit + 1]

  result <- structure(c(Estimate = EST, Std_Err = SE, P_Val = P,
              Lower_CI =  low_CI, Upper_CI = high_CI), CI_P = .ci,
              span = yearspan, start = min(year), end = max(year),
              cor_struct = cor_struct)
  return(result)
}