tapas: Trend And PeAkS analysis in paleo-ecological time series

Documented in cpts_ar

#' Identifying changes in mean and variance of accumulation-rate records
#'
#' @description
#' Determines zone boundaries for single proxy accumulation-rate records based
#'   on change-point analysis and checks for the influence of
#'   sediment-accumulation rates on the detected change points.
#'
#' @param serie A list returned from the \code{pretreatment_data()}
#'        function. NB: the \code{pretreatment_data()} function must be run
#'        with the argument \code{out = "accI"}.
#' @param series_name A character string giving the name of the data set,
#'        or typically the site name (optional).
#' @param proxy A character string giving the name (typically the column name)
#'        of the variable that should be analysed.
#' @param n_rand Positive integer giving the number of iterations used to check
#'        for change-points in random data sets (by default
#'        \code{n_rand = 1000}).
#' @param bootstrap Logical. Determines how the random data sets are generated.
#'        If TRUE, then each of the \code{n_rand} random data sets are
#'        generated by random sampling with replacement from the input data.
#'        If FALSE, then the code generates in each of the \code{n_rand}
#'        iterations a vector of random concentration values having a
#'        user-specified distribution (based on \code{rand_ds_distr}).
#' @param rand_ds_distr A character string giving the distribution used
#'        to generate the random data sets when \code{bootstrap = FALSE}.
#'        Currently only "uniform" is supported (default). The random values
#'        range between the minimum and the maximum concentration values of
#'        the re-sampled series.
#' @param max_cpts Positive integer giving the maximum number of change points
#'        to search for when \code{meth_cpt = "BinSeg"}. By default
#'        \code{max_cpts = 50}).
#' @param n_screen Positive number giving the minimum frequency of occurrence of
#'        change-points in the random data sets to be validated. By default,
#'        \code{n_screen = 0.025}. Thus, with \code{n_rand = 1000} this equals
#'        a 2.5% chance of occurrence.
#' @param tau Positive integer giving the tolerance search window (in years)
#'        for change points in the 'true proxy-accumulation rate' record. By
#'        default \code{tau} = the number of years the raw record was re-sampled
#'        to (as of variable \code{yr.interp} that is written in the list
#'        generated with the \code{pretreatment_data()} function).
#' @param meth_cpt As of \code{changepoint::cpt.meanvar()}: choice of
#'        "AMOC", "PELT", "BinSeg", or "SegNeigh". But note that "SegNeigh" is
#'        computationally slow, use other methods instead.
#' @param pen_method Refer to \code{changepoint::cpt.meanvar()} for
#'        argument \code{penalty}. By default, \code{pen_method = "Manual"}.
#' @param pen_val Refer to \code{changepoint::cpt.meanvar()} for the
#'        argument \code{pen.value}. By default, \code{pen_val = "4*log(n)"},
#'        where \code{n} is equal to the number of observations.
#' @param t_stat Refer to \code{changepoint::cpt.meanvar()} for the
#'        argument \code{test.stat}. Currently only "Normal" is supported
#'        (default).
#' @param minseglen Positive integer giving the minimum number of observations
#'        within zones. By default \code{minseglen = 2}, see also
#'        \code{changepoint::cpt.meanvar()}.
#'
#' @details This function is used to determine if any of the change points of
#'        a proxy-accumulation record is strongly influenced by the modeled
#'        sediment-accumulation rates. The function performs a change-point
#'        analysis with the 'true' proxy-accumulation record (based on
#'        the effective concentration values), and in addition it performs a
#'        change-point analysis on a set (by default \code{n_rand = 1000}) of
#'        non-sense proxy-accumulation records (based on randomly drawn
#'        concentration values). The change points of the latter records
#'        (hereafter, non-sense change points) are then first
#'        screened based on their frequency of occurrence (\code{n_screen}, as
#'        some change points may occur by chance). Thereafter, the change points
#'        are screened further to retain
#'        those that separate zones where the mean proxy-accumulation rate
#'        changes in the same direction as the mean sediment-accumulation rate.
#'        The rationale being that if a change point of the 'true
#'        proxy-accumulation record' matches a non-sense change point
#'        (with a tolerance search window equal to \code{tau}), and
#'        if both records change in the same direction (say, both the
#'        proxy-accumulation rate and the sediment-accumulation rate increase
#'        across a change point), then the change point of the
#'        proxy-accumulation rate may be strongly influenced by
#'        sediment-accumulation rates (proxy_ar = proxy_conc * sed_ar).
#'
#' @returns
#' a list with the depths and ages of the change points, the mean values
#'        for the zones, as well as the settings of the change-point analysis.
#'
#' In addition, a diagnostic plot consisting of three panels is sent to the
#'   device.
#' The plot includes (from top to bottom):
#' - the re-sampled concentration values,
#' - the sediment-accumulation rate, and
#' - the proxy accumulation-rate record.
#'
#' The latter panel may show:
#' - change points of the proxy accumulation-rate record (blue vertical lines),
#' - mean proxy accumulation-rate values for the zones (red horizontal lines),
#' - non-sense change points that match 'true change points' (red circles),
#' - non-sense change points that do not match 'true change points' (green
#'    circles),
#'
#' @seealso \code{\link{cpt.meanvar}}, \code{\link{pretreatment_data}}
#'
#' @references
#' Finsinger W, Magyari EK, Fevre J, Orban I, Pal I, Vincze I,
#'    Hubay K, Birks HH, Braun M, Toth M (2016) Holocene fire regimes near the
#'    treeline in the Retezat Mts. (Southern Carpathians). Quaternary
#'    International, *477*, 94-105. 10.1016/j.quaint.2016.04.029.
#'
#' Killick R, Eckley IA (2014). changepoint: An R Package for Changepoint
#'    Analysis. Journal of Statistical Software, *58*(3), 1-19.
#'
#' @examples
#' \dontrun{
#' co <- tapas::co_char_data
#' co_i <- tapas::pretreatment_data(co, out = "accI")
#'
#' ## change-point analysis with default options
#' co_i_cpts <- cpts_ar(co_i, proxy = "char")
#'
#' ## change-point analysis with higher penalty value
#' co_i_cpts_8logn <- cpts_ar(co_i, proxy = "char", pen_val = "8*log(n)")
#'
#' ## Change-point analysis with the 'red noise' record (no change point)
#' rdn <- tapas::red_noise
#' rdn_i <- tapas::pretreatment_data(rdn)
#' rdn_i_cpts <- tapas::cpts_ar(rdn_i, proxy = "char")
#'
#'
#' ## As in the example above, but introducing a step-wise increase in the
#' ## sediment-accumulation rate [cm/yr]
#' ## NB: sar = 1/sediment-deposition time [yr/cm]
#' sdt <- c(rep_len(10, length.out = 100), rep_len(25, length.out = 360))
#' a_bot <- cumsum(sdt)
#' rdn2 <- rdn[1:length(a_bot), ]
#' rdn2$age_bot <- a_bot
#' rdn2$age_top <- a_bot - sdt
#' rdn2_i <- tapas::pretreatment_data(rdn2, yrInterp = 25)
#' rdn2_i_cpts <- tapas::cpts_ar(rdn2_i, proxy = "char")
#' }
#'
#' @author Walter Finsinger
#'
#' @importFrom changepoint cpt.meanvar cpts
#' @importFrom graphics segments
#' @importFrom stats approx runif
#'
#' @export cpts_ar

cpts_ar <- function(serie, series_name = NULL, proxy = NULL, n_rand = 1000,
                    bootstrap = FALSE, rand_ds_distr = "uniform",
                    max_cpts = 50, n_screen = 0.025, tau = NULL,
                    meth_cpt = "BinSeg",
                    pen_method = "Manual", pen_val = "4*log(n)",
                    t_stat = "Normal", minseglen = 2) {


  # Initial check up ----------------------------------------------------------

  ## Put the user’s layout settings back in place when the function is done
  opar <- par("mfrow", "mar", "oma", "cex")
  on.exit(par(opar))

  ## Check if input list contains proxy accumulation-rate values
  if (serie$out != "accI") {
    stop("Fatal error: set argument out = 'accI' when running the",
         "tapas::pretreatment_data() function.")
  }

  ## Check if the variable name (proxy) is coherent
  proxies_options <- colnames(serie$raw$series[ 2:dim(serie$raw$series)[2] ])

  if (is.null(proxy) & length(proxies_options) > 1) {
    stop(paste0("Fatal error: Unrecognized variable name (proxy = ). ",
                "Accepted names: ", paste(proxies_options, collapse = ", "), "."
    ))
  }
  if (is.null(proxy) & length(proxies_options) == 1) {
    proxy <- colnames(serie$raw$series) [2]
  }
  if (!(proxy %in% proxies_options)) {
    stop("Unrecognized variable name: '", proxy, "'. ",
         "Accepted names: ", paste(proxies_options, collapse = ", "), ".")
  }

  # Check if other arguments are coherent
  if (bootstrap == FALSE & !rand_ds_distr == "uniform") {
    stop("If bootstrap == FALSE, set rand_ds_distr = 'uniform'")
  }

  if (is.null(tau) == TRUE) {
    tau <- serie$int$yr.interp
  }

  # Extract variables from input list -----------------------------------------
  cm_i <- serie$int$cmI
  age_i <- serie$int$series.int$age
  conc_i <- serie$int$series.conI[[proxy]]
  ar_i <- serie$int$series.int[[paste0(proxy, "AR")]]

  serie_length <- dim(serie$int$series.int)[1]
  conc_i_range <- c(min(conc_i), max(conc_i))

  ## Get sediment-accumulation rates of the pre-treated series
  sed_ar_i <- c(diff(cm_i) / diff(age_i), NA)

  # Fill in some missing values by approximation
  sed_ar_i <- stats::approx(cm_i, sed_ar_i, xout = cm_i, rule = 2)$y

  # If there are still NA values left
  if (any(is.na(sed_ar_i))) {
    sed_ar_i_na <- which(is.na(sed_ar_i))
    sed_ar_i[sed_ar_i_na] <- sed_ar_i[min(sed_ar_i_na) - 1]
  }


  # Start the Change-point analysis -------------------------------------------

  ## Prepare space to store data generated in each iteration of the loop
  cpts_rand <- matrix()

  #### CPT analysis of random data sets ---------------------------------------
  for (i in 1:n_rand) {      # Loop cpt analyses through rand data sets

    ## With bootstrapped Concentration values
    if (bootstrap == TRUE) {
      conc_rand <- sample(conc_i, size = serie_length, replace = T)
    }

    ## With random concentration values generated from a
    ## theoretical distribution
    if (bootstrap == FALSE) {

      ## Generate random record from a uniform distribution
      if (rand_ds_distr == "uniform") {
        conc_rand <- stats::runif(n = serie_length,
                                  min = conc_i_range[1], max = conc_i_range[2])
      }
    }

    ## Calculate proxy-accumulation rates based on the sediment-accumulation
    ## rate from the depth-age model
    rand_ar_i <- conc_rand * sed_ar_i

    ### Run change-point analysis with the random proxy-accumulation rate
    cpt_ar_rand <- changepoint::cpt.meanvar(rand_ar_i,
                                            method = meth_cpt,
                                            Q = max_cpts,
                                            test.stat = t_stat,
                                            penalty = pen_method,
                                            pen.value = pen_val,
                                            minseglen = minseglen)
    cpts_rand <- c(cpts_rand, changepoint::cpts(cpt_ar_rand))
  }


  ## Screen cpts from rand series ---------------------------------------------
  # Keep only those that occur more often than (n_rand * n_screen)
  cpts_rand_smr <- tabulate(cpts_rand, nbins = max(1, cpts_rand, na.rm = T))
  cpts_rand <- which(cpts_rand_smr > (n_rand * n_screen))


  ## Run cpt analysis with true CHAR record -----------------------------------
  cpt_ar <- changepoint::cpt.meanvar(ar_i,
                                     method = meth_cpt,
                                     Q = max_cpts,
                                     test.stat = t_stat,
                                     penalty = pen_method,
                                     pen.value = pen_val,
                                     minseglen = minseglen)

  # Gather data from the cpt_ar list
  cpts_positions <- changepoint::cpts(cpt_ar)
  cpts_breaks <- c(1, changepoint::cpts(cpt_ar), length(ar_i))
  cpts_depths <- cm_i[changepoint::cpts(cpt_ar)]
  cpts_ages <- age_i[changepoint::cpts(cpt_ar)]
  cpts_means <- cpt_ar@param.est$mean

  # Prepare a data.frame with the means and ages of the change points in the
  # 'true' proxy-accumulation record for each zone
  cpts_means_df <- data.frame(c(min(age_i), cpts_ages),
                              c(cpts_ages, max(age_i)),
                              cpts_means)
  colnames(cpts_means_df) <- c("top_age", "bot_age", "mean")


  # Further screen the cpts_rand considering the following conditions:
  # - they should occur around the true change points;
  # - the sediment-accumulation rate and the proxy-accumulation rate should
  #    change in the same direction:

  cpts_rand_val <- NULL # create an empty vector

  # Check every cpts_rand if it occurs within a search window of +/- tau years
  # from the 'true' cpts:
  if (length(cpts_rand) > 0) {
    for (z in 1:length(cpts_rand)) {
      # Check if the cpts_rand[z] occurs in the true record
      cpts_rand_z <- (age_i[cpts_rand[z]] < (age_i[cpts_positions] + tau) &
                        age_i[cpts_rand[z]] > (age_i[cpts_positions] - tau))
      if (any(cpts_rand_z) == TRUE) {
        cpts_sel <- which(cpts_rand_z == TRUE)
        # if (length(cpts_sel) > 1) {
        #   stop('')
        #
        # }
        # get the mean proxy-accumulation rates for the zones preceding and
        #   following the cpt
        v_pre_change <- cpts_means[cpts_sel + 1]
        v_post_change <- cpts_means[cpts_sel]
        # get the mean sed_ar_i for the two zones
        sed_ar_pre_change <- mean(
          sed_ar_i[cpts_breaks[cpts_sel + 1]:cpts_breaks[cpts_sel + 2]])
        sed_ar_post_change <- mean(
          sed_ar_i[cpts_breaks[cpts_sel]:cpts_breaks[cpts_sel + 1]])
        # Check if v and sed_ar change in the same direction across the cpt
        v_sign <- sign(v_pre_change - v_post_change)
        sed_ar_sign <- sign(sed_ar_pre_change - sed_ar_post_change)
        if (v_sign * sed_ar_sign > 0) {
          cpts_rand_val[z] <- cpts_rand[z]
        }
      }
    }
  }

  # Make diagnostic plot  -----------------------------------------------------
  par(mfrow = c(3,1), mar = c(0.5,5,0.5,1), oma = c(4,1,3,1), cex = 0.9)
  x_lim <- rev(range(c(min(age_i), max(age_i))))
  y_lim <- c(0, 1.1*max(ar_i))
  plot(age_i, conc_i, type = "s",
       ylab = paste0(proxy, " concentration\nresampled"),
       xlim = x_lim, axes = FALSE)
  axis(1, labels = FALSE, at = seq(0, max(age_i), 1000))
  axis(2)
  plot(age_i, sed_ar_i, type = "l",
       ylab = "Sediment-accumulation\nrate (cm/yr)", xaxt = "n",
       xlim = x_lim, ylim = c(0, 1.1*max(sed_ar_i)), lwd = 1.5,
       axes = FALSE)
  axis(1, labels = FALSE, at = seq(0, max(age_i), 1000))
  axis(2)
  plot(age_i, ar_i, type = "s", xlim = x_lim, ylim = y_lim,
       ylab = paste(proxy, "AR"), axes = FALSE)
  axis(1, labels = TRUE, at = seq(0, max(age_i), 1000))
  axis(2)
  graphics::segments(x0 = cpts_means_df$top_age, y0 = cpts_means_df$mean,
                     x1 = cpts_means_df$bot_age, y1 = cpts_means_df$mean,
                     col = "red", lwd = 2)
  abline(v = cpts_ages, lwd = 2, col = "blue")
  points(age_i[cpts_rand],
         rep_len(max(ar_i), length.out = length(cpts_rand)),
         col = "green")
  points(age_i[cpts_rand_val],
         rep_len(max(ar_i), length.out = length(cpts_rand_val)),
         col = "red", pch = 19)
  mtext(("Age (cal yrs BP)"), side = 1, line = 2.5, cex = 0.9)


  # Return output -------------------------------------------------------------
  output <- structure(list(depths_cpts = cpts_depths,
                           ages_cpts = cpts_ages,
                           cpts_means = cpts_means_df,
                           series_name = series_name, proxy = proxy,
                           n_screen = n_screen, tau = tau,
                           pen_val = pen_val,
                           meth_cpt = meth_cpt,
                           pen_method = pen_method,
                           t_stat = t_stat,
                           minseglen = minseglen))
  class(output) <- "tapas_cpts"
  return(output)
}