or Outliers

Documented in fit_AR1_Gaussian impute_AR1_Gaussian impute_rolling_AR1_Gaussian

#' @title Fit Gaussian AR(1) model to time series with missing values and/or outliers
#'
#' @description Estimate the parameters of a univariate Gaussian AR(1) model 
#'              to fit the given time series with missing values and/or outliers. 
#'              For multivariate time series, the function will perform a 
#'              number of individual univariate fittings without attempting 
#'              to model the correlations among the time series.
#'              If the time series does not contain missing values, the 
#'              maximum likelihood (ML) estimation is done in one shot.
#'              With missing values, the iterative EM algorithm is employed 
#'              for the estimation until converge is achieved.
#'
#' @param y Time series object coercible to either a numeric vector or numeric matrix 
#'          (e.g., \code{zoo} or \code{xts}) with missing values denoted by \code{NA}. 
#' @param random_walk Logical value indicating if the time series is assumed to be a random walk so that \code{phi1 = 1} 
#'                    (default is \code{FALSE}).
#' @param zero_mean Logical value indicating if the time series is assumed zero-mean so that \code{phi0 = 0} 
#'                  (default is \code{FALSE}).
#' @param remove_outliers Logical value indicating whether to detect and remove outliers.
#' @param outlier_prob_th Threshold of probability of observation to declare an outlier (default is \code{1e-3}).
#' @param verbose Logical value indicating whether to output messages (default is \code{TRUE}).
#' @param return_iterates Logical value indicating if the iterates are to be returned (default is \code{FALSE}).
#' @param return_condMeanCov Logical value indicating if the conditional mean and covariance matrix of the 
#'                           time series (excluding the leading and trailing missing values) given the observed data are to be returned (default is \code{FALSE}).
#' @param tol Positive number denoting the relative tolerance used as stopping criterion (default is \code{1e-8}).
#' @param maxiter Positive integer indicating the maximum number of iterations allowed (default is \code{100}).
#' 
#' @return If the argument \code{y} is a univariate time series (i.e., coercible to a numeric vector), then this 
#'         function will return a list with the following elements:
#' \item{\code{phi0}}{The estimate for \code{phi0} (real number).}
#' \item{\code{phi1}}{The estimate for \code{phi1} (real number).}
#' \item{\code{sigma2}}{The estimate for \code{sigma^2} (positive number).}
#' \item{\code{phi0_iterates}}{Numeric vector with the estimates for \code{phi0} at each iteration
#'                            (returned only when \code{return_iterates = TRUE}).}
#' \item{\code{phi1_iterates}}{Numeric vector with the estimates for \code{phi1} at each iteration
#'                            (returned only when \code{return_iterates = TRUE}).}
#' \item{\code{sigma2_iterates}}{Numeric vector with the estimates for \code{sigma^2} at each iteration
#'                              (returned only when \code{return_iterates = TRUE}).}
#' \item{\code{f_iterates}}{Numeric vector with the objective values at each iteration
#'                          (returned only when \code{return_iterates = TRUE}).}
#' \item{\code{cond_mean_y}}{Numeric vector (of same length as argument \code{y}) with the conditional mean of the time series 
#'                           (excluding the leading and trailing missing values)
#'                           given the observed data (returned only when \code{return_condMeanCov = TRUE}).}
#' \item{\code{cond_cov_y}}{Numeric matrix (with number of columns/rows equal to the length of the argument \code{y})
#'                          with the conditional covariance matrix of the time series (excluding the leading and trailing missing values) 
#'                          given the observed data (returned only when \code{return_condMeanCov = TRUE}).}
#' \item{\code{index_miss}}{Indices of missing values imputed.}
#' \item{\code{index_outliers}}{Indices of outliers detected/corrected.}
#' 
#' If the argument \code{y} is a multivariate time series (i.e., with multiple columns and coercible to a numeric matrix), 
#' then this function will return a list with each element as in the case of univariate \code{y} corresponding to each
#' of the columns (i.e., one list element per column of \code{y}), with the following additional elements that combine the 
#' estimated values in a convenient vector form:
#' \item{\code{phi0_vct}}{Numeric vector (with length equal to the number of columns of \code{y})
#'                        with the estimates for \code{phi0} for each of the univariate time series.}
#' \item{\code{phi1_vct}}{Numeric vector (with length equal to the number of columns of \code{y})
#'                        with the estimates for \code{phi1} for each of the univariate time series.}
#' \item{\code{sigma2_vct}}{Numeric vector (with length equal to the number of columns of \code{y})
#'                        with the estimates for \code{sigma2} for each of the univariate time series.}
#' 
#' @author Junyan Liu and Daniel P. Palomar
#' 
#' @seealso \code{\link{impute_AR1_Gaussian}}, \code{\link{fit_AR1_t}}
#' 
#' @references
#' R. J. Little and D. B. Rubin, Statistical Analysis with Missing Data, 2nd ed. Hoboken, N.J.: John Wiley & Sons, 2002.
#' 
#' J. Liu, S. Kumar, and D. P. Palomar, "Parameter estimation of heavy-tailed AR model with missing 
#' data via stochastic EM," IEEE Trans. on Signal Processing, vol. 67, no. 8, pp. 2159-2172, 15 April, 2019. 
#' 
#' @examples 
#' library(imputeFin)
#' data(ts_AR1_Gaussian)
#' y_missing <- ts_AR1_Gaussian$y_missing
#' fitted <- fit_AR1_Gaussian(y_missing)
#' 
#' @import zoo
#' @export
fit_AR1_Gaussian <- function(y, random_walk = FALSE, zero_mean = FALSE, remove_outliers = FALSE, outlier_prob_th = 1e-3, 
                                verbose = TRUE,
                                return_iterates = FALSE, return_condMeanCov = FALSE,
                                tol = 1e-8, maxiter = 100) {
  # error control
  if (!is.matrix(try(as.matrix(y), silent = TRUE))) stop("\"y\" must be coercible to a vector or matrix.")
  if (tol <= 0) stop("\"tol\" must be greater than 0.")
  if (maxiter < 1) stop("\"maxiter\" must be greater than 1.")
  
  # manage multiple columns
  if (NCOL(y) > 1) {
    estimation_list <- apply(y, MARGIN = 2, FUN = fit_AR1_Gaussian, random_walk, zero_mean, remove_outliers, outlier_prob_th, verbose = FALSE,
                                                                    return_iterates, return_condMeanCov, tol, maxiter)
    phi0_vct   <- unlist(lapply(estimation_list, function(x) x$phi0))
    phi1_vct   <- unlist(lapply(estimation_list, function(x) x$phi1))
    sigma2_vct <- unlist(lapply(estimation_list, function(x) x$sigma2))
    # output messages
    if (verbose)
      for (i in 1:length(estimation_list))
        message(names(estimation_list)[i], ": ", 
                length(estimation_list[[i]]$index_miss), " inner missing values and ", 
                length(estimation_list[[i]]$index_outliers), " outliers detected.")
    return(c(estimation_list, list("phi0_vct"   = phi0_vct,
                                   "phi1_vct"   = phi1_vct,
                                   "sigma2_vct" = sigma2_vct)))
  }
  
  #
  #   code for y single-column
  #
  # error control
  if (!is.numeric(y)) stop("\"y\" only allows numerical or NA values.")
  if (sum(!is.na(y)) < 5L) stop("Each time series in \"y\" must have at least 5 observations.")
  y_name <- colnames(y)  
  y <- as.numeric(y)

  # outlier detection and substitution with NAs
  if(remove_outliers) {
    fitted_with_outliers <- if (!any_inner_NA(y)) fit_AR1_Gaussian_complete(y[!is.na(y)], random_walk, zero_mean)
                            else fit_AR1_Gaussian(y, random_walk, zero_mean, remove_outliers = FALSE, outlier_prob_th, verbose = FALSE,
                                                  return_iterates, return_condMeanCov, tol, maxiter)
    idx_outliers <- find_outliers_AR1_Gaussian(y, fitted_with_outliers, outlier_prob_th)
    if (!is.null(idx_outliers))
      y[idx_outliers] <- NA  # substitute outliers with NAs
  }
  
  # remove the missing values at the head and tail of the time series since they do not affect the estimation result
  index_obs <- which(!is.na(y))
  y <- y[min(index_obs):max(index_obs)]
  idx_offset <- min(index_obs) - 1L
  index_obs <- which(!is.na(y))

  # estimation (after possibly setting outliers to NA)
  if (!anyNA(y))   # trivial case with no NAs
    results <- fit_AR1_Gaussian_complete(y, random_walk, zero_mean)
  else {
    # if there are NAs find the missing blocks
    n <- index_obs <- index_miss <- n_obs <- y_obs <- delta_index_obs <- n_block <- n_in_block <- 
      first_index_in_block <- last_index_in_block <- previous_obs_before_block <- next_obs_after_block <- NULL
    list2env(findMissingBlock(y), envir = environment())

    # objective function, the observed data log-likelihood
    if (return_iterates)
      obj <- function(phi0, phi1, sigma2) {
        sum_phi1 <- sum2_phi1 <- rep(NA, n_obs-1)
        for (i in 1:(n_obs-1)) {
          sum_phi1[i] <- sum(phi1^(0:(delta_index_obs[i] - 1)))
          sum2_phi1[i] <- sum(phi1^((0:(delta_index_obs[i] - 1))*2))
        }
        return(-sum((y_obs[2:n_obs] - phi1^delta_index_obs * y_obs[1:(n_obs - 1)] 
                     - phi0 * sum_phi1)^2 / (2 * sum2_phi1 * sigma2))
               -0.5 * (n_obs - 1) * log(sigma2) - 0.5 * sum(log(sum2_phi1)))
      }
    
    # initialize the estimates
    phi0 <- phi1 <- sigma2 <- f <- c()
    estimation_heuristic <- fit_AR1_Gaussian_heuristic(y, index_miss, random_walk, zero_mean)
    phi1[1] <- estimation_heuristic$phi1
    phi0[1] <- estimation_heuristic$phi0
    sigma2[1] <- estimation_heuristic$sigma2
    if (return_iterates) f[1] <- obj(phi0[1], phi1[1], sigma2[1])    
    
    # loop
    for (k in 1:maxiter) {
      # E-step
      # computation of mean and covariance of y conditional on all the observed data
      cond <- condMeanCov(y_obs, index_obs, n, n_block, n_in_block, 
                          first_index_in_block, last_index_in_block, previous_obs_before_block, next_obs_after_block, 
                          phi0[k], phi1[k], sigma2[k], full_cov = FALSE)
      # computation of sufficient statistics
      s_y2   <- sum(cond$mean_y[-1])
      s_y1   <- sum(cond$mean_y[-n])
      s_y2y2 <- sum(cond$cov_y_diag[-1] + cond$mean_y[-1]^2)  #slow version: s_y2y2 <- sum(diag(cond_cov_y[2:n, 2:n]) + cond_mean_y[2:n]^2)
      s_y1y1 <- sum(cond$cov_y_diag[-n]  + cond$mean_y[-n]^2)  #slow version: s_y1y1 <- sum(diag(cond_cov_y[1:(n - 1), 1:(n - 1)])  + cond_mean_y[1:(n - 1)]^2)
      s_y2y1 <- sum(cond$cov_y_diag1 + cond$mean_y[-1] * cond$mean_y[-n])  #slow version: s_y2y1 <- sum(diag(cond_cov_y[1:(n - 1), 2:n]) + cond_mean_y[2:n] * cond_mean_y[1:(n - 1)])
      
      # M-step (update the estimates)
      if (!random_walk && !zero_mean) {
        phi1[k + 1] <- (s_y2y1 - s_y2 * s_y1 / (n - 1)) / (s_y1y1 - s_y1 * s_y1 / (n - 1)) 
        phi0[k + 1] <- (s_y2 - phi1[k +1] * s_y1) / (n - 1)
      } else if (random_walk && !zero_mean){
        phi1[k + 1] <- 1
        phi0[k + 1] <- (s_y2 - s_y1) / (n - 1)
      } else if (!random_walk && zero_mean){
        phi1[k + 1] <- s_y2y1 / s_y1y1 
        phi0[k + 1] <- 0
      } else{
        phi1[k + 1] <- 1
        phi0[k + 1] <- 0
      }
      sigma2[k + 1] <- ((s_y2y2 + (n - 1) * phi0[k + 1]^2 + phi1[k + 1]^2 * s_y1y1 -
                           2 * phi0[k + 1] * s_y2 - 2 * phi1[k + 1] * s_y2y1 + 2 * phi0[k + 1] * phi1[k + 1] * s_y1) / (n - 1))
      
      # computation of the objective function
      if (return_iterates) f[k + 1] <- obj(phi0[k + 1], phi1[k + 1], sigma2[k + 1])
      
      # termination criterion
      if (abs(phi0[k + 1] - phi0[k]) <= tol * (abs(phi0[k + 1]) + abs(phi0[k]))/2
          && abs(phi1[k + 1] - phi1[k]) <= tol * (abs(phi1[k + 1]) + abs(phi1[k]))/2
          && abs(sigma2[k + 1] - sigma2[k]) <= tol * (abs(sigma2[k + 1]) + abs(sigma2[k]))/2) 
        break
    }
    
    # collect results to return
    results <- list("phi0"   = phi0[k + 1],
                    "phi1"   = phi1[k + 1],
                    "sigma2" = sigma2[k + 1])
    if (return_iterates) 
      results <- c(results, list("phi0_iterates"   = phi0,
                                 "phi1_iterates"   = phi1,
                                 "sigma2_iterates" = sigma2))
    if (return_condMeanCov) {
      cond <- condMeanCov(y_obs, index_obs, n, n_block, n_in_block, 
                          first_index_in_block, last_index_in_block, previous_obs_before_block, next_obs_after_block, 
                          phi0[k+1], phi1[k+1], sigma2[k+1], full_cov = TRUE)
      results <- c(results, list("cond_mean_y" = cond$mean_y,
                                 "cond_cov_y"  = cond$cov_y))
    }
  }
  results <- c(results, list("index_miss" = if (sum(is.na(y)) == 0) NULL
                                            else which(is.na(y)) + idx_offset))  
  if(!remove_outliers) idx_outliers <- NULL
  results <- c(results, list("index_outliers" = if(is.null(idx_outliers)) NULL
                                                else idx_outliers + idx_offset))
  # output message
  if (verbose)
    message(y_name, ": ", 
            length(results$index_miss), " inner missing values and ", 
            length(results$index_outliers), " outliers detected.") 
  return(results)
}




# trivial case with no NAs (no need for EM algorithm)
fit_AR1_Gaussian_complete <- function(y, random_walk = FALSE, zero_mean = FALSE) {
  if (anyNA(y)) stop("Function fit_AR1_Gaussian_complete() cannot accept NAs.")  
  
  n <- length(y)
  s_y2 <- sum(y[-1])
  s_y1 <- sum(y[-n])
  s_y2y2 <- sum(y[-1]^2) 
  s_y1y1 <- sum(y[-n]^2) 
  s_y2y1 <- sum(y[-1]*y[-n])
  if (!random_walk && !zero_mean) {
    phi1 <- (s_y2y1 - s_y2 * s_y1 / (n - 1)) / (s_y1y1 - s_y1 * s_y1 / (n - 1)) 
    phi0 <- (s_y2 - phi1 * s_y1) / (n - 1)
  } else if (random_walk && !zero_mean){
    phi1 <- 1
    phi0 <- (s_y2 - s_y1) / (n - 1)
  } else if (!random_walk && zero_mean){
    phi1 <- s_y2y1 / s_y1y1 
    phi0 <- 0
  } else{
    phi1 <- 1
    phi0 <- 0
  }
  sigma2 <- (( s_y2y2 + (n - 1) * phi0^2 + phi1^2 * s_y1y1 
               - 2 * phi0 * s_y2 - 2 * phi1 * s_y2y1 + 2 * phi0 * phi1 * s_y1 ) / (n - 1))
  if (!(sigma2 > 0))
    stop("Time series is constant and should be removed.")
  return(list("phi0"   = phi0,
              "phi1"   = phi1,
              "sigma2" = sigma2))
}




#' @title Impute missing values of time series based on a Gaussian AR(1) model
#'
#' @description Impute inner missing values (excluding leading and trailing ones) 
#'              of time series by drawing samples from the conditional distribution 
#'              of the missing values given the observed data based on a Gaussian 
#'              AR(1) model as estimated with the function \code{\link{fit_AR1_Gaussian}}. 
#'              Outliers can be detected and removed.
#' 
#' @inheritParams fit_AR1_Gaussian
#' @param n_samples Positive integer indicating the number of imputations (default is \code{1}).
#' @param return_estimates Logical value indicating if the estimates of the model parameters 
#'                         are to be returned (default is \code{FALSE}).
#'                         
#' @return By default (i.e., for \code{n_samples = 1} and \code{return_estimates = FALSE}), 
#'         the function will return an imputed time series of the same class and dimensions 
#'         as the argument \code{y} with one new attribute recording the locations of missing 
#'         values (the function \code{\link{plot_imputed}} will make use of such information
#'         to indicate the imputed values), as well as locations of outliers removed.
#'         
#'         If \code{n_samples > 1}, the function will return a list consisting of \code{n_sample} 
#'         imputed time series with names: y_imputed.1, y_imputed.2, etc. 
#'         
#'         If \code{return_estimates = TRUE}, in addition to the imputed time series \code{y_imputed}, 
#'         the function will return the estimated model parameters:
#'         \item{\code{phi0}}{The estimate for \code{phi0} (numeric scalar or vector depending 
#'                            on the number of time series).}
#'         \item{\code{phi1}}{The estimate for \code{phi1} (numeric scalar or vector depending 
#'                            on the number of time series).}
#'         \item{\code{sigma2}}{The estimate for \code{sigma2} (numeric scalar or vector depending 
#'                              on the number of time series).}
#' 
#' @author Junyan Liu and Daniel P. Palomar
#' 
#' @seealso \code{\link{plot_imputed}}, \code{\link{fit_AR1_Gaussian}}, \code{\link{impute_AR1_t}}
#' 
#' @references
#' R. J. Little and D. B. Rubin, Statistical Analysis with Missing Data, 2nd ed. Hoboken, N.J.: John Wiley & Sons, 2002.
#' 
#' J. Liu, S. Kumar, and D. P. Palomar, "Parameter estimation of heavy-tailed AR model with missing 
#' data via stochastic EM," IEEE Trans. on Signal Processing, vol. 67, no. 8, pp. 2159-2172, 15 April, 2019. 
#' 
#' @examples
#' library(imputeFin)
#' data(ts_AR1_Gaussian) 
#' y_missing <- ts_AR1_Gaussian$y_missing
#' y_imputed <- impute_AR1_Gaussian(y_missing)
#' plot_imputed(y_imputed)
#' 
#' @import zoo
#' @import MASS
#' @export
impute_AR1_Gaussian <- function(y, n_samples = 1, random_walk = FALSE, zero_mean = FALSE, 
                                remove_outliers = FALSE, outlier_prob_th = 1e-3, 
                                verbose = TRUE, return_estimates = FALSE, 
                                tol = 1e-10, maxiter = 100) { 
  # error control
  if (!is.matrix(try(as.matrix(y), silent = TRUE))) stop("\"y\" must be coercible to a vector or matrix.")
  if (round(n_samples)!=n_samples | n_samples<=0) stop("\"n_samples\" must be a positive integer.")

  # manage multiple columns  
  if (NCOL(y) > 1) {
    results_list <- lapply(c(1:NCOL(y)), FUN = function(i) {
      impute_AR1_Gaussian(y[, i, drop = FALSE], n_samples, random_walk, zero_mean, remove_outliers, outlier_prob_th, verbose = FALSE, 
                          return_estimates, tol, maxiter)
      })
    names(results_list) <- colnames(y)
    if (n_samples == 1 && !return_estimates) {  # return directly a matrix like y
      results <- do.call(cbind, results_list)
      attr(results, "index_miss")     <- lapply(results_list, FUN = function(res) attr(res, "index_miss"))
      attr(results, "index_outliers") <- lapply(results_list, FUN = function(res) attr(res, "index_outliers"))
    } else if (n_samples == 1 && return_estimates) {
      results <- do.call(mapply, c("FUN" = cbind, results_list, "SIMPLIFY" = FALSE))
      attr(results$y_imputed, "index_miss")     <- lapply(results_list, FUN = function(res) attr(res$y_imputed, "index_miss"))
      attr(results$y_imputed, "index_outliers") <- lapply(results_list, FUN = function(res) attr(res$y_imputed, "index_outliers"))
    } else {
      results <- do.call(mapply, c("FUN" = cbind, results_list, "SIMPLIFY" = FALSE))
      index_miss_list     <- lapply(results_list, FUN = function(res) attr(res$y_imputed.1, "index_miss"))
      index_outliers_list <- lapply(results_list, FUN = function(res) attr(res$y_imputed.1, "index_outliers"))
      for (i in 1:n_samples) {
        attr(results[[i]], "index_miss")     <- index_miss_list
        attr(results[[i]], "index_outliers") <- index_outliers_list
      }
      if (return_estimates) {
        results$phi0   <- as.vector(results$phi0)
        results$phi1   <- as.vector(results$phi1)
        results$sigma2 <- as.vector(results$sigma2)
      }
    }
    # output messages
    if (verbose)
      for (i in 1:length(results_list))
        message(names(results_list)[i], ": ", 
                length(attr(results_list[[i]], "index_miss")), " inner missing values imputed and ", 
                length(attr(results_list[[i]], "index_outliers")), " outliers detected and corrected.")    
    return(results)
  }
  
  #
  #   code for y single-column
  #  
  # error control
  if (!is.numeric(y)) stop("\"y\" only allows numerical or NA values.")
  if (sum(!is.na(y)) < 5) stop("Each time series in \"y\" must have at least 5 observations.")
  
  y_attrib <- attributes(y)
  y_name <- colnames(y)  
  y <- as.numeric(y)
  y_imputed <- matrix(rep(y, times = n_samples), ncol = n_samples)
  
  if (remove_outliers) {
    fitted <- fit_AR1_Gaussian(y, random_walk, zero_mean, remove_outliers = TRUE, outlier_prob_th = outlier_prob_th, verbose = FALSE, 
                               tol = tol, maxiter = maxiter)
    if (!is.null(index_outliers <- fitted$index_outliers))
      y[index_outliers] <- NA
  }  
  
  # imputation
  if (!any_inner_NA(y)) {  # trivial case with no inner NAs: do nothing
    if (return_estimates && !remove_outliers) 
      fitted <- fit_AR1_Gaussian(y, random_walk, zero_mean, remove_outliers = FALSE, verbose = FALSE, tol = tol, maxiter = maxiter)
  } else {
    fitted <- fit_AR1_Gaussian(y, random_walk, zero_mean, remove_outliers = FALSE, verbose = FALSE, 
                               return_condMeanCov = TRUE, tol = tol, maxiter = maxiter)
    index_obs <- which(!is.na(y))
    index_obs_min <- min(index_obs)
    index_miss_middle <- which(is_inner_NA(y))
    if (length(index_miss_middle) > 0) {
      index_miss_deleted <- index_miss_middle - (index_obs_min - 1)
      y_imputed[index_miss_middle, ] <- t(MASS::mvrnorm(n = n_samples, fitted$cond_mean_y[index_miss_deleted], fitted$cond_cov_y[index_miss_deleted, index_miss_deleted]))
    }
  }
  
  # prepare results
  index_miss <- which(is_inner_NA(y))
  if (length(index_miss) == 0) index_miss <- NULL
  if(!remove_outliers) index_outliers <- NULL
  if (n_samples == 1) {
    attributes(y_imputed) <- y_attrib
    attr(y_imputed, "index_miss") <- index_miss
    attr(y_imputed, "index_outliers") <- index_outliers
    results <- if (!return_estimates) y_imputed else list("y_imputed" = y_imputed)
  } else {
    y_imputed <-lapply(split(y_imputed, col(y_imputed)), FUN = function(x) { attributes(x) <- y_attrib
                                                                             attr(x, "index_miss") <- index_miss
                                                                             attr(x, "index_outliers") <- index_outliers
                                                                             return(x) })
    results <- c("y_imputed" = y_imputed)
  }
  if (return_estimates)  results <- c(results, list("phi0"   = fitted$phi0,
                                                    "phi1"   = fitted$phi1,
                                                    "sigma2" = fitted$sigma2))
  # output message
  if (verbose)
    message(y_name, ": ", 
            length(index_miss), " missing values imputed and ", 
            length(index_outliers), " outliers detected and corrected.")   
  return(results)
}





#' @title Impute missing values of time series on a rolling window basis based on a Gaussian AR(1) model
#'
#' @description Impute inner missing values (excluding leading and trailing ones) 
#'              of time series on a rolling window basis. This is a wrapper of the 
#'              function \code{\link{impute_AR1_Gaussian}}.
#' 
#' @inheritParams impute_AR1_Gaussian
#' @param rolling_window Rolling window length (default is \code{252}).
#' 
#' @return Same as \code{\link{impute_AR1_Gaussian}} for the case \code{n_samples = 1} 
#'         and \code{return_estimates = FALSE}.
#' 
#' @author Daniel P. Palomar
#' 
#' @seealso \code{\link{plot_imputed}}, \code{\link{impute_AR1_Gaussian}}
#' 
#' @examples
#' library(imputeFin)
#' data(ts_AR1_Gaussian) 
#' y_missing <- ts_AR1_Gaussian$y_missing
#' y_imputed <- impute_rolling_AR1_Gaussian(y_missing)
#' plot_imputed(y_imputed)
#' 
#' @export
impute_rolling_AR1_Gaussian <- function(y, rolling_window = 252,
                                        random_walk = FALSE, zero_mean = FALSE, 
                                        remove_outliers = FALSE, outlier_prob_th = 1e-3, 
                                        tol = 1e-10, maxiter = 100) {
  # each stock, impute block by block
  impute_single_stock <- function(y, rolling_window, random_walk, zero_mean, remove_outliers, outlier_prob_th, tol, maxiter) {
    idx_obs <- which(!is.na(y))
    if (length(idx_obs) == 0)
      T <- 0
    else
      T <- tail(idx_obs, 1) - idx_obs[1] + 1
    y_imputed <- y
    if (T > 1) {  # something to impute
      # generate block indices
      num_blocks <- max(1, floor(T/rolling_window))
      i_start <- idx_obs[1] - 1 + seq(1, T, by = rolling_window)[1:num_blocks]
      i_end   <- c(i_start[-1] - 1, tail(idx_obs, 1))
      # remove blocks that end/start in the middle of NAs
      blocks_to_remove <- union(which(is.na(y[i_start])), which(is.na(y[i_end])) + 1)
      if (length(blocks_to_remove) > 0) {
        i_start <- i_start[-blocks_to_remove]
        i_end   <- i_end[-(blocks_to_remove - 1)]
        num_blocks <- length(i_start)
      }
      #browser()
      idx_blocks_list <- lapply(1:num_blocks, function(i_block) seq(i_start[i_block], i_end[i_block]))
      # call impute function for each block
      y_imputed_list <- lapply(idx_blocks_list, 
                               function(idx) impute_AR1_Gaussian(y[idx, ], n_samples = 1,
                                                                 random_walk = random_walk, zero_mean = zero_mean, 
                                                                 remove_outliers = remove_outliers, outlier_prob_th = outlier_prob_th,
                                                                 verbose = FALSE, return_estimates = FALSE, 
                                                                 tol = tol, maxiter = maxiter))
      y_imputed[idx_obs[1]:tail(idx_obs, 1)] <- do.call(rbind, y_imputed_list)
    }
    return(y_imputed)
  }
  
  # impute stock by stock (column by column)
  y_imputed_list <- lapply(1:NCOL(y), FUN = function(i) {
    impute_single_stock(y[, i, drop = FALSE], rolling_window, random_walk, zero_mean, remove_outliers, outlier_prob_th, tol, maxiter)
  })
  names(y_imputed_list) <- colnames(y)
  y_imputed <- do.call(cbind, y_imputed_list)
  
  return(y_imputed)
}









#
# Compute the conditional mean and covariance matrix of y given the observed data and current estimates
#
condMeanCov <- function(y_obs, index_obs, n, n_block, n_in_block, 
                        first_index_in_block, last_index_in_block, previous_obs_before_block, next_obs_after_block, 
                        phi0, phi1, sigma2, full_cov = FALSE) {
  
  cond_mean_y <- rep(NA, n)  # mean of y conditional on all the observed data
  cond_mean_y[index_obs] <- y_obs
  if (full_cov) cond_cov_y <- matrix(0, nrow = n, ncol = n)  # covariance of y conditional on all the observed data
  cond_cov_y_diag <- rep(0, n)
  cond_cov_y_diag1 <- rep(0, n-1)
  
  for (d in 1:n_block) {  # for each missing block
    n_d <- n_in_block[d]  # number of missing values in the d-th missing block  
    index_d <- first_index_in_block[d]:last_index_in_block[d]  # indexes of missing values in the d-th missing block
    phi1_exp <- phi1^(0:(n_d+1))  # phi1[k] to the power of 0, 1, ..., (n_d+1)
    sum_phi1_exp <- cumsum(phi1_exp) 
    cond_mean_block_obs <- sum_phi1_exp[1:(n_d+1)] * phi0 + phi1_exp[2:(n_d+2)] * previous_obs_before_block[d]  # mean of the d-th missing block conditional on all the previous observed samples
    # computation of cond_cov_block_obs
    cond_cov_block_obs_diag <- c(1, rep(NA, n_d))
    for (i in 1:n_d)  cond_cov_block_obs_diag[i+1] <- cond_cov_block_obs_diag[i]*phi1^2 + 1
    cond_cov_block_obs_diag1 <- cond_cov_block_obs_diag[1:n_d] * phi1
    cond_cov_block_obs_lastcol <- cond_cov_block_obs_diag * rev(phi1_exp[1:(n_d+1)])
    if (full_cov) {
      cond_cov_block_obs <- matrix(nrow = n_d+1, ncol = n_d+1)  # covariance of the d-th missing block and the next observation conditional on all the previous observed samples  
      diag(cond_cov_block_obs) <- cond_cov_block_obs_diag
      for (i in 1:n_d)
        cond_cov_block_obs[i, (i+1):(n_d+1)] <-
        cond_cov_block_obs[(i+1):(n_d+1), i] <- cond_cov_block_obs_diag[i] * phi1_exp[2:(n_d+1 - i+1)]
      #sanity check
      #if (sum(abs(cond_cov_block_obs_lastcol - cond_cov_block_obs[, n_d+1])) > 1e-9 ||
      #    sum(abs(cond_cov_block_obs_diag1 - diag1(cond_cov_block_obs))) > 1e-9 ||
      #    sum(abs(cond_cov_block_obs_diag - diag(cond_cov_block_obs))) > 1e-9) {
      #  message("Error in computation of cond_cov_block_obs...")
      #  browser()
      #}
    }

    # mean of the d-th missing block conditional on all the observed data
    cond_mean_block <- cond_mean_block_obs[1:n_d] + cond_cov_block_obs_lastcol[1:n_d] / cond_cov_block_obs_lastcol[n_d+1] *
                                                    (next_obs_after_block[d] - cond_mean_block_obs[n_d+1])
    # covariance of the d-th missing block conditional on all the observed data
    if (full_cov) cond_cov_block <- sigma2 * (cond_cov_block_obs[1:n_d, 1:n_d] - tcrossprod(cond_cov_block_obs[1:n_d, n_d+1])/cond_cov_block_obs[n_d+1, n_d+1])  #slower: cond_cov_block <- sigma2 * (cond_cov_block_obs[1:n_d, 1:n_d] - cond_cov_block_obs[1:n_d, n_d+1] %*% t(cond_cov_block_obs[1:n_d, n_d+1])/cond_cov_block_obs[n_d+1, n_d+1])
    cond_cov_block_diag <- sigma2 * (cond_cov_block_obs_diag[-(n_d+1)] - cond_cov_block_obs_lastcol[1:n_d]^2/cond_cov_block_obs_lastcol[n_d+1])
    cond_cov_block_diag1 <- sigma2 * (cond_cov_block_obs_diag1[-n_d] - cond_cov_block_obs_lastcol[1:(n_d-1)]*cond_cov_block_obs_lastcol[2:n_d]/cond_cov_block_obs_lastcol[n_d+1])

    # update global mean and cov matrix
    cond_mean_y[index_d] <- cond_mean_block
    if (full_cov) cond_cov_y[index_d, index_d] <- cond_cov_block
    cond_cov_y_diag[index_d] <- cond_cov_block_diag  # cond_cov_y_diag[index_d] <- diag(cond_cov_block)
    cond_cov_y_diag1[index_d[-n_d]] <- cond_cov_block_diag1  # cond_cov_y_diag1[index_d[-n_d]] <- diag1(cond_cov_block)
  }
  # sanity check
  #if (full_cov)
  #  if (sum(abs(diag(cond_cov_y) - cond_cov_y_diag)) > 1e-9 ||
  #      sum(abs(diag1(cond_cov_y) - cond_cov_y_diag1)) > 1e-9) {
  #    message("Error in computation of cond_cov_y...")
  #    browser()
  #  }
  
  results <- list("mean_y"      = cond_mean_y,
                  "cov_y_diag"  = cond_cov_y_diag,
                  "cov_y_diag1" = cond_cov_y_diag1)
  if (full_cov) 
    results <- c(results, list("cov_y" = cond_cov_y))
  return(results)
}



#
# A heuristic method to compute the parameters of Gaussian AR(1) model from incomplete time series
#
fit_AR1_Gaussian_heuristic <- function(y, index_miss, random_walk = FALSE, zero_mean = TRUE) {
  
  index_miss_p <- c(0, index_miss, length(y) + 1)
  delta_index_miss_p <- diff(index_miss_p)
  index_delta_index_miss_p <- which(delta_index_miss_p > 2)
  n_obs_block <- length(index_delta_index_miss_p)  # number of observation blocks with more than 1 sample
  n_in_obs_block <- delta_index_miss_p[index_delta_index_miss_p] - 1  # number of observed samples in each qualified observation block
  
  m <- 0
  y_obs2 <- y_obs1 <- c()
  for (i in 1:n_obs_block) {
    y_obs1[(m + 1):(m + n_in_obs_block[i] - 1)] <- y[(index_miss_p[index_delta_index_miss_p[i]] + 1):(index_miss_p[index_delta_index_miss_p[i] + 1] - 2)]
    y_obs2[(m + 1):(m + n_in_obs_block[i] - 1)] <- y[(index_miss_p[index_delta_index_miss_p[i]] + 2):(index_miss_p[index_delta_index_miss_p[i] + 1] - 1)]
    m <- m + n_in_obs_block[i] - 1
  }
  n_y_obs1 <- length(y_obs1)
  
  s_y_obs1 <- sum(y_obs1)
  s_y_obs2 <- sum(y_obs2)
  s_y_obs1_square <- sum(y_obs1^2)
  s_y_obs2_square <- sum(y_obs2^2)
  s_y_obs2_y_obs1 <- sum(y_obs1 * y_obs2)
  
  # compute the estimates
  if (!random_walk && !zero_mean) {
    phi1 <- (s_y_obs2_y_obs1 - s_y_obs2 * s_y_obs1 / n_y_obs1) / (s_y_obs1_square - s_y_obs1 * s_y_obs1 / n_y_obs1) 
    phi0 <- (s_y_obs2 - phi1 * s_y_obs1) / n_y_obs1
  } else if (random_walk && !zero_mean){
    phi1 <- 1
    phi0 <- (s_y_obs2 - phi1 * s_y_obs1) / n_y_obs1
  } else if (!random_walk && zero_mean){
    phi1 <- s_y_obs2_y_obs1 / s_y_obs1_square 
    phi0 <- 0
  } else{
    phi1 <- 1
    phi0 <- 0
  }
  sigma2 <- (( s_y_obs2_square + n_y_obs1 * phi0^2 + phi1^2 * s_y_obs1_square 
               - 2 * phi0 * s_y_obs2 - 2 * phi1 * s_y_obs2_y_obs1 + 2 * phi0 * phi1 * s_y_obs1 ) / n_y_obs1)
  
  return(list("phi0"   = phi0,
              "phi1"   = phi1,
              "sigma2" = sigma2))
}



#
# find the missing blocks in a time series with missing values
#
findMissingBlock <- function(y) {
  n <- length(y)  # length of the time series
  index_obs <- which(!is.na(y))  # indexes of observed values
  index_miss <- setdiff(1:n, index_obs)  # indexes of missing values
  n_obs <- length(index_obs)
  y_obs <- y[index_obs]  # observed values
  delta_index_obs <- diff(index_obs)
  index_delta_index_obs <- which(delta_index_obs > 1)
  n_block <- length(index_delta_index_obs)  # number of missing blocks
  n_in_block <- delta_index_obs[index_delta_index_obs] - 1  # number of missing values in each block
  first_index_in_block <- index_obs[index_delta_index_obs] + 1  # index of the first missing value in each block
  last_index_in_block <- index_obs[index_delta_index_obs] + n_in_block  # index of the last missing value in each block
  previous_obs_before_block <- y[first_index_in_block - 1]  # previous observed value before each block
  next_obs_after_block <- y[last_index_in_block + 1]  # next observed value after each block
  
  return(list("n"                         = n,
              "index_obs"                 = index_obs,
              "index_miss"                = index_miss,
              "n_obs"                     = n_obs,
              "y_obs"                     = y_obs,
              "delta_index_obs"           = delta_index_obs,
              "n_block"                   = n_block,
              "n_in_block"                = n_in_block,
              "first_index_in_block"      = first_index_in_block,
              "last_index_in_block"       = last_index_in_block,
              "previous_obs_before_block" = previous_obs_before_block,
              "next_obs_after_block"      = next_obs_after_block))
}
Any scripts or data that you put into this service are public.
imputeFin documentation built on Feb. 20, 2021, 9:07 a.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
imputeFin
Imputation of Financial Time Series with Missing Values and/or Outliers

R/estimate_impute_AR1_Gaussian.R
In imputeFin: Imputation of Financial Time Series with Missing Values and/or Outliers

Defines functions findMissingBlock fit_AR1_Gaussian_heuristic condMeanCov impute_rolling_AR1_Gaussian impute_AR1_Gaussian fit_AR1_Gaussian_complete fit_AR1_Gaussian

Documented in fit_AR1_Gaussian impute_AR1_Gaussian impute_rolling_AR1_Gaussian

Try the imputeFin package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

imputeFin Imputation of Financial Time Series with Missing Values and/or Outliers

R/estimate_impute_AR1_Gaussian.R In imputeFin: Imputation of Financial Time Series with Missing Values and/or Outliers

Defines functions findMissingBlock fit_AR1_Gaussian_heuristic condMeanCov impute_rolling_AR1_Gaussian impute_AR1_Gaussian fit_AR1_Gaussian_complete fit_AR1_Gaussian

Documented in fit_AR1_Gaussian impute_AR1_Gaussian impute_rolling_AR1_Gaussian

Try the imputeFin package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

imputeFin
Imputation of Financial Time Series with Missing Values and/or Outliers

R/estimate_impute_AR1_Gaussian.R
In imputeFin: Imputation of Financial Time Series with Missing Values and/or Outliers