library(knitr)
opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center",
  fig.retina = 2,
  out.width = "85%",
  dpi = 96,
  pngquant = "--speed=1"
)
knitr_in_progress <- isTRUE(getOption('knitr.in.progress'))
knit_hooks$set(pngquant = hook_pngquant)
# rmarkdown::render("vignettes/CovarianceEstimationHeavyTail.Rmd", output_format = "rmarkdown::html_vignette", params = list(num_realiz = 100))

This vignette illustrates the usage of the package fitHeavyTail to estimate the mean vector and covariance matrix of heavy-tailed multivariate distributions such as the angular Gaussian, Cauchy, or Student's $t$ distribution. The results are compared against existing benchmark functions from different packages.

\newcommand{\bm}[1]{\boldsymbol{#1}} \newcommand{\textm}[1]{\textsf{#1}} \def\T{{\mkern-1mu\mathsf{T}}}

Installation

The package can be installed from CRAN or GitHub:

# install stable version from CRAN
install.packages("fitHeavyTail")

# install development version from GitHub
devtools::install_github("convexfi/fitHeavyTail")

To get help:

library(fitHeavyTail)
help(package = "fitHeavyTail")
?fit_mvt

To cite fitHeavyTail in publications:

citation("fitHeavyTail")

Quick Start

To illustrate the simple usage of the package fitHeavyTail, let's start by generating some multivariate data under a Student's $t$ distribution with significant heavy tails (degrees of freedom $\nu=4$):

library(mvtnorm)  # package for multivariate t distribution
N <- 10   # number of variables
T <- 80   # number of observations
nu <- 4   # degrees of freedom for heavy tails

set.seed(42)
mu <- rep(0, N)
U <- t(rmvnorm(n = round(0.3*N), sigma = 0.1*diag(N)))
Sigma_cov <- U %*% t(U) + diag(N)  # covariance matrix with factor model structure
Sigma_scatter <- (nu-2)/nu * Sigma_cov
X <- rmvt(n = T, delta = mu, sigma = Sigma_scatter, df = nu)  # generate data

We can first estimate the mean vector and covariance matrix via the traditional sample estimates (i.e., sample mean and sample covariance matrix):

mu_sm     <- colMeans(X)
Sigma_scm <- cov(X)

Then we can compute the robust estimates via the package fitHeavyTail:

library(fitHeavyTail)
fitted <- fit_mvt(X)

We can now compute the estimation errors and see the significant improvement:

sum((mu_sm     - mu)^2)
sum((fitted$mu - mu)^2)

sum((Sigma_scm  - Sigma_cov)^2)
sum((fitted$cov - Sigma_cov)^2)
# fitting with factor model
fitted_3factors <- fit_mvt(X, factors = 3)
sum((fitted_3factors$mu - mu)^2)
sum((fitted_3factors$cov - Sigma_cov)^2)

To get a visual idea of the robustness, we can plot the shapes of the covariance matrices (true and estimated ones) on two dimensions. Observe how the heavy-tailed estimation follows the true one more closely than the sample covariance matrix:

# fig.cap="Sample covariance matrix vs robust estimator."
library(mvtnorm)
library(ellipse)
library(ggplot2)
my_colors <- c("#2980b9", "#c0392b", "#16a085", "#f39c12", "#a29bfe", "#B33771", "#C4E538", "#273c75", "#95a5a6", "#222f3e")


N <- 2
T <- 50
nu <- 4
mu <- rep(0, N)
Sigma <- matrix(c(0.00125, 0.00112, 0.00112, 0.00125), N, N)

set.seed(42)
X <- rmvt(n = 200, delta = mu, sigma = (nu-2)/nu * Sigma, df = nu)
X_ <- X[1:T, ]
Sigma_scm       <- cov(X_)
Sigma_heavytail <- fit_mvt(X_, optimize_mu = FALSE, nu = 4, scale_covmat = TRUE)$cov
colnames(Sigma_heavytail) <- rownames(Sigma_heavytail) <- NULL

# scatter plot
ggplot(data.frame(x = X[, 1], y = X[, 2]), aes(x, y)) +
  geom_point(alpha = 1, size = 0.8) +
  geom_path(data = data.frame(ellipse(Sigma)),           aes(x, y, col = "1"), linewidth = 1) +
  geom_path(data = data.frame(ellipse(Sigma_scm)),       aes(x, y, col = "2"), linewidth = 1) +
  geom_path(data = data.frame(ellipse(Sigma_heavytail)), aes(x, y, col = "3"), linewidth = 1) +
  coord_cartesian(xlim = c(-0.15, 0.15), ylim = c(-0.15, 0.15)) +
  scale_color_manual(name = "ellipses",
                     values = c("1" = "black", "2" = "#c0392b", "3" = "#2980b9"),
                     labels = c("true", "SCM estimation", "heavy-tailed estimation")) +
  labs(title = "Shape of covariance matrices", x = NULL, y = NULL)

Numerical Comparison with Existing Packages

In the following, we generate multivariate heavy-tailed Student's $t$ distributed data and compare the performance of many different existing packages via `r params$num_realiz` Monte Carlo simulations in terms of estimation accuracy, measured by the mean squared error (MSE), and CPU time.

library(mvtnorm)
library(fitHeavyTail)
library(parallel)  # detectCores(logical = FALSE)
library(tictoc)
library(ggplot2)
library(ellipse)
library(ggforce)  # for geom_ellipse()
library(reshape2)
library(dplyr)
# library(RColorBrewer)  # display.brewer.all(colorblindFriendly = TRUE)
library(latex2exp)

N <- 20
nu <- 4
mu <- rep(0, N)

set.seed(42)
U <- t(rmvnorm(n = round(0.3*N), sigma = 0.1*diag(N)))
Sigma_cov <- U %*% t(U) + diag(N)
Sigma_scatter <- (nu-2)/nu * Sigma_cov
# qplot(eigen(Sigma_cov)$values, geom = "histogram", xlab = "eigenvalues", fill = I("cyan"), col = I("black"),
#       main = "Histogram of eigenvalues of true covariance matrix")

MSE <- function(Sigma_hat) sum((Sigma_hat - Sigma_cov)^2)

eval_single <- function(X) {
  MSE <- time <- list()

  name        <- "stats::cov"
  time[name]  <- system.time({Sigma_hat <- cov(X)})["elapsed"]
  MSE[name]   <- MSE(Sigma_hat)

  name        <- "QRM::fit.mst"
  time[name]  <- system.time({Sigma_hat <- tryCatch(as.matrix(QRM::fit.mst(X, method = "BFGS", nit = 100, tol = 1e-6)$covariance),
                                                    warning = function(w) return(NA), 
                                                    error   = function(e) return(NA))})["elapsed"]
  MSE[name]   <- MSE(Sigma_hat)

  name        <- "MASS::cov.trov (nu=6)"
  time[name]  <- system.time({Sigma_hat <- MASS::cov.trob(X, nu = 6)$cov})["elapsed"]
  MSE[name]   <- MSE(Sigma_hat)

  name        <- "MASS::cov.mve"
  time[name]  <- system.time({Sigma_hat <- MASS::cov.mve(X)$cov})["elapsed"]
  MSE[name]   <- MSE(Sigma_hat)

  name        <- "robustbase::covMcd"
  time[name]  <- system.time({Sigma_hat <- suppressWarnings(robustbase::covMcd(X)$cov)})["elapsed"]
  MSE[name]   <- MSE(Sigma_hat)

  name        <- "robust::covRob"
  time[name]  <- system.time({Sigma_hat <- tryCatch(robust::covRob(X, estim = "pairwiseQC")$cov,  # also: "weighted", "M", "pairwiseGK"
                                                    warning = function(w) return(NA), 
                                                    error   = function(e) return(NA))})["elapsed"]
  MSE[name]   <- MSE(Sigma_hat)

  name        <- "covRobust::cov.nnve"
  time[name]  <- system.time({Sigma_hat <- covRobust::cov.nnve(X)$cov})["elapsed"]
  MSE[name]   <- MSE(Sigma_hat)

  name        <- "rrcov::CovMrcd"
  time[name]  <- system.time({Sigma_hat <- rrcov::CovMrcd(X)$cov})["elapsed"]
  MSE[name]   <- MSE(Sigma_hat)

  name        <- "sn::selm (nu=6)"
  time[name]  <- system.time({
    Sigma_hat <- 6/(6-2)*sn::mst.mple(x = matrix(1, nrow = nrow(X)), y = X, fixed.nu = 6, symmetr = TRUE)$dp.complete$Omega
    })["elapsed"]
  MSE[name]   <- MSE(Sigma_hat)

  name        <- "fitHeavyTail::fit_Tyler"
  time[name]  <- system.time({Sigma_hat <- fitHeavyTail::fit_Tyler(X)$cov})["elapsed"]
  MSE[name]   <- MSE(Sigma_hat)

  name        <- "fitHeavyTail::fit_Cauchy"
  time[name]  <- system.time({Sigma_hat <- fitHeavyTail::fit_Cauchy(X)$cov})["elapsed"]
  MSE[name]   <- MSE(Sigma_hat)

  name        <- "fitHeavyTail::fit_mvt (nu=6)"
  time[name]  <- system.time({Sigma_hat <- fitHeavyTail::fit_mvt(X, nu = 6)$cov})["elapsed"]
  MSE[name]   <- MSE(Sigma_hat)

  name        <- "fitHeavyTail::fit_mvt"
  time[name]  <- system.time({Sigma_hat <- fitHeavyTail::fit_mvt(X, nu = "kurtosis")$cov})["elapsed"]
  MSE[name]   <- MSE(Sigma_hat)

  return(list("MSE" = MSE, "time" = time))
}

T_sweep <- round(seq(from = ceiling(1.5*N), to = 5*N, length.out = 12))
if (!knitr_in_progress) pbar <- txtProgressBar(min = it <- 0, max = length(T_sweep), style = 3)
res_all_T <- list()
#tic(sprintf("Total execution time for %d Monte-Carlo realizations:", params$num_realiz))
for(T in T_sweep) {
  if (!knitr_in_progress) setTxtProgressBar(pbar, it <- it + 1)
  # first, generate random heavy-tailed data sequentially for reproducibility
  X_list <- replicate(params$num_realiz, rmvt(n = T, delta = mu, sigma = Sigma_scatter, df = nu), simplify = FALSE)
  names(X_list) <- paste0("realiz ", 1:params$num_realiz)

  # then, run estimations for all realizations in parallel (https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html)
  res <- mclapply(X_list, eval_single, mc.cores = 4)
  #res <- lapply(X_list, eval_single)

  # finally, keep track of MSEs and times
  res_all_T <- c(res_all_T, list(res))
}
#toc()
names(res_all_T) <- T_sweep
methods_names <- names(res_all_T[[1]][[1]][[1]])


# Other options to run the estimations over the realizations:
#   res <- lapply(X_list, eval_single)                    # sequential
#   res <- parallel::parLapply(cl, X_list, eval_single)   # requires: cl <- makeCluster(4);
#                                                         #   clusterExport(cl, varlist = ls(envir = .GlobalEnv)); stopCluster(cl)
#   res <- foreach::foreach(i = seq_along(X_list)) %dopar% eval_single(X_list[[i]])
#                                                         # requires: doParallel::registerDoParallel(4); stopImplicitCluster()
library(reshape2)
library(dplyr)

# create data.frame by melting the nested list
res_all_T_molten <- melt(res_all_T)
names(res_all_T_molten) <- c("value", "method", "measure", "realization", "T")
res_all_T_molten <- res_all_T_molten %>%
  mutate(method = factor(method, levels = methods_names)) %>%
  mutate(T = as.numeric(T)) %>%
  group_by(method, measure, T) %>%
  summarize(value_mean = mean(value), .groups = "drop")
num_methods <- length(unique(res_all_T_molten$method))
#if (anyNA(res_all_T_molten)) message("NAs found.")
library(ggplot2)
library(latex2exp)

# MSE plot
ggplot(res_all_T_molten[res_all_T_molten$measure == "MSE", ], aes(x = T, y = value_mean, color = method, shape = method)) +
  geom_line(linewidth = 0.8) + geom_point(size = 2) + scale_y_log10() +  # coord_cartesian(ylim = c(0, 500)) +
  #scale_color_manual(values = my_colors) 
  scale_shape_manual(values = 1:num_methods) +  # theme(legend.title = element_blank())
  scale_x_continuous(limits = c(first(T_sweep), last(T_sweep)), breaks = seq(first(T_sweep), last(T_sweep), by = 10), minor_breaks = NULL) +
  labs(title = TeX(sprintf("Estimation error of covariance matrix ($\\textit{N}$ = %d, $\\nu$ = %d)", N, nu)), 
       x = TeX("$\\textit{T}$"), y = "MSE")
# time plot
ggplot(res_all_T_molten[res_all_T_molten$measure == "time", ], aes(x = T, y = 1e0*value_mean, color = method, shape = method)) +
  geom_line(linewidth = 0.8) + geom_point(size = 2) + scale_y_log10() +
  scale_shape_manual(values = 1:num_methods) +
  scale_x_continuous(limits = c(first(T_sweep), last(T_sweep)), breaks = seq(first(T_sweep), last(T_sweep), by = 10), minor_breaks = NULL) +
  labs(title = TeX(sprintf("Computational cost ($\\textit{N}$ = %d, $\\nu$ = %d)", N, nu)), 
       x = TeX("$\\textit{T}$"), y = "CPU time [sec]")

The following plot gives a nice overall perspective of the MSE vs. CPU time trade-off of the different methods (note the ellipse at the bottom left, which encloses the four best methods: fitHeavyTail::fit_Tyler(), fitHeavyTail::fit_Cauchy(), fitHeavyTail::fit_mvt(), and fitHeavyTail::fit_mvt() with fixed nu = 6):

library(ggforce)  # for geom_ellipse()

# joint MSE-time plot
mse4  <- tail(res_all_T_molten[res_all_T_molten$T == 100 & res_all_T_molten$measure == "MSE", ]$value_mean, 4)    # for the ellipse (last 4 methods)
time4 <- tail(res_all_T_molten[res_all_T_molten$T == 100 & res_all_T_molten$measure == "time", ]$value_mean, 4)    # for the ellipse (last 4 methods)
ggplot(data.frame("MSE" = res_all_T_molten[res_all_T_molten$T == 100 & res_all_T_molten$measure == "MSE", ]$value_mean, 
                  "cpu_time" = res_all_T_molten[res_all_T_molten$T == 100 & res_all_T_molten$measure == "time", ]$value_mean, 
                  "method" = unique(res_all_T_molten$method)), 
       aes(x = cpu_time, y = MSE, col = method)) +
  geom_point(size = 3) + scale_x_log10() +
  geom_ellipse(aes(x0 = mean(time4), y0 = mean(mse4), a = 70*(max(time4) - min(time4)), b = 1.0*(max(mse4) - min(mse4)), angle = 0), 
               col = "black", linewidth = 0.5) +
  labs(title = TeX(sprintf("Performance vs CPU time ($\\textit{N}$ = %d, $\\textit{T}$ = %d, $\\nu$ = %d)", N, 100, nu)), 
       x = "CPU time [sec]", y = "MSE")

From the numerical results, the top choices seem to be (in order):

  1. fitHeavyTail::fit_mvt() (either without fixing nu or with nu = 6),
  2. fitHeavyTail::fit_Cauchy(),
  3. fitHeavyTail::fit_Tyler(), and
  4. MASS::cov.trob() (with the advantage of being preinstalled with base R, but with a worse estimation error).

The overall winner is fitHeavyTail::fit_mvt() by a big margin.
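
For reference, the recommended estimator amounts to a one-line drop-in replacement for the sample estimates (a minimal recap using a data matrix X, as in the Quick Start):

fitted <- fit_mvt(X)      # nu estimated from the data (the default)
mu_hat    <- fitted$mu    # robust counterpart of colMeans(X)
Sigma_hat <- fitted$cov   # robust counterpart of cov(X)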

Extension to Skewed Distributions

The empirical distribution of the daily returns of some financial variables, such as exchange rates, equity prices, and interest rates, is often skewed. Several formulations of multivariate skewed $t$ distributions appear in the literature [@lee2014finite] [@aas2006generalized]. The package now supports the multivariate generalized hyperbolic (GH) skewed $t$ distribution and provides a method to estimate the parameters of such a distribution, implemented in the function fitHeavyTail::fit_mvst(). Below is a simple example to illustrate its usage:

# parameter setting for GH Skewed t distribution
N <- 5
T <- 200
nu <- 6
set.seed(42)
mu <- rnorm(N)
scatter <- diag(N)
gamma <- rnorm(N)

# generate GH Skew t data via hierarchical structure
taus <- rgamma(n = T, shape = nu/2, rate = nu/2)
X <- matrix(data = mu, nrow = T, ncol = N, byrow = TRUE) +
     matrix(data = gamma, nrow = T, ncol = N, byrow = TRUE) / taus +
     mvtnorm::rmvnorm(n = T, mean = rep(0, N), sigma = scatter) / sqrt(taus)

# fit GH Skew t model
fitted <- fit_mvst(X)
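
Since the data were generated with known parameters, we can sanity-check the fit by comparing a few of the returned estimates against the truth (the element names below follow the package documentation; in case of doubt, inspect names(fitted) on your installed version):

names(fitted)
sum((fitted$mu - mu)^2)        # error in the location vector
sum((fitted$gamma - gamma)^2)  # error in the skewness vector
fitted$nu                      # estimated degrees of freedom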

Algorithms

In essence, all the algorithms are based on the maximum likelihood estimation (MLE) of some assumed distribution given the observed data. The difficulty comes from the fact that the optimal solution to such MLE formulations is only characterized implicitly via a fixed-point equation, and the framework of Majorization-Minimization (MM) algorithms [@SunBabPal2017] becomes key to deriving efficient iterative algorithms.

In some cases, the probability density function is too complicated to handle directly (as with the multivariate Student's $t$ distribution) and it is necessary to resort to a hierarchical distribution that involves some latent variables. To deal with such hidden variables, one has to resort to the Expectation-Maximization (EM) algorithm, which, interestingly, is an instance of the MM algorithm.
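
To make this concrete, below is a minimal sketch of the EM fixed-point iterations for a Student's $t$ fit with $\nu$ fixed (the function fit_mvt_sketch is our own illustration, not the package implementation, which additionally handles missing values, the estimation of $\nu$, and factor-model structure): the E-step weights each observation by $\textsf{E}[\tau_t \mid \bm{x}_t] = (\nu+N)/(\nu+d_t)$, with $d_t$ the Mahalanobis distance under the current estimates, and the M-step recomputes weighted sample estimates:

# minimal EM sketch for a Student's t fit with nu fixed (illustration only)
fit_mvt_sketch <- function(X, nu, max_iter = 100, tol = 1e-8) {
  T <- nrow(X); N <- ncol(X)
  mu <- colMeans(X); Sigma <- cov(X)              # initial estimates
  for (k in 1:max_iter) {
    Xc <- sweep(X, 2, mu)                         # center the data
    d  <- rowSums((Xc %*% solve(Sigma)) * Xc)     # Mahalanobis distances
    w  <- (nu + N) / (nu + d)                     # E-step: weights E[tau_t | x_t]
    mu <- colSums(w * X) / sum(w)                 # M-step: weighted mean
    Xc <- sweep(X, 2, mu)
    Sigma_new <- crossprod(sqrt(w) * Xc) / T      # M-step: weighted scatter
    if (max(abs(Sigma_new - Sigma)) < tol) {Sigma <- Sigma_new; break}
    Sigma <- Sigma_new
  }
  list(mu = mu, scatter = Sigma, cov = nu/(nu - 2) * Sigma)
}
# e.g.: fit_mvt_sketch(X, nu = 6)$cov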

The following is a concise description of the algorithms used by the three fitting functions, fit_Tyler(), fit_Cauchy(), and fit_mvt() (note that the current version of the R package fitHeavyTail does not yet allow a regularization term with a target):

$$\nu_{k+1} = \arg\min_\nu \left\{\frac{\nu}{2}\sum_{t=1}^{T}\left(\textsf{E}_k[\tau_t] - \textsf{E}_k[\log{\tau_t}]\right) - \frac{\nu}{2}T\log{\frac{\nu}{2}} + T\log{\Gamma\left(\frac{\nu}{2}\right)}\right\};$$

$$\nu_{k+1} = \arg\min_\nu \left\{ \frac{N + \nu}{2}\sum_{t=1}^{T}\log{\left(\nu + \left(\bm{x}_t - \bm{\mu}_{k+1}\right)^\T\bm{\Sigma}_{k+1}^{-1}\left(\bm{x}_t - \bm{\mu}_{k+1}\right)\right)}\\ \hspace{6cm} - T\log{\Gamma\left(\frac{N + \nu}{2}\right)} + T\log{\Gamma\left(\frac{\nu}{2}\right)} - \frac{\nu}{2}T\log{\nu} \right\};$$

The initial point is $\bm{\mu}_0 = \frac{1}{T}\sum_{t=1}^{T}\bm{x}_t$, $\bm{\Sigma}_0 = \frac{\nu_0-2}{\nu_0}\bm{S}$ (with $\bm{S}$ the sample covariance matrix), and $\nu_0 = 2/\kappa + 4$, with $\kappa = \left[\frac{1}{3}\frac{1}{N}\sum_{i=1}^N \textsf{kurt}_i\right]^+$ and $$\textsf{kurt}_i = \frac{(T-1)}{(T-2)(T-3)}\left((T+1)\left(\frac{m_i^{(4)}}{\big(m_i^{(2)}\big)^2} - 3\right) + 6\right),$$ where $m_i^{(q)}=\frac{1}{T}\sum_{t=1}^{T}(x_{it}-\bar{x}_i)^q$ denotes the $q$th-order sample moment. The algorithm with missing values in $\bm{x}_t$ becomes more cumbersome but is essentially the same idea. This function can also incorporate a factor model structure on the covariance matrix, $\bm{\Sigma} = \bm{B}\bm{B}^\T + {\sf Diag}(\bm{\psi})$, which requires a more sophisticated algorithm [@ZhouLiuKumarPalomar2019].
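
As an illustration of this kurtosis-based initialization, here is a minimal sketch in R (X is a data matrix as before; the small lower bound on kappa is our own guard against division by zero for light-tailed data):

# sketch of the initial estimate nu_0 = 2/kappa + 4 from the average excess kurtosis
excess_kurtosis <- function(x) {
  T <- length(x)
  m2 <- mean((x - mean(x))^2)   # 2nd-order sample moment
  m4 <- mean((x - mean(x))^4)   # 4th-order sample moment
  (T - 1)/((T - 2)*(T - 3)) * ((T + 1)*(m4/m2^2 - 3) + 6)
}
kappa <- max(1e-6, mean(apply(X, 2, excess_kurtosis))/3)  # [.]^+ with a small guard
nu_0  <- 2/kappa + 4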

References {-}


