Example: Hidden Markov Model

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center",
  fig.path = "figures/hmm-",
  fig.dim = c(8, 6), 
  out.width = "75%",
  # all optimizations are pre-computed to save building time
  eval = FALSE
)
library("ino")
options("ino_verbose" = TRUE) 
data("hmm_ino")
set.seed(1)
ggplot2::theme_set(ggplot2::theme_minimal())

In this vignette^[The vignette was built using R r paste(R.Version()[c("major","minor")], collapse = ".") with the {ino} r utils::packageVersion("ino") package.] we describe the workflow of the {ino} package for fitting a hidden Markov model (HMM) through likelihood optimization. HMMs are statistical models for systems that can only be observed indirectly through a sequence of outputs and whose underlying states remain hidden. They consist of two processes: a Markov process that describes the hidden states, and an observable process that describes the outputs produced by the system, where the probability of observing an output at a given time depends only on the current state. State-switching models like HMMs are commonly used in speech recognition, animal movement modeling, bioinformatics, and finance. For more technical details about HMMs and their scope of application, see, e.g., @Zucchini:2016.

Application to financial data

The example data set considered throughout this vignette covers a time series of log returns from the German stock index DAX over 30 years. The DAX closing prices are freely accessible via Yahoo Finance and can be downloaded via the download_data() function from the {fHMM} package [@Oelschläger:2023b]. They can easily be transformed to log returns using the {dplyr} package [@Wickham:2023]:

library("fHMM")
library("dplyr")
dax <- download_data(symbol = "^GDAXI", from = "1990-01-01", to = "2020-01-01") %>%
  as_tibble() %>%
  reframe(
    date = as.Date(Date, format = "%Y-%m-%d"),
    logreturn = c(NA, diff(log(Close), lag = 1))
  ) %>%
  filter(!is.na(logreturn)) %>%
  print()

The time series looks as follows:

library("ggplot2")
ggplot(dax, aes(x = date, y = logreturn)) +
  geom_point() +
  geom_line() +
  scale_x_date() +
  scale_y_continuous(labels = scales::label_percent())

As the log returns are continuous and can take both negative and positive values, we consider an HMM with Gaussian state-dependent distributions --- note that some applications instead use t-distributions to also model the kurtosis [@Oelschläger:2021].

Likelihood optimization

We consider a 2-state (N = 2) Gaussian-HMM here to model bearish and bullish market periods. This results in six parameters (npar = 6) to be estimated: the two off-diagonal entries of the transition probability matrix and the mean and standard deviation of each of the two state-dependent normal distributions.
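Following @Zucchini:2016, the likelihood of this model can be written compactly via the forward algorithm as a matrix product,

$$
L(\theta) = \delta P(x_1) \Gamma P(x_2) \cdots \Gamma P(x_T) 1^\top,
$$

where $\delta$ denotes the initial (stationary) state distribution, $\Gamma$ the $2 \times 2$ transition probability matrix, and $P(x_t)$ a diagonal matrix containing the Gaussian densities of the log return $x_t$ under the two state-dependent distributions.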

The likelihood for a Gaussian-HMM is given by the function f_ll_hmm() provided by the {ino} package. The argument neg = TRUE indicates that we minimize the negative log-likelihood, and set_optimizer(optimizer_nlm()) selects the optimizer stats::nlm() (see the introductory vignette for more details on how to specify optimizers).

hmm_ino <- Nop$new(
  f = f_ll_hmm, 
  npar = 6, 
  data = dax$logreturn, 
  N = 2, 
  neg = TRUE
)$
  set_optimizer(optimizer_nlm())
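For intuition, the following is a minimal sketch of how such a negative log-likelihood could be evaluated with the scaled forward algorithm. The function neg_ll_hmm_sketch() and its parameter transformations are illustrative assumptions only, not the {ino} implementation of f_ll_hmm():

# Illustrative sketch only, not the implementation of f_ll_hmm().
# theta = (two unconstrained t.p.m. entries, two means, two log standard deviations)
neg_ll_hmm_sketch <- function(theta, data) {
  Gamma <- diag(2)
  Gamma[!Gamma] <- exp(theta[1:2])
  Gamma <- Gamma / rowSums(Gamma)                    # transition probability matrix
  mu <- theta[3:4]                                   # state-dependent means
  sigma <- exp(theta[5:6])                           # state-dependent standard deviations
  delta <- solve(t(diag(2) - Gamma + 1), rep(1, 2))  # stationary distribution
  allprobs <- cbind(dnorm(data, mu[1], sigma[1]), dnorm(data, mu[2], sigma[2]))
  phi <- delta * allprobs[1, ]                       # scaled forward recursion
  log_lik <- log(sum(phi))
  phi <- phi / sum(phi)
  for (t in 2:length(data)) {
    phi <- (phi %*% Gamma) * allprobs[t, ]
    log_lik <- log_lik + log(sum(phi))
    phi <- phi / sum(phi)
  }
  -log_lik                                           # negative log-likelihood
}

Minimizing this sketch over theta with stats::nlm() would mimic what hmm_ino does with f_ll_hmm().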

Random initialization

A first initialization approach is to use randomly chosen starting values. The sampler() function below draws the two (unconstrained) transition probability parameters uniformly from [-2, -1], the two state-dependent means from a standard normal distribution, and the two log standard deviations as logarithms of values drawn uniformly between 0.5 and 2.

The $optimize() method with argument initial = sampler then performs runs = 100 optimization runs with starting values drawn from the specified distributions:

sampler <- function() {
  c(stats::runif(2, -2, -1), stats::rnorm(2), log(stats::runif(2, 0.5, 2)))
}
hmm_ino$optimize(initial = sampler, runs = 100, label = "random")

Educated guesses

As an alternative to random sampling, we can select fixed sets of starting values that fall within the ranges used above. These "educated guesses" are likely to be close to the global optimum:

tpm_entry_1 <- tpm_entry_2 <- c(-2, -2.5)
mu_1 <- c(0, -0.05)
mu_2 <- c(0, 0.05)
sd_1 <- c(log(0.1), log(0.5))
sd_2 <- c(log(0.75), log(1))
starting_values <- asplit(expand.grid(
  tpm_entry_1, tpm_entry_2, mu_1, mu_2, sd_1, sd_2), 
  MARGIN = 1
)

This grid results in a total of r length(starting_values) starting values (each of the six parameters takes one of two values, i.e., 2^6 = 64 combinations), which can be specified as the initial argument:

hmm_ino$optimize(initial = starting_values, label = "educated_guess")

Subset initialization

Since the data set is large, containing a total of r nrow(dax) share return observations, it might be beneficial to obtain initial values by first fitting the model to a data subset. If the data subset is chosen small enough, estimation with the subset will be much faster. On the other hand, if the data subset is chosen large enough to still contain enough information, the estimates on the subset will already lie close to the estimates for the full model.

To illustrate the subset initialization strategy, we consider the first quarter of observations, which can be extracted using the $reduce() method with arguments how = "first" and prop = 0.25. The starting values for the optimizations on this subset are drawn from the sampler() function defined above. We again use $optimize() to fit the HMM, but now to the data subset. With $continue(), we then use the estimates obtained from the optimization on the subset as initial values to fit the model to the entire data set. The entire data set is restored via $reset_argument("data").^[If we were to skip the $reset_argument() step, all future optimization runs would be made on the subset.]

hmm_ino$
  reduce("data", how = "first", prop = 0.25)$
  optimize(initial = sampler, runs = 100, label = "subset")$
  reset_argument("data")$
  continue()
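For intuition, how = "first" with prop = 0.25 keeps the first quarter of the time series; roughly, this corresponds to the following (a hypothetical illustration, not the {ino} internals):

subset_data <- dax$logreturn[seq_len(floor(0.25 * nrow(dax)))]  # first 25% of observations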

Standardize initialization

The considered log returns range r paste(c("from", "to"), round(range(dax$logreturn), 1), collapse = " "). Optimization might be facilitated by standardizing the data first. This idea can be tested via the $standardize() method:

hmm_ino$
  standardize("data")$
  optimize(initial = sampler, runs = 100, label = "standardize")$
  reset_argument("data")

Note that we again use $reset_argument() to obtain our actual (non-standardized) data set.
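Standardizing here presumably amounts to centering and scaling the log returns, i.e., roughly (illustrative only, not the $standardize() internals):

standardized_logreturn <- (dax$logreturn - mean(dax$logreturn)) / sd(dax$logreturn)  # mean 0, sd 1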

Evaluating the optimization runs

Global versus local optima

Selecting the starting values for the HMM likelihood optimization is a well-known issue, as poor starting values may likely result in local optima.^[Other R packages designed to fit HMMs discuss this topic in more detail, see, e.g., @Michelot:2016.] We thus first evaluate the optimizations by comparing the likelihood values at convergence, which can be displayed using the $optima() method. Here, sort_by = "value" sorts the results by the log-likelihood value at convergence, only_comparable = TRUE excludes runs that are not comparable (such as the runs on the standardized data), and digits = 0 rounds the reported values:

hmm_ino$optima(sort_by = "value", only_comparable = TRUE, digits = 0)
library("dplyr")
optima <- hmm_ino$optima(sort_by = "value", only_comparable = TRUE, digits = 0)
global <- optima %>% arrange(value) %>% slice(1) %>% pull(frequency)
total <- sum(optima$frequency)
local <- total - global

The frequency table indicates that r global out of r total runs converged to the smallest negative log-likelihood value, which appears to be the global optimum. In the remaining r local runs, however, the optimization apparently got stuck in local optima.

Using summary(), we can now investigate the optimum values ("value"), the corresponding parameter vectors ("parameter"), and the optimization times ("seconds") of all runs (here, only the first ten are shown):^[The $elements_available() method provides an overview of all available optimization result elements.]

summary(hmm_ino, which_element = c("value", "parameter", "seconds")) %>% 
  head(n = 10)

The final parameter estimates (i.e., the parameters associated with the global optimum) can be accessed via $best_parameter() ...

hmm_ino$best_parameter()

... and the corresponding likelihood value as:

hmm_ino$best_value()

Note that the attributes "run" and "optimizer" indicate in which optimization run and with which optimizer this estimate was obtained.
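Assuming these attributes are attached to the returned values (as the note above suggests), they can be extracted with base R, for example:

attr(hmm_ino$best_value(), "run")
attr(hmm_ino$best_parameter(), "optimizer")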

We can compute the proportion of runs that lead to the apparent global optimum of r format(as.numeric(hmm_ino$best_value()), scientific = FALSE) as follows:

summary(
  hmm_ino, c("label", "value", "seconds"),
  global_optimum = "value < -22445", only_comparable = TRUE
) %>% 
  group_by(label) %>% 
  summarise(proportion = mean(global_optimum, na.rm = TRUE))

While about 55\% of the educated-guess runs converge to the global optimum, the random initialization and subset initialization strategies get stuck in local optima more often. Note that the standardize initialization approach cannot be compared to the other approaches here, because it optimizes the likelihood of the standardized data, which yields different likelihood values.

Optimization time

The plot() function can be used to investigate the optimization times across initialization strategies (by = "label").^[Setting relative = FALSE in the plot() call shows the absolute times.]

plot(hmm_ino, by = "label")

Using the output provided by summary() and some data manipulation functions provided by the package {dplyr}, we can also compute summary statistics of interest, like the median computation time or standard deviation per strategy:

summary(hmm_ino, c("label", "seconds")) %>% 
  group_by(label) %>% 
  summarise(
    median_seconds = median(seconds, na.rm = TRUE),
    sd_seconds = sd(seconds, na.rm = TRUE)
  ) %>%
  arrange(median_seconds)

In this example, the subset and the standardize approaches improve the median optimization time by a factor of about two compared to the random initialization approach.

References


