Here we will use the bayesdfa package to fit a dynamic factor analysis (DFA) model to simulated time series data. In addition to working through an example of DFA for multivariate time series, we'll apply bayesdfa routines for fitting hidden Markov models (HMMs) to the estimated trends to identify latent regimes. Most of the core functions of the package are included here: `fit_dfa()` and `find_dfa_trends()` for fitting DFA models; `plot_trends()`, `plot_fitted()`, and `plot_loadings()` for plotting estimates; `find_swans()` for flagging extremes; `fit_regimes()` and `find_regimes()` for fitting HMMs; and `plot_regimes()` for plotting HMM output.

```r
library("knitr")
opts_chunk$set(message = FALSE, fig.width = 5.5)
```

Let's load the necessary packages:

```r
library(bayesdfa)
library(ggplot2)
library(dplyr)
library(rstan)
chains <- 1
iter <- 10
```

We adopt the same notation used in the MARSS package for dynamic factor analysis models. The DFA model consists of two models, one describing the latent process or states, and an observation or data model linking the process model to observations. Slight variations on this model are described below, but the process model for the basic DFA model is written as a multivariate random walk,

$$\textbf{x}_{t+1} = \textbf{x}_{t} + \textbf{w}_{t}$$

where the matrix $\textbf{x}$ is dimensioned as the number of years $N$ by the number of latent trends $K$. The process error is assumed to be multivariate normal, $\textbf{w}_{t} \sim MVN(0, \textbf{Q})$, where $\textbf{Q}$ is generally assumed to be a $K \times K$ identity matrix.

The observation or data model linking $\textbf{x}_{t}$ to observed data $\textbf{y}_{t}$ is

$$\textbf{y}_{t} = \textbf{Z}\textbf{x}_{t} + \textbf{B}\textbf{d}_{t} + \textbf{e}_{t}$$

The matrix $\textbf{Z}$ represents a matrix of estimated loadings, dimensioned as the number of time series $P$ by the number of latent trends $K$. Optional covariates $\textbf{d}_{t}$ are included in the observation model with estimated coefficients $\textbf{B}$. The residual errors $\textbf{e}_{t}$ are assumed to be normally distributed, e.g. $\textbf{e}_{t} \sim MVN(0, \textbf{R})$. There are a number of choices for $\textbf{R}$ -- it can be a diagonal matrix with equal or unequal elements, or an unconstrained covariance matrix.
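To make the notation concrete, here is a minimal base-R sketch of the two equations above, simulating a small dataset by hand (dimensions chosen arbitrarily for illustration; no covariates, and $\textbf{R}$ diagonal with equal variances):

```r
set.seed(42)
N <- 20 # time steps
K <- 2  # latent trends
P <- 4  # observed time series

# Process model: each trend is a random walk, x_{t+1} = x_t + w_t, w_t ~ N(0, 1)
x <- apply(matrix(rnorm(N * K), N, K), 2, cumsum) # N x K

# Observation model: y_t = Z x_t + e_t, with R diagonal and equal variances
Z <- matrix(rnorm(P * K), P, K)    # P x K loadings
e <- matrix(rnorm(N * P, sd = 0.1), N, P)
y <- x %*% t(Z) + e                # N x P observed data
```

This is essentially what `sim_dfa()` does internally (with extra options for extremes and loading structure), so in practice you can skip the manual version.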

First, let's simulate some data. We will use the built-in function `sim_dfa()`, but normally you would start with your own data. We will simulate 20 data points from 4 time series, generated from 2 latent processes. For this first dataset, the data won't include extremes, and loadings will be randomly assigned (the default).

```r
set.seed(1)
sim_dat <- sim_dfa(
  num_trends = 2,
  num_years = 20,
  num_ts = 4
)
```

```r
df <- data.frame(
  "time" = rep(1:20, 4),
  "y" = c(t(sim_dat$y_sim)),
  "ts" = as.factor(sort(rep(1:4, 20)))
)
ggplot(df, aes(time, y, group = ts, col = ts)) +
  geom_line() +
  theme_bw() +
  xlab("Time") +
  ylab("Observed data")
```

Next, we'll fit 1-trend, 2-trend, and 3-trend DFA models to the simulated time series using the `fit_dfa()` function. Starting with the 1-trend model, we'll estimate the posterior distributions of the trends and loadings. Note that this example uses 1 MCMC chain and 10 iterations --- for real examples, you'll want to use more (say 4 chains, 5000 iterations).

```r
f1 <- fit_dfa(
  y = sim_dat$y_sim, num_trends = 1, scale = "zscore",
  iter = iter, chains = chains, thin = 1
)
```

Convergence of DFA models can be evaluated with our `is_converged()` function. This function takes a fitted object and a `threshold` argument representing the maximum allowed Rhat value (default = 1.05). The convergence test isn't that useful for a model run for so few iterations, but it is called with

```r
is_converged(f1, threshold = 1.05)
```

This function evaluates Rhat values for all parameters and log-likelihood values, so be sure to check what's not converging if the model fails this test.

Before we extract the trends from the model, we need to rotate the loadings matrix and trends. By default we use the varimax rotation, implemented in the `rotate_trends()` function. An optional `conf_level` argument calculates the specified confidence (credible) interval of the estimates (by default, 0.95).

```r
r <- rotate_trends(f1)
```

The rotated object has several quantities of interest, including the mean values of the trends "trends_mean" and loadings "Z_rot_mean",

```r
names(r)
```

We can then plot the trends and intervals, with

```r
plot_trends(r) + theme_bw()
```

We can also plot the estimated loadings (we'll show that plot for the more complicated 2-trend model below because it's not as interesting for the 1-trend model) and the fitted values. To plot the fitted values from the 1-trend model, we'll use the `plot_fitted()` function (predicted values can also be returned without a plot with the `predicted()` function). The fitted values and intervals are plotted, faceting by time series, with

```r
plot_fitted(f1) + theme_bw()
```

Moving to a more complex model, we'll fit the 2-trend and 3-trend models. All other arguments stay the same as before,

```r
f2 <- fit_dfa(
  y = sim_dat$y_sim, num_trends = 2, scale = "zscore",
  iter = iter, chains = chains, thin = 1
)
r2 <- rotate_trends(f2)

f3 <- fit_dfa(
  y = sim_dat$y_sim, num_trends = 3, scale = "zscore",
  iter = iter, chains = chains, thin = 1
)
r3 <- rotate_trends(f3)
```

The fits from the 2-trend model look considerably better than those from the 1-trend model,

```r
plot_fitted(f2) + theme_bw()
```

The loadings from the 1-trend model aren't as interesting, because for a 1-trend model the loadings are a 1-dimensional vector. For the 2-trend model, there's a separate loading of each time series on each trend,

```r
round(r2$Z_rot_mean, 2)
```

These loadings can also be plotted with the `plot_loadings()` function. This shows the distribution of the densities as violin plots, with color indicating support for each loading being different from 0.

```r
plot_loadings(r2) + theme_bw()
```

Finally, we might be interested in comparing some measure of model selection across these models to identify whether the data support the 1-trend, 2-trend, or 3-trend model. The Leave One Out Information Criterion (LOOIC) can be calculated with the `loo()` function; for example, the LOOIC for the 1-trend model can be accessed with

```r
loo1 <- loo(f1)
loo1$estimates
```

where `r loo1$estimates["looic","Estimate"]` is the estimate and `r loo1$estimates["looic","SE"]` is the standard error.
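To compare all three candidate models at once, the LOOIC point estimates can be collected side by side (a sketch assuming `f1`, `f2`, and `f3` are the fits from above; lower LOOIC indicates more support):

```r
# Collect LOOIC point estimates for the 1-, 2-, and 3-trend models
# fitted earlier; lower values indicate better expected predictive ability
looic <- sapply(
  list("1 trend" = f1, "2 trends" = f2, "3 trends" = f3),
  function(m) loo(m)$estimates["looic", "Estimate"]
)
looic
```

With only 10 iterations the values here are not meaningful; with a real MCMC run, this comparison is what you would report.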

As an alternative to fitting each model individually as we did above, we also developed the `find_dfa_trends()` function to automate fitting a larger number of models. In addition to evaluating different numbers of trends, this function allows the user to optionally evaluate models with normal and Student-t process errors, and alternative variance structures (observation variances of the time series being equal or unequal). For example, to fit models with 1:5 trends, both Student-t and normal errors, and equal and unequal variances, the call would be

```r
m <- find_dfa_trends(
  y = sim_dat$y_sim, iter = iter,
  kmin = 1, kmax = 5, chains = chains,
  compare_normal = TRUE,
  variance = c("equal", "unequal")
)
```

In this example, we'll simulate data with an extreme anomaly. The biggest difference between this model and the conventional model is that in the DFA process model,

$$\textbf{x}_{t+1} = \textbf{x}_{t} + \textbf{w}_{t}$$

instead of $\textbf{w}_{t}$ being normally distributed, we assume $\textbf{w}_{t}$ is Student-t distributed. With multiple trends, this becomes a multivariate Student-t,

$$\textbf{w}_{t} \sim MVT(\nu, 0, \textbf{Q})$$ The parameter $\nu$ controls how much the tails of this distribution deviate from the normal, with smaller values ($\nu$ closer to 2) resulting in more extreme anomalies, and large values ($\nu$ closer to 30) resulting in behavior similar to a normal distribution.
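The effect of $\nu$ on the tails is easy to check directly in base R: compare the probability of a deviation more extreme than 4 standard units under a heavy-tailed Student-t, a nearly-normal Student-t, and a normal distribution.

```r
# Two-sided tail probability beyond +/- 4 under each distribution:
# heavy-tailed Student-t (nu = 3), nearly-normal Student-t (nu = 30), normal
p_t3   <- 2 * pt(-4, df = 3)
p_t30  <- 2 * pt(-4, df = 30)
p_norm <- 2 * pnorm(-4)
round(c(t3 = p_t3, t30 = p_t30, normal = p_norm), 5)
```

The $\nu = 3$ tail probability is orders of magnitude larger than the normal one, which is why small $\nu$ accommodates extreme anomalies, while $\nu = 30$ is already close to the normal case.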

As before, this will be 20 data points from 4 time series, generated from 2 latent processes. The `sim_dfa()` function's arguments `extreme_value` and `extreme_loc` allow the user to specify the magnitude of the extreme (as an additive term in the random walk) and its location (defaulting to the midpoint of the time series). Here we'll include an extreme value of 6,

```r
set.seed(1)
sim_dat <- sim_dfa(
  num_trends = 2, num_years = 20,
  num_ts = 4, extreme_value = 6
)
```

Plotting the data shows the anomaly occurring between time steps 9 and 10,

```r
matplot(t(sim_dat$y_sim), type = "l", ylab = "Response", xlab = "Time")
```

Though the plot is a little clearer if we standardize the time series first,

```r
matplot(scale(t(sim_dat$y_sim)), type = "l", ylab = "Response", xlab = "Time")
```

Instead of fitting a model with normal process deviations, we may be interested in fitting the model with Student-t deviations. We can turn on estimation of `nu` with the `estimate_nu` argument. (Alternatively, `nu` can be fixed a priori by setting the argument `nu_fixed`.) Here's the code for a 2-trend model with Student-t deviations,

```r
t2 <- fit_dfa(
  y = sim_dat$y_sim, num_trends = 2, scale = "zscore",
  iter = iter, chains = chains, thin = 1,
  estimate_nu = TRUE
)
```

Again we have to rotate the trends before plotting,

```r
r <- rotate_trends(t2)
plot_trends(r) + theme_bw()
```

And the loadings,

```r
plot_loadings(r) + theme_bw()
```

One way to look for extremes is the `find_swans()` function, which evaluates the probability of observing a deviation in the estimated trend (or data) greater than what is expected from a normal distribution. This function takes a `threshold` argument specifying the cutoff. For example, to find extremes rarer than 1 in 1000 under a normal distribution, the function call is

```r
find_swans(r, plot = FALSE, threshold = 1 / 1000)
```

Setting `plot = TRUE` also creates a time series plot that flags these values.

We can also look at the estimated `nu` parameter, which shows some support for using the Student-t distribution (values greater than ~30 lead to behavior similar to a normal distribution),

```r
summary(rstan::extract(t2$model, "nu")[[1]])
```

We've implemented a number of alternative families for cases when the response variable might be non-normally distributed. These may be specified with the `family` argument as a text string in the `fit_dfa()` function, e.g.

```r
f <- fit_dfa(..., family = "poisson")
```

The currently supported families can be specified as any of the following -- the link functions are currently hard-coded, and included in the table below.

```r
m <- cbind(
  c("gaussian", "lognormal", "gamma", "binomial", "poisson", "nbinom2"),
  c("identity", "log", "log", "logit", "log", "log")
)
colnames(m) <- c("Family", "Link")
knitr::kable(m)
```

By default, the loadings matrix in a DFA is constrained by zeros. For example, a 3-trend model applied to 5 time series would have a loadings matrix that was constrained as

```r
Z <- matrix("", 5, 3)
for (i in 1:5) {
  for (j in 1:3) {
    Z[i, j] <- paste0("z[", i, ",", j, "]")
  }
}
Z[1, 2:3] <- "0"
Z[2, 3] <- "0"
colnames(Z) <- c("Trend 1", "Trend 2", "Trend 3")
knitr::kable(Z)
```

As an alternative, we may wish to fit a model where each time series arises as a mixture of the trends. In this case, the loadings matrix would be

```r
Z <- matrix("", 5, 3)
for (i in 1:5) {
  for (j in 1:3) {
    Z[i, j] <- paste0("z[", i, ",", j, "]")
  }
}
colnames(Z) <- c("Trend 1", "Trend 2", "Trend 3")
knitr::kable(Z)
```

And the added constraint is that each row sums to 1, e.g.

$$\sum_{j=1}^{3} Z_{1,j} = 1$$
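As a quick base-R illustration of what this constraint means (not the package's internal parameterization, which samples the constrained rows directly), an unconstrained positive matrix can be normalized row-wise so each row is a mixture:

```r
# Illustrative only: convert positive unconstrained loadings into rows
# that sum to 1, i.e. each time series is a mixture of the trends
set.seed(1)
Z_raw  <- matrix(rexp(15), nrow = 5, ncol = 3)
Z_prop <- Z_raw / rowSums(Z_raw)
rowSums(Z_prop) # each row sums to 1
```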

For some models, it may be appropriate to include autoregressive or moving average components to model the latent trends. We've implemented first-order components for each, though by default these are not included.

To include the AR(1) component on the trend, you can specify

```r
fit <- fit_dfa(..., estimate_trend_ar = TRUE)
```

This results in a model where trend $i$ is modeled as

$$x_{i,t+1} = \phi_{i} x_{i,t} + \delta_{i,t}$$

Each trend is allowed to have a unique AR(1) parameter, $\phi_{i}$.

In conventional DFA models, the process deviations are assumed to be independent, e.g. $\delta_{i,t} \sim Normal(0, q_{i})$. By including a MA(1) component on the trends, these terms may instead be modeled as

$$\delta_{i,t+1} \sim Normal(\theta_{i}\delta_{i,t}, q_{i})$$

where $\theta_{i}$ is the trend-specific MA parameter and $q_{i}$ is the process variance, which is usually not estimated and instead fixed at 1.
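The MA(1) component is turned on the same way as the AR(1) component; the argument name `estimate_trend_ma` below follows the pattern of `estimate_trend_ar` (check `?fit_dfa` to confirm in your installed version):

```r
fit <- fit_dfa(..., estimate_trend_ma = TRUE)
```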

Finally, we might be interested in evaluating evidence for the estimated trends switching between multiple regimes. We've implemented HMMs in the functions `fit_regimes()` and `find_regimes()`. `fit_regimes()` fits an HMM with a pre-specified number of latent states, while `find_regimes()` loops over a range of models so that support can be evaluated with LOOIC.

We'll illustrate an example of the `fit_regimes()` function. Note that in the current version of the package, this has to be applied to each trend separately. Uncertainty estimates from the DFA trends may also be included as data (instead of estimated) -- this may be particularly useful for datasets with lots of missing values in portions of the time series (where the uncertainty balloons).

Applying the 2-regime model to the first trend,

```r
reg_mod <- fit_regimes(
  y = r$trends_mean[1, ],
  sds = (r$trends_upper - r$trends_mean)[1, ] / 1.96,
  n_regimes = 2,
  iter = 50, chains = 1
)
```

In addition to getting diagnostics and quantities like LOOIC out of the `reg_mod` object, we can plot the estimated states.

```r
plot_regime_model(reg_mod)
```

In this case, there might be some support for the 2-regime model. Sometimes (but not in this example) label switching makes the plots a little challenging to interpret, and the labels need to be reversed. We've included the argument `flip_regimes` to help with this,

```r
plot_regime_model(reg_mod, flip_regimes = TRUE)
```
