# Estimating process trend variability with bayesdfa In bayesdfa: Bayesian Dynamic Factor Analysis (DFA) with 'Stan'

A constraint of most DFA models is that the latent trends are modeled as random walks with fixed standard deviations (= 1). We can evaluate the ability to estimate these parameters in a Bayesian context below.

```library("knitr")
opts_chunk\$set(message = FALSE, fig.width = 5.5)
```

```library(bayesdfa)
library(ggplot2)
library(dplyr)
library(rstan)
chains = 1
iter = 10
```

## Case 1: unequal trend variability

First, let's simulate some data. We will use the built-in function `sim_dfa()`, but normally you would start with your own data. We will simulate 20 data points from 4 time series, generated from 2 latent processes. For this first dataset, the data won't include extremes, and loadings will be randomly assigned (default).

```set.seed(1)
sim_dat <- sim_dfa(
num_trends = 2,
num_years = 20,
num_ts = 4
)
```

We'll tweak the default code though to model slightly different random walks, with different standard deviations for each trend. We'll assume these standard deviations are 0.3 and 0.5, respectively.

```set.seed(1)
sim_dat\$x[1,] = cumsum(rnorm(n=ncol(sim_dat\$x), 0, 0.1))
sim_dat\$x[2,] = cumsum(rnorm(n=ncol(sim_dat\$x), 0, 1))
```

Looking at the random walks, we can see the 2nd trend (red dashed line) is more variable than the black line.

```r"} matplot(t(sim_dat\$x), type = "l", ylab = "Response", xlab = "Time" )

```Next, we'll calculate the predicted values of each time series and add observation error
```r
sim_dat\$pred = sim_dat\$Z %*% sim_dat\$x
for(i in 1:nrow(sim_dat\$pred)) {
for(j in 1:ncol(sim_dat\$pred)) {
sim_dat\$y_sim[i,j] = sim_dat\$pred[i,j] + rnorm(1,0,0.1)
}
}
```

### Candidate models

Next, we'll fit a 2-trend DFA model to the simulated time series using the `fit_dfa()` function. This default model fixed the variances of the trends at 1 -- and implicitly assumes that they have equal variance.

```f1 <- fit_dfa(
y = sim_dat\$y_sim, num_trends = 2, scale="zscore",
iter = iter, chains = chains, thin = 1
)
r1 <- rotate_trends(f1)
```

Next, we'll fit the model with process errors being estimated -- and assume unequal variances by trend.

```f2 <- fit_dfa(
y = sim_dat\$y_sim, num_trends = 2, scale="zscore", estimate_process_sigma = TRUE,
equal_process_sigma = FALSE,
iter = iter, chains = chains, thin = 1
)
r2 <- rotate_trends(f2)
```

Third, we'll fit the model with process errors being estimated, unequal variances by trend, and not z-score the data (this is just a test of the scaling).

```f3 <- fit_dfa(
y = sim_dat\$y_sim, num_trends = 2, scale="center", estimate_process_sigma = TRUE,
equal_process_sigma = FALSE,
iter = iter, chains = chains, thin = 1
)
r3 <- rotate_trends(f3)
```

Our fourth and fifth models will be the same as # 2 and # 3, but will estimate a single variance for both trends.

```f4 <- fit_dfa(
y = sim_dat\$y_sim, num_trends = 2, scale="zscore", estimate_process_sigma = TRUE,
equal_process_sigma = TRUE,
iter = iter, chains = chains, thin = 1
)
r4 <- rotate_trends(f4)

f5 <- fit_dfa(
y = sim_dat\$y_sim, num_trends = 2, scale="center", estimate_process_sigma = TRUE,
equal_process_sigma = TRUE,
iter = iter, chains = chains, thin = 1
)
r5 <- rotate_trends(f5)
```

As a reminder, let's look at the loadings from the original simulation model

```print(round(sim_dat\$Z,2))
```

The estimated loadings from the DFA where the trends are forced to have the same fixed variance are good

```print(round(r1\$Z_rot_mean,2))
```

but some of the loadings are far off. These loadings are also not well estimated for either of the models that estimate the process variances,

```print(round(r2\$Z_rot_mean,2))
```

or

```print(round(r3\$Z_rot_mean,2))
```

```print(round(r4\$Z_rot_mean,2))
```

and Model 5 by

```print(round(r5\$Z_rot_mean,2))
```

If we calculate the RMSE of the different models, model # 3 (estimated process trends, raw data not standardized) performs the best

```m = matrix(0,5,2)
m[,1] = 1:5
m[1,2] = sum((r1\$Z_rot_mean - sim_dat\$Z)^2)
m[2,2] = sum((r2\$Z_rot_mean - sim_dat\$Z)^2)
m[3,2] = sum((r3\$Z_rot_mean - sim_dat\$Z)^2)
m[4,2] = sum((r4\$Z_rot_mean - sim_dat\$Z)^2)
m[5,2] = sum((r5\$Z_rot_mean - sim_dat\$Z)^2)
knitr::kable(m)
```

### Recovering trends

Let's do the same kind of summary with the trends

```m = matrix(0,5,2)
m[,1] = 1:5
m[1,2] = sum((r1\$trends_mean - sim_dat\$x)^2)
m[2,2] = sum((r2\$trends_mean - sim_dat\$x)^2)
m[3,2] = sum((r3\$trends_mean - sim_dat\$x)^2)
m[4,2] = sum((r4\$trends_mean - sim_dat\$x)^2)
m[5,2] = sum((r5\$trends_mean - sim_dat\$x)^2)
colnames(m) = c("Model", "RMSE-trends")
knitr::kable(m)
```

These show that model 3 (trend variances are estimated, with data not standardized) performs best

### Summary

In this example, the estimation model that treated the variances of the random walks as parameters performed better than models that didn't. A caveat is that we simulated the random walk variances to be an order of magnitude difference between the 2 trends, more similar trends would need additional simulations and more thorough validation studies. Similarly, we found that not standardizing the raw time series data (instead, just centering each time series) yielded estimates of the loadings and estimated trends that were more similar to those in the simulation model. Standardizing each time series a priori yields slightly worse estimates of the trends and loadings (comparing models 2 v 3 and 4 v 5).

## Try the bayesdfa package in your browser

Any scripts or data that you put into this service are public.

bayesdfa documentation built on May 29, 2021, 1:06 a.m.