knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
In this vignette we will demonstrate how to use BayesTraj to estimate a Bayesian Dual GBTM with normal likelihoods. We will use simulated data in order to verify that the estimation routine can recover the true parameters, and to demonstrate how the data should be formatted before calling the estimation routines.
Begin by loading the BayesTraj library:
library(BayesTraj)
First, we will simulate data. This will not be necessary in your own projects, but it is useful both for testing the package and for using as a template for formatting your own datasets.
N = 1000  # number of units
T1 = 9    # time periods for Series 1
T2 = 9    # time periods for Series 2

pi1 = c(0.5, 0.2, 0.3)  # Series 1 group membership probabilities

# Transition matrix
pi1_2 = matrix(c(0.3,  0.3,  0.4,
                 0.49, 0.50, 0.01,
                 0.7,  0.2,  0.1),
               nrow = 3, ncol = 3, byrow = TRUE)

K1 = length(pi1)    # number of groups in Series 1
K2 = dim(pi1_2)[2]  # number of groups in Series 2

# Coefficients for Series 1
beta1 = matrix(c(110,  5, -0.5,
                 111, -2,  0.1,
                 118,  3,  0.1), nrow = 3, ncol = 3, byrow = TRUE)

# Coefficients for Series 2
beta2 = matrix(c(110,  6, -0.6,
                 111, -3,  0.1,
                 112,  2,  0.7), nrow = 3, ncol = 3, byrow = TRUE)

sigma1 = 2  # standard deviation of Series 1 outcomes
sigma2 = 4  # standard deviation of Series 2 outcomes

set.seed(1)
data = gen_data_dual(N = N,
                     T1 = T1,
                     T2 = T2,
                     pi1 = pi1,
                     pi2 = pi1_2,
                     beta1 = beta1,
                     beta2 = beta2,
                     sigma1 = sigma1,
                     sigma2 = sigma2,
                     poly = 2)  # degree of the time polynomial
In this example we have simulated data for 1,000 paired units with 9 time periods each, for a total of 18,000 observations. While we have set each unit to have observations in 9 time periods, the estimation function below allows the number of observations to vary across units. We have chosen the Series 1 group membership probabilities to be 50%, 20%, and 30%. pi1_2 is the transition matrix: pi1_2[i,j] represents the probability that the second pair member is in Group j conditional on the first pair member being in Group i. From this, the gen_data_dual function can infer that there should be three groups for both series.
Each row of the beta matrices defines the trajectory coefficients for the respective group. For example, the expected value at time $t$ in Series 1 Group 1 is $110+5t-0.5t^2$. sigma1 and sigma2 define the standard deviations of the outcomes in each series.
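To see the mapping from a row of beta1 to a trajectory, we can plot the true Series 1 Group 1 mean curve with base R:

t = 1:T1
mu11 = beta1[1, 1] + beta1[1, 2] * t + beta1[1, 3] * t^2  # 110 + 5t - 0.5t^2
plot(t, mu11, type = "b", xlab = "time", ylab = "expected outcome")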
When calling the gen_data_dual function, we also specify poly=2 to tell the model to use a second-degree polynomial in time. If there are more non-intercept columns of beta than poly, gen_data_dual will generate random covariates corresponding to the remaining columns. In general, the last poly columns of the beta matrices correspond to the polynomial coefficients.
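For illustration (a hypothetical matrix, not used below): with poly = 2, a four-column beta1 would imply one additional randomly generated covariate, whose coefficient sits between the intercept and the two polynomial terms:

# Hypothetical four-column coefficient matrix with poly = 2:
# column 1 = intercept, column 2 = coefficient on a generated covariate,
# columns 3-4 = coefficients on t and t^2 (the last poly columns)
beta1_wide = matrix(c(110,  1.5,  5, -0.5,
                      111, -0.8, -2,  0.1,
                      118,  0.3,  3,  0.1),
                    nrow = 3, ncol = 4, byrow = TRUE)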
Now let's take a look at the generated data. We can unpack the individual attributes from the data object.
X1 = data$X1
X2 = data$X2
y1 = data$Y1
y2 = data$Y2
While we will restrict our discussion to the first series, everything applies to the second series as well. The first 18 rows of X1 are:
print(head(X1,18))
The first column identifies the unit. For example, the first 9 rows correspond to unit 1, the second 9 rows correspond to unit 2, and so forth. The second column is the time variable. Rows 1 and 10 correspond to time 1, rows 2 and 11 correspond to time 2, and so forth. Similarly, the third column is the square of the time column.
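This layout is the template for your own datasets. A minimal sketch of building X1 and y1 from long-format data (df, id, time, and outcome are placeholder names here, not part of the package):

# One row per unit-period observation
df = data.frame(id = rep(1:2, each = 4),
                time = rep(1:4, times = 2),
                outcome = rnorm(8))
X1_own = cbind(df$id, df$time, df$time^2)  # unit id, time, time squared
y1_own = df$outcome                        # one outcome per row of X1_own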
Now we take a look at y1, the vector of outcomes. y1[1] corresponds to the outcome for unit 1 at time 1, y1[2] corresponds to the outcome for unit 1 at time 2, and so forth. The values of y1 must correspond with the rows of X1; therefore, the length of y1 must equal the number of rows of X1.
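A quick consistency check before estimation:

stopifnot(length(y1) == nrow(X1))  # one outcome per row of the design matrix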
print(head(y1,18))
We now turn our attention toward estimating the model. We can do this by calling the dualtraj function.
iter = 5000
thin = 1
z1 = matrix(1, nrow = K1, ncol = dim(X1)[2])
z2 = matrix(1, nrow = K2, ncol = dim(X2)[2])

model = dualtraj(X1 = X1,            # data matrix, Series 1
                 X2 = X2,            # data matrix, Series 2
                 y1 = y1,            # outcomes, Series 1
                 y2 = y2,            # outcomes, Series 2
                 K1 = K1,            # number of groups, Series 1
                 K2 = K2,            # number of groups, Series 2
                 z1 = z1,            # functional form matrix, Series 1
                 z2 = z2,            # functional form matrix, Series 2
                 iterations = iter,  # number of MCMC iterations
                 thin = thin,        # thinning interval
                 dispIter = 1000)    # print a message every 1000 iterations
Here we run the model for 5000 MCMC iterations. Setting the thin parameter to 1 tells the sampler to keep every draw. We could set thin=10, for example, to keep only 1 out of every 10 draws. Thinning is not necessary unless your computer has memory limitations. We also set dispIter=1000 to print a progress message every 1000 MCMC iterations, which helps us monitor the sampler as it runs.
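For instance, the number of stored draws is simply the iteration count divided by the thinning interval:

iter / thin  # 5000 stored draws with thin = 1
5000 / 10    # thin = 10 would store only 500 draws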
The only arguments we have not touched on yet are z1 and z2. z1 allows us to specify which columns of X1 should be included in the model for each group. If X1 has $d$ columns, then z1 should be a $K_1 \times d$ matrix. In practice, this allows us to specify different polynomial degrees for different groups. z1[i,j] indicates whether to include the $j$th column of X1 in Series 1 Group $i$'s model. The first column of z1 corresponds to the intercept and should always be set to 1. In this example, we have set z1[i,j]=1 for every (i,j), indicating that we would like to use a second-degree polynomial for all three groups. The same applies to z2.
Some researchers prefer to specify a different polynomial degree for each group. For example, a researcher who wants second-degree polynomials in Groups 1 and 2 but a first-degree polynomial in Group 3 could specify z1 as follows:
z1_ = matrix(c(1, 1, 1,
               1, 1, 1,
               1, 1, 0), nrow = 3, ncol = 3, byrow = TRUE)
The model object contains the MCMC samples for each of the model's parameters. We can access the samples as follows, where each row represents one iteration of the MCMC:
head(model$beta1[[1]])  # Series 1 Group 1's coefficients
head(model$beta1[[2]])  # Series 1 Group 2's coefficients
head(model$beta1[[3]])  # Series 1 Group 3's coefficients
head(model$sigma1)      # Series 1 variance - NOT the standard deviation
model$c1[1:6, 1:10]     # Series 1 unit-level group memberships
head(model$pi1)         # Series 1 group-membership probabilities
head(model$pi2)         # Series 2 group-membership probabilities
model$pi1_2[1,,]        # transition probabilities from Series 1 Group 1
model$pi12[1,,]         # joint probabilities of both series' group memberships
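For example, assuming the rows of model$c1 index MCMC draws and the columns index units (consistent with the slice printed above), each unit's most frequent group assignment can be computed as:

# Posterior modal group membership for each Series 1 unit
modal_c1 = apply(model$c1, 2, function(draws) which.max(tabulate(draws)))
head(modal_c1)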
A convenient way to summarize the posterior is with the summary_dual function:
burn = 0.9
summary = summary_dual(model, X1, X2, y1, y2, z1, z2, burn)
The burn parameter specifies the fraction of draws to keep. In this example, we keep the last 90% of the MCMC samples; the first 10% are discarded as the burn-in period.
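The same posterior means can be computed manually from the raw draws; a sketch, assuming thin = 1 so that each parameter has iter stored rows:

keep = (round((1 - burn) * iter) + 1):iter  # indices of the last 90% of draws
colMeans(model$beta1[[1]][keep, ])          # posterior means, Series 1 Group 1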
We can now print out a posterior summary to obtain the posterior mean, standard deviation, and 95% credible interval, as follows:
print(summary$estimates)
The notation for the transition and joint probabilities requires some explanation: 1->2,i->j denotes $\Pr(c_{2}=j \mid c_{1}=i)$, and 2->1,i->j denotes $\Pr(c_{1}=j \mid c_{2}=i)$. The final columns, labeled $i,j$, denote the joint probabilities $\Pr(c_{1}=i \cap c_{2}=j)$.
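Because the data are simulated, we can check the reported joint probabilities against the truth, using the identity $\Pr(c_{1}=i \cap c_{2}=j) = \Pr(c_{1}=i)\,\Pr(c_{2}=j \mid c_{1}=i)$:

# True joint membership probabilities implied by pi1 and the transition matrix
pi12_true = diag(pi1) %*% pi1_2  # [i, j] = Pr(c1 = i) * Pr(c2 = j | c1 = i)
print(pi12_true)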
One issue with GBTMs is the tendency of estimation routines to find a local mode that is not globally optimal; this is a problem for GBTMs estimated by maximum likelihood as well. To increase the probability that we have found a global rather than a merely local optimum, we often run the Gibbs sampler with several seeds and print the likelihood evaluated at the posterior mean:
print(summary$log.likelihood)
We then use the seed that maximizes the likelihood. This procedure has no optimality guarantees, but we have found that it often reaches better optima than existing packages based on maximum likelihood.
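A sketch of this multi-seed strategy, reusing the dualtraj and summary_dual calls from above:

best_ll = -Inf
for (s in 1:5) {
  set.seed(s)
  fit = dualtraj(X1 = X1, X2 = X2, y1 = y1, y2 = y2,
                 K1 = K1, K2 = K2, z1 = z1, z2 = z2,
                 iterations = iter, thin = thin, dispIter = 1000)
  ll = summary_dual(fit, X1, X2, y1, y2, z1, z2, burn)$log.likelihood
  if (ll > best_ll) {
    best_ll = ll  # keep the fit with the highest posterior-mean likelihood
    best_fit = fit
  }
}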
The main drawback of the Bayesian approach is the tendency for label switching and mode switching. In the label-switching problem, the group labels switch in the middle of the algorithm. As a consequence, the group labeled "1" for the first 1000 draws may be labeled "2" in the second 1000 draws and vice versa, which would render any posterior summary of the coefficients meaningless. In our experience, label switching has not been a problem; however, switching between local modes during sampling has occasionally been an issue.
There is no consensus on the best way to deal with label switching and mode switching. Either problem can be detected by plotting the draws sequentially and checking for sudden, sustained breaks in the trend. For example, the plot below looks consistent throughout the post-burn-in samples:
plot(model$beta1[[1]][1000:5000, 1], type = "l")
This indicates that neither label switching nor mode switching occurred.
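Overlaying the intercept traces of all three groups makes a label swap easier to spot, since two chains would abruptly exchange levels:

# Trace plots of the intercept for each Series 1 group
plot(model$beta1[[1]][, 1], type = "l", xlab = "iteration", ylab = "intercept",
     ylim = range(model$beta1[[1]][, 1], model$beta1[[2]][, 1],
                  model$beta1[[3]][, 1]))
lines(model$beta1[[2]][, 1], col = 2)
lines(model$beta1[[3]][, 1], col = 3)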
If we do observe a sudden break, there are several possible remedies. In our experience, re-estimating the model with a different seed usually solves the problem with the least amount of effort.