The following document provides R
code for fitting shrinkage prior Markov random field models using the spmrf
package applied to coalescent time data for phylodynamic inference of effective population size trajectories, as described in Faulkner et al. (2018). See the Introduction to the spmrf package for instructions on installation and more details about the package. Note that the spmrf
function only operates on coalescent time data and cannot be used directly on sequence data nor to infer parameters for trait evolution models.
Make sure you have spmrf
and rstan
installed, then load the libraries.
library(spmrf) library(rstan)
We will start with a set of simulated coalescent data generated from a scenario used in Faulkner et al. (2018). This example represents an extreme bottleneck in effective population size. First we simulate the coalescent data given the specified population trajectory, set up the grid over which we will estimate effective population size, and create the data list for input to spmrf
. Note that the effective population sizes are very small and that was done to decrease the coalescent times to make the time scale more compact.
# set random number seed for reproducibility set.seed(5) # Generate simulated data nsamp <- 500 #number of samples total nstart <- 50 #number of samples at time zero nbndry <- 101 #number of grid cell boundaries (number of cells plus 1) samp.end <- 8 #last potential sample time samptv <- c(0, sort(runif(nsamp-nstart, 0, samp.end)) ) #vector of sample times nsampv <- c(nstart, rep(1, nsamp-nstart)) #vector of number sampled # simulate coalescent times coaldat <- coaltimeSim(samp_times = samptv, n_sampled = nsampv, traj = bottleNeck_traj, lower_bound = 0.1, ne.max=1, ne.min=0.1, bstart=6, bend=4 ) # Calculate a grid for estimation sgrid <- makeGrid(coal_times = coaldat$coal_times, samp_times = samptv, Ngrid=nbndry) # Make data set for input to spmrf cdat <- make_coalescent_data(samp_times = samptv, n_sampled = nsampv, coal_times = coaldat$coal_times, grid = sgrid$grid)
We can take a look at what the population trajectory looks like by plotting it over the midpoints of the grid. Note that we reverse the coordinates of the x-axis so that we go backwards in time from the present.
# Calculate trajectory over the grid truetraj <- bottleNeck_traj(t = sgrid$midpts, lower_bound = 0.1, ne.max=1, ne.min=0.1, bstart=6, bend=4) # Plot plot(sgrid$midpts, truetraj, xlim=rev(range(sgrid$midpts)), ylim=c(0, 1.2), type="l", col="red", xlab="Time before present", ylab="Effective population size")
Next we will fit the model using two different priors for the effective population size. The first prior is the standard Gaussian Markov random field (GMRF) and the second is the horseshoe Markov random field (HSMRF). The first step is to calculate a reasonable value for the hyperparameter zeta
, which controls the scale of the half-Cauchy prior that controls the global level of smoothing in the models. We will use the same value of zeta
for both models. We then set up the MCMC settings and run the models. Note that once spmrf
builds the model code and calls stan, the model code gets transfered to C++ and compiled. This process takes about 30 seconds for each model. Note that stan
may show some warnings at the beginning which can be ignored. For the GMRF model, once the code compilation was complete, each chain took approximately 45 seconds to run on my machine, for a total run time of about 4 minutes. For the HSMRF model, each chain took about 2-3 minutes to run, for a total run time of about 12 minutes. Once the sampling is complete for each model, we then capture the posterior samples and get posterior summaries for the effective population size parameters.
# Set hyperparameter for global scale zeta <- set_zeta_phylo(phylo = coaldat, ncell = 100, alpha = 0.05, order = 1) # Parameters to keep pars.G <- c("theta", "gam") pars.H <- c("theta", "tau", "gam") # MCMC settings nchain <- 4 #number of chains ntotsamp <- 2000 #total number of samples to keep across all chains nthin <- 2 #thinning level nburn <- 1000 #warm-up / burn-in iterations per chain niter <- (ntotsamp/nchain)*nthin + nburn #total iterations to run # Run models fit.G <- spmrf(prior="normal", likelihood="coalescent", order=1, data=cdat, par=pars.G, chains=nchain, warmup=nburn, thin=nthin, iter=niter, control=list(adapt_delta=0.98, max_treedepth=14), zeta=zeta) fit.H <- spmrf(prior="horseshoe", likelihood="coalescent", order=1, data=cdat, par=pars.H, chains=nchain, warmup=nburn, thin=nthin, iter=niter, control=list(adapt_delta=0.995, max_treedepth=14), zeta=zeta) # Extract posterior draws pout.G <- as.array(fit.G) pout.H <- as.array(fit.H) # Get posterior summary for theta th.G <- extract_theta(fit.G, obstype="coalescent") th.H <- extract_theta(fit.H, obstype="coalescent")
Next we check some posterior diagnostics of the parameters. The print.stanfit
function (called with print
) from the rstan
package provides posterior summaries of the parameters and effective sample size estimates and Gelman-Rubin statistics. We use the plot_trace
function from the spmrf
package to check the traces of a few parameters for the horseshoe model. Two of those plots are shown below.
# Print parameter summaries print(fit.G, pars=pars.G) print(fit.H, pars=pars.H) # Some example trace plots for the horseshoe model plot_trace(pout.H, "theta[10]", pscale="original", stack=TRUE, colset="black") plot_trace(pout.H, "tau[10]", pscale="log", stack=TRUE, colset="black") plot_trace(pout.H, "gam", pscale="log", stack=TRUE, colset="black")
Finally we will plot the posterior median trajectories and 95% credible intervals for each model and compare them to the true trajectory that generated the data. We use the plot_trend
function from the spmrf
package to plot the posterior medians and 95% credible intervals for the two models. See the documentation for plot_trend
for more details on the available options. The code below produces a PNG file, but you can simply pull out the calls to plot_trend
for your own use.
It is clear from the plots that the GMRF model cannot account for the rapid changes in population size without comprimising for the smoothness of the field elsewhere. The HSMRF does a better job in this regard and has narrower credible intervals.
xrng <- rev(range(sgrid$midpts)) yrng <- c(0,1.8) png(filename='phylo_bottleneck_posterior_plots.png', width=1500, height=600, res=200) par(mfrow=c(1,2), mar=c(2,1.5,1.5,1), oma=c(2,2,0,0)) plot_trend(theta=th.G, obstype="coalescent", xvar=sgrid$midpts, main="GMRF-1", xlab="", ylab="", xlim=xrng, ylim=yrng) lines(sgrid$midpts, truetraj, lwd=2, lty=2, col="red") plot_trend(theta=th.H, obstype="coalescent", xvar=sgrid$midpts, main="HSMRF-1", xlab="", ylab="", xlim=xrng, ylim=yrng) lines(sgrid$midpts, truetraj, lwd=2, lty=2, col="red") legend(x="topright", legend=c("Median", "95% BCI", "Truth"), col=c("blue","lightblue", "red"), lwd=3, lty=c(1,1,2), bty="n", cex=0.8) mtext(side=1, outer=T, line=1, text="Time before present", font=2, cex=0.8) mtext(side=2, outer=T, line=1, text="Effective population size", font=2, cex=0.8) dev.off()
This example uses coalescent data from a fixed genealogy estimated from sequence data from Egypt. This is similar to the full analysis done on sequence data in Faulkner et al. (2018). See Faulkner et al. (2018) or the documentation for the hcv
data in the spmrf
package for a detailed description of the data and relevant references.
First we read in the summarized coalescent data. These data were created by summarizing a maximum clade credibility tree and consist of sample times (all at time 0, which corresponds to the year 1993), number sampled at each time, and the coalescent times. Alternatively, one could start with a tree and pull out the same information using tools in the phylodyn
package or the ape
package. After the data are read in, we create a grid with 75 cells and create the data list for input into the spmrf
function.
# load the hcv data data(hcv) # turn into a data list hcvlist <- list(samp_times=hcv$samp_times[1], n_sampled=hcv$n_sampled[1], coal_times=hcv$coal_times) h.ngrid <- 75 # number of grid cells hgrid <- makeGrid(hcvlist$coal_times, hcvlist$samp_times, Ngrid=(h.ngrid+1)) # make spmrf data set hdat <- make_coalescent_data(samp_times=hcvlist$samp_times, n_sampled=hcvlist$n_sampled, coal_times=hcvlist$coal_times, grid=hgrid$grid)
Next we fit the GMRF and HSMRF models of order 1 to the hcv data. Make sure that the rstan
package is installed and loaded before proceding. First we calculate a value for the hyperparameter zeta
, as was done in the previous example. This time we set the probability of exceeding the variance in the skyline estimates to be 0.01 instead of the 0.05 used in the previous example. This is because the skyline estimates for this example are quite variable. We then set up the MCMC settings and run the models. The GMRF model should complete in a few minutes, and the HSMRF model will take several minutes. The HMC control parameter adapt_delta
is made more strict (closer to 1) for the HSMRF for this example because there is a rather large period of time with no coalescent events in the data, which results in the potential for very rapid increases in effective population size estimates during that period. This creates a more difficult posterior for HMC to traverse for the HSMRF, and so stricter adaptation is needed. There are still quite a few "divergent transitions" that result with these settings, so more strict values of adapt_delta
, such as 0.999, could be tried, but the effect of the divergent transistions is minimal in this case. Once the sampling is complete for each model, we then capture the posterior samples and get posterior summaries for the effective population size parameters.
# Set hyperparameter for global scale zeta.h <- set_zeta_phylo(phylo = hcvlist, ncell = 75, alpha = 0.01, order = 1) # Parameters to keep pars.G <- c("theta", "gam") pars.H <- c("theta", "tau", "gam") # MCMC settings nchain <- 4 #number of chains ntotsamp <- 3000 #total number of samples to keep across all chains nthin <- 1 #thinning level nburn <- 1500 #warm-up / burn-in iterations per chain niter <- (ntotsamp/nchain)*nthin + nburn #total iterations to run # Run models fit.Gh <- spmrf(prior="normal", likelihood="coalescent", order=1, data=hdat, par=pars.G, chains=nchain, warmup=nburn, thin=nthin, iter=niter, control=list(adapt_delta=0.98, max_treedepth=15), zeta=zeta.h) fit.Hh <- spmrf(prior="horseshoe", likelihood="coalescent", order=1, data=hdat, par=pars.H, chains=nchain, warmup=nburn, thin=nthin, iter=niter, control=list(adapt_delta=0.995, max_treedepth=15), zeta=zeta.h) # Extract posterior draws pout.Gh <- as.array(fit.Gh) pout.Hh <- as.array(fit.Hh) # Get posterior summary for theta th.Gh <- extract_theta(fit.Gh, obstype="coalescent") th.Hh <- extract_theta(fit.Hh, obstype="coalescent")
As with the previous example, we can use the print
and plot_trace
functions to diagnose the posterior samples. We just show traces for a few parameters for the HSMF model, but obviously one could plot any parameter of interest.
# Print parameter summaries print(fit.Gh, pars=pars.G) print(fit.Hh, pars=pars.H) # Some example trace plots for the horseshoe model plot_trace(pout.Hh, "theta[17]", pscale="original", stack=TRUE, colset="black") plot_trace(pout.Hh, "tau[17]", pscale="log", stack=TRUE, colset="black") plot_trace(pout.Hh, "gam", pscale="log", stack=TRUE, colset="black")
Finally we will plot the posterior median trajectories and 95% credible intervals for each model using the plot_trend
function. The code below produces a PNG file, but you can simply pull out the calls to plot_trend
for your own use.
The plots show that the resulting median population trajectories are fairly similar, with the exception that the HSMRF is a little smoother in the increase over the most recent 50 years and does not show as much decrease in the decline approaching time zero. The biggest difference is obviously the uncertainty in population size captured by the HSMRF in the range of 100 to 200 years before the present. This is due to the lack of coalescent times during that period, which could be a data anomaly (more likely) or could be due to large population size at that time (less likely).
hxrng <- rev(range(hgrid$midpts)) hyrng <- range(th.Gh, th.Hh) png(filename='vignettes/figure/hcv_posterior_plots.png', width=1500, height=600, res=200) par(mfrow=c(1,2), mar=c(2,1.5,1.5,1), oma=c(2,2,0,0)) plot_trend(theta=th.Gh, obstype="coalescent", xvar=hgrid$midpts, main="GMRF-1", xlab="", ylab="", xlim=hxrng, ylim=hyrng, log="y") legend(x="topright", legend=c("Median", "95% BCI"), col=c("blue","lightblue"), lwd=3, bty="n", cex=0.8) plot_trend(theta=th.Hh, obstype="coalescent", xvar=hgrid$midpts, main="HSMRF-1", xlab="", ylab="", xlim=hxrng, ylim=hyrng, log="y") mtext(side=1, outer=T, line=1, text="Time before present", font=2, cex=0.8) mtext(side=2, outer=T, line=1, text="Effective population size", font=2, cex=0.8) dev.off()
We can take advantage of the loo
package (Vehtari et al. 2019) to calculate information criteria such as WAIC (Watanabe information criterion) and LOOIC (leave one out information criterion) for purposes of model comparison (see Vehtari et al. (2017) methods used in the loo
package). To do this, we need to explicitly calculate the log-likelihood values associated with the individual observational units in the model. Note that the default in spmrf
is not to calcualte or save these individual pointwise likelihood components due to the additional memory and processing time required. The loo
packages uses these individual likelihood components in the calculations of WAIC and LOOIC. We will demonstrate an example of this using the HCV data. Before moving forward you will need to install the loo
package if you have not already.
Building on the code used for the HCV example in the previous section, we simply need to tell the spmrf
function to calculate the individual log-likelihood values, which is done by specifying the argument save.loglik=TRUE, and specify the resulting "log_lik" variable in our list of parameters to keep. Here we fit both the first-order and second-order models to to the HCV data to allow a broader set of models to compare.
library(loo) # Set hyperparameters for global scale zeta.h1 <- set_zeta_phylo(phylo = hcvlist, ncell = 75, alpha = 0.01, order = 1) zeta.h2 <- set_zeta_phylo(phylo = hcvlist, ncell = 75, alpha = 0.01, order = 2) # Parameters to keep pars.Ghw <- c("theta", "gam", "log_lik") pars.Hhw <- c("theta", "tau", "gam", "log_lik") # MCMC settings nchain <- 4 #number of chains ntotsamp <- 3000 #total number of samples to keep across all chains nthin <- 1 #thinning level nburn <- 1500 #warm-up / burn-in iterations per chain niter <- (ntotsamp/nchain)*nthin + nburn #total iterations to run per chain # -------- Run models --------- # -- first-order fit.Gh1 <- spmrf(prior="normal", likelihood="coalescent", order=1, data=hdat, par=pars.Ghw, chains=nchain, warmup=nburn, thin=nthin, iter=niter, control=list(adapt_delta=0.995, max_treedepth=15), zeta=zeta.h1, save.loglik=TRUE) fit.Hh1 <- spmrf(prior="horseshoe", likelihood="coalescent", order=1, data=hdat, par=pars.Hhw, chains=nchain, warmup=nburn, thin=nthin, iter=niter, control=list(adapt_delta=0.995, max_treedepth=15), zeta=zeta.h1, save.loglik=TRUE) # -- second-order fit.Gh2 <- spmrf(prior="normal", likelihood="coalescent", order=2, data=hdat, par=pars.Ghw, chains=nchain, warmup=nburn, thin=nthin, iter=niter, control=list(adapt_delta=0.995, max_treedepth=15), zeta=zeta.h2, save.loglik=TRUE) fit.Hh2 <- spmrf(prior="horseshoe", likelihood="coalescent", order=2, data=hdat, par=pars.Hhw, chains=nchain, warmup=nburn, thin=nthin, iter=niter, control=list(adapt_delta=0.995, max_treedepth=15), zeta=zeta.h2, save.loglik=TRUE) # ------- Calculate WAIC ---------------- # extract log-likelihoods log_lik_G1 <- extract_log_lik(fit.Gh1) log_lik_H1 <- extract_log_lik(fit.Hh1) log_lik_G2 <- extract_log_lik(fit.Gh2) log_lik_H2 <- extract_log_lik(fit.Hh2) waic_G1 <- waic(log_lik_G1) waic_H1 <- waic(log_lik_H1) waic_G2 <- waic(log_lik_G2) waic_H2 <- waic(log_lik_H2) waic_G1$estimates waic_H1$estimates waic_G2$estimates waic_H2$estimates print(compare(waic_G1, waic_H1), digits = 2) print(compare(waic_G2, waic_H2), digits = 2) print(compare(waic_H1, waic_H2), digits = 2)
The results for this particular data set indicate that the order-1 GMRF model had a lower WAIC than the order-1 HSMRF model, but the order-2 HSMRF model had a lower WAIC than the order-2 GMRF model. When comparing HSMRF models, the order-2 would be slightly preferred over the order-1. Note that the estimated standard errors on these WAIC estimates are fairly large, so there is really no practical difference between models in terms of these metrics.
Faulkner, J.R., A.F. Magee, B. Shapiro, and V.N. Minin. 2018. Horseshoe-based Bayesian nonparametric estimation of effective population size trajectories. Preprint arXiv:1808.04401v2 [stat.ME]
Vehtari, A., A. Gelman, and J. Gabry. 2017. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing 27(5):1413-1432.
Vehtari, A., J. Gabry, Y. Yao, and A. Gelman. 2019. loo: efficient leave-one-out cross-validation and WAIC for Bayesian models. R package version 2.1.0. https://CRAN.R-project.org/package=loo.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.