set.seed(3)
library(MALECOT)

This vignette demonstrates model comparison in MALECOT. It covers:

  1. Running thermodynamic MCMC
  2. Checking convergence of thermodynamic MCMC
  3. Estimating K via generalised thermodynamic integration (GTI)
  4. Comparing parameter sets through the model evidence

This tutorial assumes some prior knowledge about MALECOT, so if you are completely new to the program we recommend working through the simpler bi-allelic tutorial first.

Background on Bayesian model comparison

When carrying out Bayesian analysis it is useful to make a distinction between two types of analysis:

  1. parameter estimation within a model
  2. comparison between models

As an example, we might create a model $\mathcal{M}$ in which we write down the probability of the observed data $x$ as a function of the unknown allele frequencies $p$. In other words, we write down the likelihood $\mbox{Pr}(x \: | \: p, \mathcal{M})$.

In Bayesian parameter estimation we are trying to get at the posterior probability $\mbox{Pr}(p \: | \: x, \mathcal{M})$. We typically do this using MCMC, which produces a series of draws from the posterior distribution.
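
By Bayes' theorem this posterior can be written as

$$\mbox{Pr}(p \: | \: x, \mathcal{M}) = \frac{\mbox{Pr}(x \: | \: p, \mathcal{M}) \: \mbox{Pr}(p \: | \: \mathcal{M})}{\mbox{Pr}(x \: | \: \mathcal{M})}$$

The appeal of MCMC is that it lets us draw from this distribution without ever evaluating the normalising constant $\mbox{Pr}(x \: | \: \mathcal{M})$ in the denominator - a quantity that turns out to be central to model comparison.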

But what if we want to know the posterior probability of the model, rather than the parameters of the model? In other words, we want to know $\mbox{Pr}(\mathcal{M} \: | \: x)$. Calculating this quantity requires that we integrate over all unknown parameters:

$$\mbox{Pr}(\mathcal{M} \: | \: x) = \int \mbox{Pr}(\mathcal{M}, p \: | \: x) dp$$

This is often an extremely high-dimensional integral - for example in MALECOT there could easily be hundreds of unknown parameters - making it infeasible to evaluate by most standard numerical methods. Regular MCMC cannot help us here because it only produces draws from the posterior distribution, not the normalised values we need.
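
Equivalently, by Bayes' theorem at the level of models,

$$\mbox{Pr}(\mathcal{M} \: | \: x) \propto \mbox{Pr}(x \: | \: \mathcal{M}) \: \mbox{Pr}(\mathcal{M}), \qquad \mbox{Pr}(x \: | \: \mathcal{M}) = \int \mbox{Pr}(x \: | \: p, \mathcal{M}) \: \mbox{Pr}(p \: | \: \mathcal{M}) \: dp$$

where the second term, known as the model evidence, is the quantity that the thermodynamic methods below actually estimate.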

One way around this is to use an advanced MCMC technique known as thermodynamic integration (TI). In TI we run multiple MCMC chains, each at a different "rung" on a temperature ladder. The hotter the chain, the flatter the target distribution. The log-likelihoods over all chains are then combined in a single calculation that - by what can only be described as mathematical magic! - is asymptotically equal to $\log[\mbox{Pr}(x \: | \: \mathcal{M})]$. We can then apply a prior over models, for example giving each model equal weight, to arrive at the desired posterior value $\mbox{Pr}(\mathcal{M} \: | \: x)$. Generalised thermodynamic integration (GTI) differs from regular TI in that it uses a slightly different calculation when combining information across rungs that leads to lower bias and higher precision.
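
For reference, the identity at the heart of standard TI (GTI modifies the weights applied to each rung) writes the log-evidence as an integral over an inverse-temperature $\beta$ running from 0 to 1:

$$\log\left[\mbox{Pr}(x \: | \: \mathcal{M})\right] = \int_0^1 E_{\beta}\big[\log \mbox{Pr}(x \: | \: p, \mathcal{M})\big] \: d\beta$$

where the expectation at inverse-temperature $\beta$ is taken with respect to the "power posterior" proportional to $\mbox{Pr}(x \: | \: p, \mathcal{M})^{\beta} \: \mbox{Pr}(p \: | \: \mathcal{M})$. Each rung of the temperature ladder provides MCMC draws at one value of $\beta$, so the integral can be approximated from the mean log-likelihood at each rung.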

Hopefully this is enough background to run thermodynamic MCMC in MALECOT, and to understand the results. For those eager to understand all the mathematical details, see this vignette.

Running a thermodynamic MCMC

For the sake of this tutorial we will use simulated bi-allelic data drawn from $K = 3$ subpopulations. We create a new project and bind this data:

# simulate data
mysim <- sim_data(data_format = "biallelic", n = 100, L = 24, K = 3)
# secretly save/load data from file so results are consistent
#saveRDS(mysim, file = "../ignore/inst/extdata/tutorial4_mysim.rds")
if (file.exists("../ignore/inst/extdata/tutorial4_mysim.rds")) {
  mysim <- readRDS("../ignore/inst/extdata/tutorial4_mysim.rds")
}
# create project and bind data
myproj <- malecot_project()
myproj <- bind_data_biallelic(myproj, df = mysim$data, ID_col = 1, pop_col = 2)

For our first parameter set we will assume a very basic model with default parameters, meaning a Poisson prior on COI, a flat prior on allele frequencies, and no error estimation:

# create parameter set
myproj <- new_set(myproj, name = "simple model")

We will then run the MCMC for values of $K$ from 1 to 5. What makes this thermodynamic MCMC rather than regular MCMC is the rungs argument, which dictates the number of rungs on the temperature ladder. We will opt for 10 rungs for now. Be warned - this MCMC will take considerably longer to run than regular MCMC, so be prepared to go make yourself a cup of tea!

# run thermodynamic MCMC
myproj <- run_mcmc(myproj, K = 1:5, burnin = 1e4, converge_test = 1e2,
                   samples = 1e4, rungs = 10, pb_markdown = TRUE)
# secretly save/load data from file so results are consistent
#saveRDS(myproj, file = "../ignore/inst/extdata/tutorial4_myproj0.rds")
if (file.exists("../ignore/inst/extdata/tutorial4_myproj0.rds")) {
  myproj <- readRDS("../ignore/inst/extdata/tutorial4_myproj0.rds")
}

Notice that convergence times are much longer than under regular MCMC. This is because the MCMC is only deemed to have converged when every chain has converged, meaning we are only as strong as the weakest link. When running thermodynamic MCMC it is therefore important to keep an eye on convergence, and to increase the number of burn-in iterations if needed.

Checking thermodynamic MCMC behaviour

There are more moving parts in thermodynamic MCMC, meaning we have more things to check. First, we should perform the same diagnostic checks as for regular MCMC:

plot_loglike_diagnostic(myproj, K = 3)

The plot_loglike_diagnostic() function uses the cold rung by default (in this case the 10th rung), or we can use the rung argument to produce plots for any given rung, for example:

plot_loglike_diagnostic(myproj, K = 3, rung = 1)

Running get_ESS() now prints out the effective sample size of every rung:

get_ESS(myproj, K = 3)

We interpret ESS for hot chains the same way as for the cold chain - as the number of independent samples that we have obtained from the target distribution once autocorrelation has been accounted for. If we see small values of the ESS (less than one thousand as a rule of thumb) then we should be concerned that that particular rung has not explored the space well, and we should repeat the analysis with a larger number of samples. For the sake of this tutorial we will load in the result of re-running this MCMC with samples = 1e5, which took around 20 minutes.

# re-run MCMC
#myproj <- run_mcmc(myproj, K = 1:5, burnin = 1e4, converge_test = 1e2, samples = 1e5, rungs = 10)

# secretly save/load data from file so results are consistent
#saveRDS(myproj, file = "../ignore/inst/extdata/tutorial4_myproj.rds")
if (file.exists("../ignore/inst/extdata/tutorial4_myproj.rds")) {
  myproj <- readRDS("../ignore/inst/extdata/tutorial4_myproj.rds")
}
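
Before moving on, a quick programmatic check can flag any rungs whose ESS remains uncomfortably small. This is a minimal sketch that assumes get_ESS() returns a numeric vector with one value per rung:

# store ESS values for all rungs at K = 3
ess <- get_ESS(myproj, K = 3)
# flag rungs below the rule-of-thumb threshold of one thousand
which(ess < 1000)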

A useful plotting function when working with multiple rungs is the plot_loglike() function, which plots the 95% quantiles of the log-likelihood over every rung:

plot_loglike(myproj, K = 3)

This plot should increase steadily from left to right, aside from small variations due to the random nature of MCMC. If any rungs stand out from the overall pattern then it is worth running plot_loglike_diagnostic() on those particular rungs, as they may have mixed badly. As with all other checks, this should be performed on every explored value of $K$.
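
For example, assuming the plotting functions return ggplot objects (as used with gridExtra further down), the same check can be looped over every fitted value of $K$:

# cycle the log-likelihood plot over all values of K explored by the MCMC
for (k in 1:5) {
  print(plot_loglike(myproj, K = k))
}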

The GTI method takes the log-likelihood values above and multiplies them by weights, leading to a GTI "path". The area between this path and the zero line is our estimate of the log-evidence. We can visualise this path using the plot_GTI_path() function:

plot_GTI_path(myproj, K = 3)

As our evidence estimate is derived from the area between the line and zero, it is important that this path is not too jagged. We are essentially approximating a smooth curve with a series of straight line segments, so if the straight lines cut the corners off the smooth curve then the area will be too large or too small, leading to a biased estimate. There are two ways that we can mitigate this:

  1. Increase the number of rungs
  2. Change the GTI_pow argument

Increasing the number of rungs will obviously lead to a smoother path, and it will also help with MCMC mixing, but it comes at the cost of slowing down the MCMC. The GTI_pow argument changes the curvature of the path by modifying the log-likelihood weights, with large values leading to a more concave path. Ideally we want to choose GTI_pow such that the path is as straight as possible, as this will lead to the smallest difference between the true curve and the discrete approximation. The plot above is slightly concave using the default value GTI_pow = 3, but it is not pathologically concave. If we were producing results for publication and had a computer sitting idle overnight then we could consider re-running the MCMC with more rungs and a lower GTI_pow, but otherwise these results are fine.
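
For illustration, such an overnight re-run might look like the commented command below. This is a sketch only - it assumes GTI_pow is passed to run_mcmc() alongside the rungs argument, and the exact values chosen would depend on the shape of the path:

# possible re-run with a smoother, straighter GTI path (assumes GTI_pow is a
# run_mcmc() argument); uncomment to run
#myproj <- run_mcmc(myproj, K = 1:5, burnin = 1e4, converge_test = 1e2,
#                   samples = 1e5, rungs = 20, GTI_pow = 2)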

Plotting results

Once we are happy with our log-likelihood estimates and our GTI path, we can use our log-evidence estimates to compare values of $K$. The raw estimates can be plotted using the plot_logevidence_K() function, which plots 95% credible intervals of the log-evidence for each value of $K$:

plot_logevidence_K(myproj)

We can see immediately that the model favours $K = 3$ in this case, which agrees with our simulation parameters. If we had seen large credible intervals at this stage then we could re-run the model with a larger number of samples, but in this example the intervals are non-overlapping and the signal is clear so there is no need.

We can also use the plot_posterior_K() function to plot the posterior probability of each value of $K$, which is obtained by taking these raw estimates out of log space and applying an equal prior over $K$:

plot_posterior_K(myproj)

This second plot is often easier to interpret than the first as it is in units of ordinary probability. In this case we can see that there is a >99% posterior probability that the data were drawn from $K = 3$ subpopulations (which we know to be true). Again, if we saw large credible intervals at this stage then it would be worth re-running the MCMC with a larger number of samples, but here there is no need.
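
For intuition, the conversion from log-evidence to posterior probability can be sketched in a few lines. The log-evidence values below are made up purely for illustration - they are not taken from myproj:

# hypothetical log-evidence estimates for K = 1 to 5 (illustrative only)
log_evidence <- c(-1250, -1190, -1150, -1152, -1155)
# subtract the maximum before exponentiating to avoid numerical underflow
w <- log_evidence - max(log_evidence)
# normalise under an equal prior over K
posterior_K <- exp(w) / sum(exp(w))
round(posterior_K, 3)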

Comparing different parameter sets

One of the major advantages of the model evidence is that we can use it to compare wider evolutionary models, i.e. different parameter sets. Mathematically this is as simple as summing the evidence for each value of $K$, weighted by the prior.
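
Writing $\mathcal{S}$ for a parameter set, this is

$$\mbox{Pr}(x \: | \: \mathcal{S}) = \sum_{K} \mbox{Pr}(x \: | \: K, \mathcal{S}) \: \mbox{Pr}(K \: | \: \mathcal{S})$$

so under an equal prior over $K$ the evidence for a parameter set is simply the average of the evidence over the values of $K$ explored.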

We can test this by creating a second parameter set that differs from the first in that we now estimate the error terms:

# create new parameter set
myproj <- new_set(myproj, name = "error model", estimate_error = TRUE)
myproj

We then need to run thermodynamic MCMC for this parameter set over the same range of $K$ values. For the sake of this tutorial we will save time by loading results from file:

# annoyingly need to run at least some MCMC to pass travis checks. This output
# will be overwritten when run locally
myproj <- run_mcmc(myproj, K = 1:5, burnin = 1e1, converge_test = 1e2, samples = 1e1, rungs = 10, silent = TRUE)

# re-run MCMC
#myproj <- run_mcmc(myproj, K = 1:5, burnin = 1e4, converge_test = 1e2, samples = 1e5, rungs = 10)

# secretly save/load data from file so results are consistent
#saveRDS(myproj, file = "../ignore/inst/extdata/tutorial4_myproj2.rds")
if (file.exists("../ignore/inst/extdata/tutorial4_myproj2.rds")) {
  myproj <- readRDS("../ignore/inst/extdata/tutorial4_myproj2.rds")
}

# plot diagnostics
plot_loglike_diagnostic(myproj, K = 3)

# report ESS
get_ESS(myproj, K = 3)

Assuming we are happy with the MCMC behaviour we can go ahead and plot the posterior error estimates:

plot_e(myproj, K = 3)

We know from the simulation parameters that there are no errors in the data, meaning this extra flexibility of estimating the error is actually taking the model further away from the truth. Looking at the posterior credible intervals we can see that the model has struggled to arrive at precise estimates of the error rates, instead exploring the full prior range $[0,0.2]$ quite evenly.

We continue with the same analyses as in the previous section:

# use gridExtra package to arrange plots
library(gridExtra)

# produce plots and arrange in grid
plot1 <- plot_loglike(myproj, K = 3)
plot2 <- plot_GTI_path(myproj, K = 3)
plot3 <- plot_logevidence_K(myproj)
plot4 <- plot_posterior_K(myproj)
grid.arrange(plot1, plot2, plot3, plot4)

As with the previous model, the evidence is in favour of $K = 3$ in this example.

Finally, we come to the issue of comparing parameter sets. Just as with the evidence estimates over $K$, we can plot the log-evidence of a parameter set using the plot_logevidence_model() function, or the posterior distribution over models assuming an equal prior using the plot_posterior_model() function:

# produce evidence plots over parameter sets
plot1 <- plot_logevidence_model(myproj)
plot2 <- plot_posterior_model(myproj)
grid.arrange(plot1, plot2, nrow = 1)

Here we can see that the former, simpler model has a higher evidence than the more complex error estimation model. To understand this, we can imagine the MCMC exploring the parameter space under each model. In the simple model the error is fixed at the correct value of zero, while in the more complex model the MCMC has a finite chance of exploring different error values, all of which are sub-optimal in terms of the likelihood. So the likelihood will, on average, be higher under the simple model. In other words, in the complex model we have introduced additional flexibility that has not been compensated for by a commensurate increase in the likelihood, and so the model evidence naturally punishes the model for being over-fitted. In this example we would conclude that a zero-error model is a better fit to the data, and so we would most likely only report detailed results from this model.

It is clear from this tutorial that thermodynamic MCMC over multiple temperature rungs can take considerably longer to run than regular MCMC. Hence, the next tutorial describes how to run MALECOT in parallel over multiple cores.


