Introduction

In this vignette we consider the exposition and toy problem of Neal (2011), Section 5.3.3, and quote from this work extensively throughout.

Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) method in which proposals are made efficiently by using the gradient of the density of interest. The gradient information allows us to propose a new state by simulating Hamilton's equations. All that is required for the proposal is the current position (the current sample) and a momentum vector, which is drawn independently of the current sample.
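
For a system with Hamiltonian $H(q,p)$, Hamilton's equations are

$$ \frac{dq_i}{dt} = \frac{\partial H}{\partial p_i}, \qquad \frac{dp_i}{dt} = -\frac{\partial H}{\partial q_i}, $$

and it is the (approximate) simulation of these dynamics that generates the proposal.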

For HMC, we are required to write down a density of the form

$$ p(q,p) = \frac{1}{Z}\exp\left(\frac{-U(q)}{T}\right)\exp\left(\frac{-K(p)}{T}\right) $$

where $T$ is the system `temperature', $Z$ is the normalising constant, $U(q)$ is the potential energy (a function of the position $q$) and $K(p)$ is the kinetic energy (a function of the momentum $p$). Note that in this case, the product form is a result of the Hamiltonian being decomposed as a sum of the potential energy and the kinetic energy.
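
Equivalently, since the Hamiltonian is $H(q,p) = U(q) + K(p)$, the density above can be written as

$$ p(q,p) = \frac{1}{Z}\exp\left(-\frac{H(q,p)}{T}\right) = \frac{1}{Z}\exp\left(-\frac{U(q)}{T}\right)\exp\left(-\frac{K(p)}{T}\right). $$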

The position $q$ takes the role of the parameters and $U(q)$ the role of the negative log probability density function. Hence, if $L(q \mid Z)$ is the likelihood (with $Z$ here denoting the data, not the normalising constant above) and $\pi(q)$ is the prior, we have that $$ U(q) = -\log(\pi(q)L(q \mid Z)). $$ Once we have defined a function for the kinetic energy, and hence have also defined the Hamiltonian, we can simulate the dynamics using the leapfrog method. The leapfrog method, like any other discretisation scheme for approximately solving a continuous-time equation, requires specification of the step size $\epsilon$. We can also choose how long a trajectory we wish to simulate through the number of leapfrog steps $L$. These two parameters need to be carefully tuned for each problem when using HMC.
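
For illustration only (this is not the package's internal code; leapfrog and its arguments are hypothetical names), a single leapfrog trajectory with an identity mass matrix could be sketched as:

leapfrog <- function(q, p, dUdq, eps, L) {
  p <- p - (eps / 2) * dUdq(q)            # initial half-step for the momentum
  for (l in seq_len(L)) {
    q <- q + eps * p                      # full step for the position (identity mass matrix)
    if (l < L) p <- p - eps * dUdq(q)     # full momentum step, except after the final position step
  }
  p <- p - (eps / 2) * dUdq(q)            # final half-step for the momentum
  list(q = q, p = p)                      # proposed position and momentum
}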

A bivariate toy problem

Consider the density defined above with $T=1$, $U(q) = q'\Sigma^{-1}q/2$ where $$\Sigma = \begin{bmatrix} 1 & -0.98 \\ -0.98 & 1 \end{bmatrix}$$ and $K(p) = p'p/2$. The latter implies that the momentum $p$ has a bivariate normal distribution with independent components of unit variance. The model can then be set up in R as follows:

library(devtools)
load_all()   # load the package from source (development workflow)
### Define model
S <- matrix(c(1, -0.98, -0.98, 1), 2, 2)        # covariance matrix Sigma
n <- nrow(S)
Q <- chol2inv(chol(S))                          # precision matrix Sigma^{-1}
cholQ <- chol(Q)                                # upper Cholesky factor of Q
U <- function(q) 0.5 * crossprod(cholQ %*% q)   # potential energy q'Qq/2
dUdq <- function(q) Q %*% q                     # gradient of the potential energy
M <- diag(1, n)                                 # identity mass matrix
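
As a quick sanity check (the test point q_test is arbitrary and not part of the original example), the two ways of computing the potential energy should agree:

q_test <- c(1, -1)                  # arbitrary test point
U(q_test)                           # potential evaluated via the Cholesky factor
0.5 * t(q_test) %*% Q %*% q_test    # the same quadratic form computed directly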

We will now set up the HMC sampler. As discussed above, there are two crucial parameters which need to be set by the user, $\epsilon$ and $L$. We set $\epsilon$ equal to the smallest eigenvalue of the covariance matrix; its square root is the standard deviation in the "most constrained" direction, so this is a conservative choice of step size. We then set $L$ equal to the largest eigenvalue divided by $\epsilon$, which gives an indication of the number of steps needed to traverse the "least constrained" direction. The sampler is set up to take an $\epsilon$-generator, that is, a function which when called returns a value for $\epsilon$. This allows for random step sizes, as we demonstrate later on in this tutorial.

### Sampler parameters
E <- eigen(S)                                   # eigendecomposition of Sigma
eps_gen <- function() round(min(E$values), 2)   # step size: smallest eigenvalue of Sigma
L <- as.integer(max(E$values) / eps_gen())      # number of leapfrog steps
print(paste0("eps = ", eps_gen(), ". L = ", L))
sampler <- hmc_sampler(U = U, dUdq = dUdq, M = M, eps_gen = eps_gen, L = L)

The function hmc_sampler returns a sampler, which we can then call inside a loop as usual. In this case we will take $N = 1000$ samples and set the initial sample $q^{(0)} = 0$. We then plot the samples.

### Now sample
N <- 1000
q <- matrix(0, n, N)   # first column is the initial sample q^(0) = 0
for(i in 2:N) q[,i] <- sampler(q = q[,(i-1)])
plot(t(q), ylim = c(-4,4), xlim = c(-4,4))

This is comparable to sampling directly from the distribution:

samps <- t(chol(S)) %*% matrix(rnorm(N*n), n, N)   # direct samples via the lower Cholesky factor
plot(t(samps), ylim = c(-4,4), xlim = c(-4,4))
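
As an informal check (not part of the original example), we could also compare the empirical covariance of each set of samples with $\Sigma$:

cov(t(q))       # empirical covariance of the HMC samples; should be close to S
cov(t(samps))   # empirical covariance of the direct samples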

The value of $L$ is quite large in this case, due to the large disparity between the least constrained and most constrained directions. This can be improved by approximate pre-whitening, which we can do whenever we have a suitable approximation to the covariance matrix of interest (for example, following a Laplace approximation). We can pre-whiten by defining a new position variable $\tilde q = C^{-1}q$, where $C$ is the lower Cholesky factor of the covariance matrix $\Sigma$. We can keep the same kinetic energy function, but the new potential energy function now works with the pre-whitened variables: $\tilde U(\tilde q) = U(C\tilde q)$. Alternatively, as shown in Neal (2011), Section 5.4.1, we can simply set the matrix $M$ which appears in the kinetic energy function to the inverse of the covariance matrix, that is, $\Sigma^{-1}$. If we have correctly pre-whitened, then we can set $\epsilon = 1$, since we are no longer more constrained in one direction than another, and also $L = 1$, which reduces HMC to a Langevin method.
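
As a sketch of the first (explicit transformation) approach, under the definitions above and with the chain rule giving the gradient, the whitened potential could be written as follows; the names C, U_w and dU_wdq are illustrative only, and the code below instead takes the equivalent route of setting $M = \Sigma^{-1}$:

C <- t(chol(S))                                   # lower Cholesky factor of Sigma
U_w <- function(qw) U(C %*% qw)                   # potential in whitened coordinates
dU_wdq <- function(qw) t(C) %*% dUdq(C %*% qw)    # gradient via the chain rule
# Samples drawn in the whitened coordinates would be mapped back via q = C %*% qw.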

### Sampler parameters
eps_gen2 <- function() 1   # unit step size in the pre-whitened setting
print(paste0("eps = ", eps_gen2(), ". L = ", 1L))
sampler <- hmc_sampler(U = U, dUdq = dUdq, M = solve(S), eps_gen = eps_gen2, L = 1L)

Since every sample now requires only one leapfrog step, this sampler is roughly two orders of magnitude faster than the one which does not employ a transformation.

### Now sample
N <- 1000
q <- matrix(0,n,N)
for(i in 2:N) q[,i] <- sampler(q = q[,(i-1)])
plot(t(q),type="p",ylim = c(-4,4),xlim=c(-4,4))

Up to now we have used a function which returns a constant value for $\epsilon$. Although not needed for this simple toy study, we can also jitter $\epsilon$ -- this is a good idea when different regions of the probability space have markedly different gradients, and hence would benefit from different step sizes for exploration. Jittering $\epsilon$ reduces the possibility of always rejecting proposals in such regions.

eps_gen3 <- function() runif(n = 1, min = 0.8, max = 1.2)   # jittered step size around 1
sampler <- hmc_sampler(U = U, dUdq = dUdq, M = solve(S), eps_gen = eps_gen3, L = 1L)
q <- matrix(0, n, N)
for(i in 2:N) q[,i] <- sampler(q = q[,(i-1)])
plot(t(q), type = "p", ylim = c(-4,4), xlim = c(-4,4))

As expected, for this problem we do not notice any change in performance due to jittering.

Adding constraints

We now consider the problem of sampling from the distribution under constraints that force each component of $q$ to lie on one side of zero. We use the 'billiards' approach to cater for this -- if a coordinate crosses the zero threshold during a leapfrog step, it is bounced back: it remains where it is, with its momentum reversed.
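
A minimal sketch of how such a bounce could be implemented inside a single position update is given below; this is illustrative only (constrained_step is a hypothetical name), and the package's internal implementation may differ:

constrained_step <- function(q, p, eps, lower, upper) {
  q_new <- q + eps * p                       # tentative position update
  hit <- (q_new < lower) | (q_new > upper)   # coordinates that would violate a bound
  q_new[hit] <- q[hit]                       # bounced coordinates stay where they are...
  p[hit] <- -p[hit]                          # ...with their momentum reversed
  list(q = q_new, p = p)
}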

To introduce constraints, we only need to specify two vectors, lower and upper. In the following example we constrain $q_1 < 0$ and $q_2 > 0$, thus forcing the samples to lie in the upper-left quadrant of the plane. Since lower and upper must be supplied as full-length vectors, we set arbitrarily large magnitudes for the 'dummy' (unconstrained) bounds:

sampler <- hmc_sampler(U = U, dUdq = dUdq, M = M, eps_gen = eps_gen, L = L,
                       lower = c(-1e10, 0), upper = c(0, 1e10))
N <- 1000
q <- matrix(0, n, N)
for(i in 2:N) q[,i] <- sampler(q = q[,(i-1)])
plot(t(q), type = "p", ylim = c(-4,4), xlim = c(-4,4))
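
As a quick check (not part of the original example), we can verify that all samples respect the bounds:

all(q[1,] <= 0)   # should be TRUE: q1 is constrained to be nonpositive
all(q[2,] >= 0)   # should be TRUE: q2 is constrained to be nonnegative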

References

Neal, R. M. (2011). MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC.


