1. Introduction

1.1 Zig-zag sampling

The Zig-zag process can be used as an alternative to Markov Chain Monte Carlo to sample from a posterior distribution $\pi$. Its key properties are:

- it is a continuous-time, non-reversible process with piecewise linear trajectories;
- it only requires the target to be known up to a normalising constant;
- with sub-sampling (section 3) it still targets the posterior exactly while evaluating the likelihood of only one data point per iteration.

1.2 Background

Suppose the posterior is over $d$ dimensions, and imagine we set a particle at an arbitrary position $\xi_0 \in \mathbb{R}^d$ travelling at speed $1$ in direction $\theta_0 \in \{ \pm 1 \}^d$. The only change in the motion of the particle we allow is that at certain times $T$, a single component of its current direction of travel $\theta(T^-)$ flips sign: for some $i$, $\theta_i(T) \leftarrow - \theta_i(T^-)$, so the overall trajectory is piecewise linear.

The flip-times are chosen as follows: let $t$ denote the time elapsed since the last flip (or since the beginning of the process, if no flips have yet occurred). Let the particle's location at $t$ be $\xi(t)$, and its direction $\theta$. With each index $i \in \{1,\dots,d\}$ is associated a clock ticking at rate $\lambda_i(\xi(t),\theta)$, in the sense that if $\tau_i$ is the time until clock $i$ rings, [ \mathbb{P}( \tau_i > t) = \exp( - \int_0^t \lambda_i(\xi(s),\theta) \ ds). ] If clock $i$ is the first to ring, at time $t=T$, then the component $\theta_i$ flips at $T$: the particle changes direction, we reset $t \leftarrow 0$ and start the clocks again. The hope is that if the intensities $\lambda_i$ are chosen correctly, then the process is ergodic with invariant distribution $\pi$ over $\mathbb{R}^d$.

Happily, it turns out that under mild conditions on $\pi$ (including that it is continuously differentiable and everywhere positive), this is possible: writing $\pi(\xi) = 1/Z \cdot \exp( - \Psi(\xi))$ for a normalising constant $Z$, and taking [ \lambda_i(\xi, \theta) = (\theta_i \partial_i \Psi(\xi))^+, ] it can be shown (Bierkens, Fearnhead and Roberts, 2016) that the process has invariant distribution $\pi$ and that ergodic averages converge correctly. Moreover, the process satisfies the strong Markov property. A nice feature is that we need only know the posterior up to a constant factor to do this.
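For intuition, here is a minimal sketch (not part of the package) of these switching intensities for a standard Gaussian target, where $\Psi(\xi)=\frac{1}{2}||\xi||^2$ and hence $\lambda_i(\xi,\theta)=(\theta_i\xi_i)^+$: a coordinate moving away from the mode has a positive flip rate, while one moving towards it has rate zero.

#switching intensities for a standard Gaussian target, Psi(xi) = ||xi||^2 / 2
lambda_gauss <- function(xi, theta){
  pmax(theta * xi, 0)   #(theta_i * d_i Psi(xi))^+ = (theta_i * xi_i)^+
}

lambda_gauss(xi = c(-2, 0.5), theta = c(1, 1))   #first coordinate heads towards 0: rate 0; second heads away: rate 0.5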

A Zig-zag process is completely specified by listing the flip-times $\{T^i\}_{i=1}^\infty$, along with the positions $\{\Xi^i\}_{i=1}^\infty$ at which the flips occur and the directions $\{\Theta^i\}_{i=1}^\infty$ in which the process moves immediately after the flips. Call $\{(\Xi^i, \Theta^i, T^i)\}_{i=1}^\infty$ the $\textit{skeleton}$ of the process. Generating the (truncated) process is then essentially equivalent to generating a finite initial segment of the skeleton.

1.3 The Zig-zag package

The Zig-zag package offers a set of tools to:

- generate the skeleton of a Zig-zag process targeting a given distribution (the skeleton function), with or without sub-sampling;
- draw samples from the target distribution using a skeleton (get_samples);
- estimate integrals against the target via ergodic averages (zz_integrate);
- summarise and plot the resulting process (summary and plot methods for the "zz" class).

The main restriction is that the distribution must be of such a form that one of the bounding conditions described in section 2.3 holds.

1.3.1 The skeleton function

The core of the Zig-zag package is the skeleton function. Given initial parameters, along with the derivatives $\partial_i \Psi(\xi)$ associated with the target distribution, and some related bounds (see 2.3), it produces a skeleton of the Zig-zag process. This is returned as an object of class "zz", which is a list with four named elements:

All the other tools take a skeleton as an argument. Example syntax for a 2D Cauchy distribution is:

devtools::load_all()
#find partial derivatives
del_cauchy2D <- function(x){
  return(3*x/(1+sum(x^2)))
} 

#produce skeleton
test0 <- skeleton(xi = c(1, 1), theta = c(1, 1), n = 100, derivatives = del_cauchy2D, 
                 bounds = c(1.5, 1.5), bound_type = "global")

There are special summary and plot methods for the "zz" class (2D plots are illustrated in section 2.3):

summary(test0)

Here ESS is an estimate of the effective sample size.

del_cauchy1D <- function(x){
  2*x/(1+x^2)
}

test1 <- skeleton(xi = c(1), theta = c(1), n = 100, derivatives = del_cauchy1D, bounds = c(1))
plot(test1)

2. Implementation

2.1 The skeleton algorithm

Section 1.2 suggests a prototype algorithm to generate the skeleton of a Zig-zag process targeting the distribution $\pi(\xi)\propto\exp(-\Psi(\xi))$: given initial values $\Xi^{(0)}, \Theta^{(0)}$, for $k=1,2,...$,

1) Draw $\tau_1,\dots,\tau_d$ such that [ \mathbb{P}(\tau_i \ge t)=\exp\Big(-\int_0^t \lambda_i(\Xi^{(k-1)} + s \Theta^{(k-1)},\Theta^{(k-1)})\,ds\Big) ]
2) Let $i_0=\operatorname{argmin}(\tau_1,\dots,\tau_d)$ and $\tau=\tau_{i_0}$
3) Set $\Xi^{(k)}=\Xi^{(k-1)}+\tau\Theta^{(k-1)}$ and $\Theta^{(k)}_i=\Theta^{(k-1)}_i$ for $i\neq i_0$, $\Theta^{(k)}_{i_0}=-\Theta^{(k-1)}_{i_0}$

Drawing the first arrival time of an inhomogeneous Poisson process is generally not easy. We turn to how this is done next.

2.2 Sampling flip-times

Consider a single linear sub-path of the process starting from the skeleton point $(\Xi^{(k)},\Theta^{(k)},T^{(k)})$. It is not clear how to sample directly from the distributions of the $\tau_i$; however, sampling from similarly specified distributions such as [ \mathbb{P}( \tau > t) = \exp\Big(- \int_0^t M(s)\, ds\Big) ] where $M(t)$ is a "simple" function (in the sense that step 2 below can be carried out explicitly) is easy:

1) calculate $\int_0^t M(s) \, ds$
2) find $\textit{explicitly}$ the generalised inverse $G(y) = \inf \{ t \geq 0: \int_0^t M(s) \, ds \geq y \}$
3) sample $\tau = G( - \log U)$ for $U \sim U[0,1]$.
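For instance, with a constant rate $M(t) = c$ these steps give $\int_0^t M(s)\,ds = ct$, $G(y) = y/c$ and hence $\tau = -\log U / c$, an Exponential($c$) draw. A minimal sketch (not package code):

#inversion sampling of tau with survival function P(tau > t) = exp(-c t)
sample_tau_const <- function(c){
  u <- runif(1)
  -log(u)/c    #G(-log u) with G(y) = y/c; equivalent to rexp(1, rate = c)
}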

This idea can be used to sample exactly from the true distribution of the first clock $\tau_{i_0}$, by a thinning procedure analogous to the usual thinning of homogeneous Poisson processes. Suppose we can find simple bounds $M_i(t)$ with $M_i(t) \geq m_i(t) := \lambda_i(\Xi^{(k)} + t\Theta^{(k)}, \Theta^{(k)})$. Then

1) sample $\tilde \tau_i$ from the clocks with rates $M_i$, as described above
2) find $i_0 = \arg \min_i \tilde \tau_i$
3) accept $\tilde \tau_{i_0}$ as the time until the next flip with probability $m_{i_0} (\tilde \tau_{i_0})/M_{i_0} (\tilde \tau_{i_0})$; if the proposal is rejected, the particle simply keeps moving in the same direction from the position it has reached and fresh proposal times are drawn.

This completes the full procedure for generating the skeleton, provided the bounds $M_i$ are available.
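To make the procedure concrete, here is a self-contained sketch of the whole loop for the 2D Cauchy target of section 1.3.1, using the constant bounds $M_i = 1.5$. This is only an illustration of the algorithm; it is not how the package's skeleton function is implemented internally.

#sketch: skeleton of a Zig-zag process for the 2D Cauchy target, via thinning with constant bounds
del_cauchy2D <- function(x) 3*x/(1 + sum(x^2))              #partial derivatives of Psi
zigzag_sketch <- function(xi, theta, n_flips, derivs, c_bound){
  d <- length(xi)
  skel_xi <- matrix(NA, n_flips + 1, d)
  skel_theta <- matrix(NA, n_flips + 1, d)
  skel_t <- numeric(n_flips + 1)
  skel_xi[1, ] <- xi; skel_theta[1, ] <- theta
  k <- 1; t_now <- 0
  while(k <= n_flips){
    tau <- rexp(d, rate = c_bound)                          #proposal clocks with constant rates
    i0 <- which.min(tau)
    xi <- xi + tau[i0]*theta                                #move along the current direction
    t_now <- t_now + tau[i0]
    m <- max(theta[i0]*derivs(xi)[i0], 0)                   #true intensity at the proposed flip
    if(runif(1) < m/c_bound[i0]){                           #thinning: accept the flip
      theta[i0] <- -theta[i0]
      k <- k + 1
      skel_xi[k, ] <- xi; skel_theta[k, ] <- theta; skel_t[k] <- t_now
    }                                                       #on rejection, keep moving and propose again
  }
  list(xi = skel_xi, theta = skel_theta, time = skel_t)
}

sketch0 <- zigzag_sketch(c(1, 1), c(1, 1), 100, del_cauchy2D, c_bound = c(1.5, 1.5))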

2.3 Useful bounds

2.3.1 Global constant bounds

The easiest possible case is that the intensities $\lambda_i(\xi,\theta)$ are uniformly bounded by constants, that is, there exist $c_i$ with $\sup_{\xi} | \partial_i \Psi(\xi) | < c_i$. Then we can take $M_i = c_i$, and the samples from the $M_i$-clocks are just exponential with rate $c_i$, which pose no difficulty to simulate.

Example 1

Consider targeting a 2-dimensional Cauchy distribution, so $\pi(\xi)\propto (1+\xi^T\xi)^{-1.5}$. The required intensities are $(\theta_i\partial_i\Psi(\xi))^+ = (3 \theta_i \xi_i/(1+\xi^T\xi ) )^+$, so we can run the algorithm using the constant bounds $M_i(t)=1.5$ for $i=1,2$:

#define Cauchy log-density (used for the contour plot)
cauchy <- function(x){
  log(1/(1+sum(x^2))^(1.5))
}

#find partial derivatives
del_cauchy <- function(x){
  return(3*x/(1+sum(x^2)))
}

set.seed(888)

test <- skeleton(c(1, 1), c(1, 1), 100, del_cauchy, bounds = c(1.5, 1.5), 
                 bound_type = "global")
plot(test, contour = TRUE, cauchy)

2.3.2 Hessian bounds

A more general and useful case is that we can find a positive-definite matrix $Q$ which $\textit{dominates}$ the Hessian of $\Psi$, in the sense that $Q \succcurlyeq H_\Psi(\xi)$ for all $\xi$. In this case, it is easy to find constants $a_i$ and $b_i$ such that $m_i(t) \leq (a_i + b_i t)^+ =: M_i(t)$: [ \theta_i\partial_i\Psi(\xi+\theta t) = \theta_i\partial_i\Psi(\xi)+\int_0^t \sum_{j=1}^d\partial_i\partial_j\Psi(\xi+\theta s)\theta_j\,ds \le \underbrace{\theta_i\partial_i\Psi(\xi)}_{a_i} + \underbrace{||Qe_i||_2\,||\theta||_2}_{b_i}\, t. ] The generalised inverse can then be computed exactly as $G(y)=-(a/b)+\sqrt{(a/b)^2+(2y/b)}$.
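As an illustration (a sketch only, assuming $a_i \geq 0$ and $b_i > 0$; when $a_i < 0$ the positive part delays the clock and the inverse has to be shifted accordingly), the constants and the inversion step could look like:

#proposal time from a clock with rate M(t) = a + b*t (assuming a >= 0, b > 0)
sample_tau_linear <- function(a, b){
  y <- -log(runif(1))                    #Exp(1) draw
  -(a/b) + sqrt((a/b)^2 + 2*y/b)         #generalised inverse G(y) of t -> a*t + b*t^2/2
}

#Hessian-bound constants at the current skeleton point:
#a_i = theta_i * d_i Psi(xi), b_i = ||Q e_i||_2 * ||theta||_2
hessian_bound_constants <- function(xi, theta, derivs, Q){
  list(a = theta * derivs(xi),
       b = sqrt(colSums(Q^2)) * sqrt(sum(theta^2)))
}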

Example 2

Suppose we want samples from the posterior $\pi(\xi|x_{1:n})$, where $x_1,\dots,x_n\in\mathbb{R}^d$ are i.i.d. data generated from $f(x_j|\xi)\sim\mathcal{N}(\xi,\Sigma)$, and the prior on the parameter is $\xi\sim\mathcal{N}(0,\rho^2I)$. The Hessian of the negative log density is easily calculated as [ H_{\Psi}(\xi) = (1/\rho^2) I + n\Sigma^{-1}, ] independent of the state $\xi$, so we can choose $Q=H_{\Psi}$. We apply the algorithm with $d=2$, $\rho=1$, $\Sigma_{11} = 1, \ \Sigma_{12} = \Sigma_{21} = 1/2, \ \Sigma_{22} = 2$, and $n=1000$ datapoints generated from $\mathcal{N}(0,\Sigma)$.

library(MASS)
library(mvtnorm)

n <- 1000
set.seed(888)

#simulate data, then define the gradient of Psi and the dominating Hessian Q
data <- mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, 0.5, 0.5, 2), nrow = 2))
grad <- function(x){
  return(x - matrix(c(8/7, -2/7, -2/7, 4/7), nrow = 2)%*%(apply(data, 2, sum) - n*x))
}
Q <- matrix(c(1000*8/7+1, -1000*2/7, -1000*2/7, 1000*4/7+1), nrow = 2)

test2 <- skeleton(c(0, 0), c(1, 1), 500, grad, bounds = Q, bound_type = "hessian")

#define posterior density for contour plot
posterior <- function(x){
  return(dmvnorm(x, log = T) + sum(dmvnorm(data, mean = x, 
              sigma = matrix(c(1, 0.5, 0.5, 2), nrow = 2), log = T)))
}

plot(test2, contour = TRUE, f = posterior)

If $n$ is large it is generally expensive to evaluate the product $\prod_{j=1}^nf(x_j|\xi)$ (not an issue in this example, since we only need to precompute $\sum_{j} x_j$). The next type of bound, as well as the global constant bounds, can be used in a sub-sampling procedure which avoids evaluating the full likelihood, as we will see in section 3.

2.3.3 Lipschitz bounds

The flip times can also be sampled easily if the partial derivatives of $\Psi$ are globally and uniformly Lipschitz continuous, that is, if there exist $p \in [1, \infty]$ and constants $C_i \geq 0$ such that [ |\partial_i\Psi(\xi)-\partial_i\Psi(\xi')|\le C_i||\xi-\xi'||_p ] for all $i=1,\dots,d$ and for every $\xi,\xi'\in \mathbb{R}^d$. A short computation then shows that [ m_i(t)\le (\theta_i\partial_i\Psi(\xi^*))^++C_i(||\xi-\xi^*||_p+t||\theta||_p)=:M_i(t) ] for any reference point $\xi^*$. We can use these bounds to sample the flip-times in the same way as in the case of a Hessian bound.

Two choices (which will typically affect how good the bounds are) have to be made:

- the norm, i.e. the value of $p$;
- the reference point $\xi^*$.
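As an illustration of how these bounds are used (a sketch, not the package's internal code), for $p=\infty$ the bound is again of the linear form $a_i + b_i t$, so the inversion of section 2.3.2 applies:

#Lipschitz bound M_i(t) = (theta_i * d_i Psi(xi_star))^+ + C_i*(||xi - xi_star||_p + t*||theta||_p),
#written as a_i + b_i*t for p = Inf; derivs_at_star is the gradient of Psi at the reference point
lipschitz_bound_constants <- function(xi, theta, xi_star, derivs_at_star, C){
  sup_norm <- function(v) max(abs(v))
  list(a = pmax(theta * derivs_at_star, 0) + C * sup_norm(xi - xi_star),
       b = C * sup_norm(theta))
}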


3. Zig-Zag with sub-sampling

Suppose that we can write [ \partial_i\Psi(\xi)=\frac{1}{n}\sum_{j=1}^nE_i^j(\xi) ] for some continuous $E_i^j$ (as we can for an appropriate Bayesian posterior, for example). Then the Zig-zag process with intensities [ \lambda_i(\xi,\theta)=\frac{1}{n}\sum_{j=1}^n(\theta_iE_i^j(\xi))^+ ] has the correct target distribution; moreover, if we have bounds [ M_i(t)\ge m_i^j(t) := (\theta_iE_i^j(\xi+\theta t))^+ ] for all $j=1,\dots,n$, the process can be simulated by evaluating only $\textbf{one}$ of the $E_i^j$ in each iteration: a data index $j$ is drawn uniformly at random and the proposed flip is accepted with probability $m_i^j(\tilde\tau_i)/M_i(\tilde\tau_i)$. This makes sub-sampling a very efficient method for sampling from posterior distributions, since the likelihood only needs to be evaluated at one data point per iteration.
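A sketch of this acceptance step (hypothetical helper names, not the package's internals): after proposing a flip time from the bound, a single data index is drawn uniformly and only that term enters the acceptance probability.

#sub-sampled thinning step: E_funs[[j]](xi) returns the vector E^j(xi); M_val = M_{i0}(tau_prop)
accept_flip_subsample <- function(E_funs, n, xi_prop, theta, i0, M_val){
  j <- sample.int(n, 1)                                #one data point chosen uniformly
  m_j <- max(theta[i0] * E_funs[[j]](xi_prop)[i0], 0)  #m_{i0}^j at the proposed position
  runif(1) < m_j / M_val                               #TRUE: flip theta_{i0}; FALSE: keep moving
}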

Example 2 (continued)

Returning to the Gaussian likelihoods, we can use the Lipschitz bounds (using the Hessian bounds would defeat the purpose of sub-sampling, since obtaining bounds $M_i(t)\ge m_i^j(t)$ for all $j=1,\dots,n$ would require computing the likelihood for all data points). Ignore the prior for notational simplicity, and take $p=\infty$. Since [ \nabla\log f(x_j|\xi) = \Sigma^{-1}(x_j-\xi), ] the Lipschitz condition is satisfied for $C_i=1/\rho^2 + n||\Sigma^{-1}||_{\infty}$ by submultiplicativity (the term $1/\rho^2$ is due to the prior). We can compute the posterior mode exactly and choose it as the reference point: [ \xi^* = \Big(\frac{1}{n\rho^2}I+\Sigma^{-1}\Big)^{-1}\Sigma^{-1}\bar{x}, \qquad \bar{x}=\frac{1}{n}\sum_{j=1}^{n}x_j, ] which requires a one-off cost of $O(n)$.
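With the objects from the code of Example 2 above (data, n and $\Sigma$, with $\rho = 1$), the constants $C_i$ and the reference point could be computed as in the following sketch (how they are then passed to skeleton follows the Lipschitz/sub-sampling interface used in Example 3 below):

#Lipschitz constants and reference point for the Gaussian example (rho = 1)
Sigma_inv <- matrix(c(8/7, -2/7, -2/7, 4/7), nrow = 2)
rho <- 1

#C_i = 1/rho^2 + n*||Sigma^{-1}||_inf (maximum absolute row sum)
C <- rep(1/rho^2 + n*max(rowSums(abs(Sigma_inv))), 2)

#reference point: posterior mode (I/(n*rho^2) + Sigma^{-1})^{-1} Sigma^{-1} x_bar
x_bar <- colMeans(data)
xi_star <- as.vector(solve(diag(2)/(n*rho^2) + Sigma_inv, Sigma_inv %*% x_bar))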

4. Sampling and ergodic averages

In this section we'll look at how to turn the skeleton into something useful for posterior inference.

4.1 Sampling

Given a skeleton $\{(\Xi^i, \Theta^i, T^i)\}_{i=1}^N$, if we want $k$ samples from $\pi$ we can simply evaluate the Zig-zag trajectory at times $m \cdot T^N / k$, $m=1,\dots,k$. Importantly, the skeleton points themselves are $\textit{not}$ samples from $\pi$: they are obviously biased towards the tails in every direction. The get_samples function generates such samples:

get_samples(skeleton = test2, k = 5)
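For intuition, reading off the position at an arbitrary time $t$ is just linear interpolation from the last skeleton point before $t$; a minimal sketch, with times, xis and thetas standing for the flip times, positions and directions of a skeleton (hypothetical names, not necessarily the package's internal field names):

#position of the Zig-zag trajectory at time t, by linear interpolation of the skeleton
position_at <- function(times, xis, thetas, t){
  i <- max(which(times <= t))               #last flip at or before time t
  xis[i, ] + thetas[i, ]*(t - times[i])     #move from that flip point in the current direction
}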

4.2 Ergodic averages

One way to estimate ergodic averages is to use samples $\xi_m$ obtained as above and estimate [ \widehat{\pi f} = \frac{1}{k} \sum_{m=1}^k f(\xi_m). ] The zz_integrate function, with the default averaging = "discrete" option, does this:

#example integrand: the sum of the coordinates, raised to the power n
add_and_exp <- function(a, n){
  sum(a)^n
}

zz_integrate(f = add_and_exp, skeleton = test2, averaging = "discrete", k = 500, n = 2)

Note that arguments of $f$ which we do not integrate over (such as $n$ above) can be passed to $f$ by specifying them at the end of the call. It is best to name them explicitly so that zz_integrate does not mistake them for the sample size $k$.

Alternatively, to avoid an arbitrary choice of sample size, we can use continuous time ergodicity to estimate

[ \widehat{\pi f} = \frac{1}{T^N} \int_0^{T^N} f(\xi(s)) \ ds = \frac{1}{T^N} \sum_{i=0}^{N-1} \int_{T^{i}}^{T^{i+1}} f(\Xi^i + \Theta^i(s - T^i)) \ ds. ]

In principle, the integrals can be calculated explicitly if $f$ is of a nice form, or else estimated numerically by any common method. The averaging = "continuous" option in zz_integrate implements this using the base-R integrate function.

zz_integrate(add_and_exp, test2, averaging = "continuous", n = 2)

5. Scaling

In this section we appeal to central limit theorem-type results to compare the costs of producing new skeleton samples at equilibrium for the vanilla and sub-sampling ("mango and passion fruit"?) versions of the Zig-zag sampler. The key point is that Zig-zag with sub-sampling is heuristically super-efficient.

5.1 Basic Zig-zag

Let $\hat \xi$ be the MLE of $\xi$ for the model with i.i.d. data $x_1,\dots,x_n \sim f(\cdot| \xi_0)$, so that $\Psi(\xi) = - \sum_j \log f(x_j|\xi)$. For $\xi$ near $\hat \xi$, after Taylor expanding, putting $\phi = \sqrt{n}(\xi - \hat \xi)$, and rescaling time by $\sqrt{n}$ so that the speed of the process remains $1$, we find that the intensities are [ \lambda_i( \phi, \theta) = \Big( -\theta_i \frac{1}{n} \sum_j \sum_k \partial_i \partial_k \log f(x_j| \hat \xi)\, \phi_k \Big)^+ + O(n^{-1/2}).] Taking $n \to \infty$, it can be shown that $\lambda_i( \phi, \theta) \to (\theta_i \cdot [i(\xi_0)\phi]_i)^+$ almost surely, where $i(\xi_0)$ is the (true) Fisher information. The fact that the dependence on $n$ disappears means that we can obtain independent samples from the process at rate $O(1)$.

For example, with global constant bounds each proposed flip requires evaluating a partial derivative of $\Psi$, which costs $O(n)$, since each of the $n$ terms in $\Psi$ takes $O(1)$ to calculate.

5.2 Zig-zag with sub-sampling

A similar (but more involved) argument shows that for the sub-sampled version:

1) independent samples can again be drawn at rate $O(n^{-1/2})$;
2) the cost of calculating the bounds is again $O(n^{1/2})$;
3) the cost of proposing a new skeleton point is only $O(1)$ by design, resulting in an overall $O(1)$ complexity.

This is the super-efficiency property.

Example 3

Consider again a Bayesian setting where we are interested in obtaining samples from a posterior $\pi(\xi|x_{1:n})\propto \prod_{j=1}^n f(x_j|\xi)$ (assume an improper flat prior for notational simplicity). We want to test the scalability of Zig-zag with sub-sampling on a model other than the normal one above, so we define the likelihood $f(x|\xi)\propto \exp\Big(-\frac{1+100||x-\xi||_2^2}{1+||x||_2}\Big)$. Data simulated from this distribution with $\xi=0\in\mathbb{R}^2$ are shaped like a doughnut with a hole at the origin, so if we naively chose $E_i^j(\xi)=-n\,\partial_i \log f(x_j|\xi)$ then the intensities $m_i^j$ would vary vastly in $j$, resulting in bad bounds $M_i$.

#simulate doughnut-shaped data using a random-walk Metropolis sampler targeting exp(-psi_dough)
MCMC <- function(K=1000, start=c(0,0)){
  B = matrix(NA, K, 2)
  mu = start
  B[1,] = mu
  lp = -psi_dough(mu)
  for(k in 2:K){
    mup = mu + 0.1*rnorm(2)
    lpp = -psi_dough(mup)
    MHR = lpp - lp
    if(log(runif(1))<MHR){
      mu = mup
      lp = lpp
    }
    B[k,] = mu
  }
  return(B)
}

psi_dough <- function(x){
  (1 + 100*sum(x^2))/(0.01 + sqrt(sum(x^2)))
}

#evaluate -psi_dough on a grid for the contour plot
grid_pts <- seq(-0.3, 0.3, 0.01)
z <- matrix(0, nrow = length(grid_pts), ncol = length(grid_pts))
for(i in seq_along(grid_pts)){
  for(j in seq_along(grid_pts)){
    z[i,j] <- -psi_dough(c(grid_pts[i], grid_pts[j]))
  }
}

data = MCMC(1000, start=c(0,0.1))
filled.contour(
  x = seq(-0.3,0.3,length.out=nrow(z)),
  y = seq(-0.3,0.3,length.out=ncol(z)),
  z = z,
  plot.axes={points(data,cex=0.5); axis(1); axis(2); box()}
)

However, instead of directly using the individual likelihoods one can "smoothen" the components making up $\Psi=-\log\pi$ and choose $E_i^j(\xi) = \partial_i\Psi(\xi^*) + n\big(\partial_i\Psi^j(\xi) - \partial_i\Psi^j(\xi^*)\big)$, where $\Psi^j(\xi) = -\log f(x_j|\xi)$. A short computation shows $\nabla \Psi^j(\xi) = \frac{2(\xi-x_j)}{1+||x_j||_2}$, so the Lipschitz condition is satisfied with the constants $C_i = 2n\max_j \Big|\frac{x_j^{(i)}}{1+||x_j||_2}\Big|$. If we choose the reference point $\xi^*$ to be the mean of the data (which is easy to compute and requires a one-off $O(n)$), Zig-zag with sub-sampling turns out to be super-efficient in this case too, in the sense that the effective sample size per epoch (one epoch being the time required for one evaluation of the full likelihood) does not deteriorate as $n$ grows.

results = matrix(NA, nrow = 6, ncol = 11)  #columns 2 to 11 hold the 10 repetitions
for(repetition in 2:11){
  for(runner in 1:6){
    set.seed(42)
    n = 2^(10+runner)
    data = MCMC(n, start=c(0,0.1))

    #gradient of Psi at the reference point, and per-observation derivative closures
    xi0 = apply(data, 2, mean)
    d_ref = colSums(2*(matrix(xi0, n, 2, byrow = TRUE) - data)/(1 + sqrt(rowSums(data^2))))

    derivatives = vector("list", n)
    for(i in 1:n){
      closure <- function(values) {
        force(values)
        function(x) {
          return(d_ref + 2*n*(x-xi0)/(1+sqrt(sum(values^2))))     
        }
      }
      derivatives[[i]] <- closure(data[i,])
    }

    bounds = 2*n*apply(abs(data/(1+sqrt(apply(data^2,1,sum)))), 2, max)

    test = skeleton(c(0,0.1), c(1,1), 10000, derivatives, bounds, bound_type = "lipschitz", subsample = TRUE, xi0 = xi0, p = Inf)
    results[runner,repetition] = log(ESS(mean, test, 100)/(test$time[10000]/test$time[2]))
  }
}

ess_res = apply(results[,-1], 1, median)  #median over repetitions (drop the unused first column)
covariates = 1:6
fitline = lm(ess_res~covariates)

boxplot(results[1,-1], results[2,-1], results[3,-1], results[4,-1], results[5,-1], results[6,-1], col="grey",
        ylab = "log(ESS/epochs)", xlab = "log(n), base = 2")
axis(1,at=1:6,labels=11:16)
abline(fitline$coefficients,col="red")

6. Limitations

The Zig-zag sampler is a powerful sampling method, especially in big-data settings where repeated evaluation of functions of the full data set is infeasible. However, it does have its downsides:

Intuitively, like many other algorithms, the Zig-zag sampler is not well suited to multimodal target distributions. A related issue arises in the sub-sampling approach: when the target has modes which are not well separated, we expect Zig-zag with sub-sampling to suffer from the fact that several, roughly equally good, choices for $\xi^*$ exist. More practically, Zig-zag can be difficult to apply because knowledge of the gradient of the log-density (and of suitable bounds on it) is required; in a normal mixture model, for example, this expression takes a rather complicated form.


References

Bierkens, J., Fearnhead, P., and Roberts, G. (2016). The Zig-Zag Process and Super-Efficient Sampling for Bayesian Analysis of Big Data


