```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
par(mar=c(5, 4.5, 4, 2)+0.1, bty="l")
```



# Bayesian Statistics & MCMC

#library(MCMCpack)

There are fundamental differences between classical and Bayesian approaches, but for those of us interested in applied statistics the hope is that these differences do not translate into practical differences, and this is often the case. My advice would be: if you can fit the same model using different packages and/or methods, do so, and if they give very different answers, worry. In some cases differences will exist, and it is important to know why, and which method is more appropriate for the data in hand.

In the context of a generalised linear mixed model (GLMM), here are what I see as the pros and cons of using (restricted) maximum likelihood (REML) versus Bayesian Markov chain Monte Carlo (MCMC) methods. REML is fast and easy to use, whereas MCMC can be slow and technically more challenging. Particularly challenging is the specification of a sensible prior, something which is a non-issue in a REML analysis. However, analytical results for non-Gaussian GLMMs are generally not available, and REML-based procedures use approximate likelihood methods that may not work well. MCMC is also an approximation, but the accuracy of the approximation increases the longer the analysis is run for, being exact at the limit. In addition, REML uses large-sample theory to derive approximate confidence intervals that may have very poor coverage, especially for variance components. MCMC measures of confidence, by contrast, are exact up to Monte Carlo error, and provide an easy and intuitive way of obtaining measures of confidence on derived statistics such as ratios of variances, correlations and predictions.

To illustrate the differences between the approaches let's imagine we've observed several random deviates (${\bf y}$) from a standard normal (i.e. $\mu=0$ and $\sigma^{2}=1$). The likelihood is the probability of the data given the parameters:

$$ Pr({\bf y} | \mu, \sigma^{2}) $$

This is a conditional distribution, where the conditioning is on the model parameters which are taken as fixed and known. In a way this is quite odd because we've already observed the data, and we don't know what the parameter values are. In a Bayesian analysis we evaluate the conditional probability of the model parameters given the observed data:

$$ Pr(\mu, \sigma^{2} | {\bf y}) $$

which seems more reasonable, until we realise that this probability is proportional to

$$ Pr({\bf y} | \mu, \sigma^{2})Pr(\mu, \sigma^{2}) $$

where the first term is the likelihood, and the second term represents our prior belief in the values that the model parameters could take. Because the choice of prior is rarely justified by an objective quantification of the state of knowledge, it has come under criticism, and indeed we will see later that the choice of prior can make a difference.

## Likelihood

We can generate 5 observations from this distribution using `rnorm`:

```r
Ndata<-data.frame(y=rnorm(5, mean=0, sd=sqrt(1)))
Ndata$y
```

We can plot the probability density function for the standard normal using `dnorm` and we can then place the 5 data points on it:

```r
pos.y<-seq(-3,3,0.1)                              # possible values of y
Probability<-dnorm(pos.y, mean=0, sd=sqrt(1))     # density of possible values 
plot(Probability~pos.y, type="l", xlab="y")
Probability.y<-dnorm(Ndata$y, mean=0, sd=sqrt(1)) # density of actual values
points(Probability.y~Ndata$y)
```

The likelihood of these data, conditioning on $\mu=0$ and $\sigma^2=1$, is proportional to the product of the densities (read off the y axis on Figure \@ref(fig:dnorm-fig)):

```r
prod(dnorm(Ndata$y, mean=0, sd=sqrt(1)))
```

Of course we don't know the true mean and variance and so we may want to ask how probable the data would be if, say, $\mu=0$, and $\sigma^2=0.5$:

```r
plot(dnorm(pos.y, mean=0, sd=sqrt(1))~pos.y, type="l", ylim=c(0,max(dnorm(0, mean=0, sd=sqrt(0.5)))), ylab="density", xlab="pos.y")
points(Ndata$y, dnorm(Ndata$y, mean=0, sd=sqrt(1)))
lines(dnorm(pos.y, mean=0, sd=sqrt(0.5))~pos.y, type="l", col="red")
points(Ndata$y, dnorm(Ndata$y, mean=0, sd=sqrt(0.5)), col="red")
prod(dnorm(Ndata$y, mean=0, sd=sqrt(0.5)))
```

It would seem that the data are more likely under this set of parameters than the true parameters, which we must expect some of the time just from random sampling. To get some idea as to why this might be the case we can overlay the two densities (Figure \@ref(fig:dnorm1-fig)), and we can see that although some data points (e.g. `r round(Ndata$y[which(dnorm(Ndata$y, mean=0, sd=sqrt(0.5))<dnorm(Ndata$y, mean=0, sd=sqrt(1)))[1]],3)`) are more likely with the true parameters, in aggregate the new parameters produce a higher likelihood.

The likelihood of the data can be calculated on a grid of possible parameter values to produce a likelihood surface, as in Figure \@ref(fig:Lsurface-fig). The densities on the contours have been scaled so they are relative to the density of the parameter values that have the highest density (the maximum likelihood estimate of the two parameters). Two things are apparent. First, although the surface is symmetric about the line $\mu = \hat{\mu}$ (where $\hat{}$ stands for maximum likelihood estimate), the surface is far from symmetric about the line $\sigma^{2} = \hat{\sigma}^{2}$. Second, there is a large range of parameter values for which the data are only 10 times less likely than if the data were generated under the maximum likelihood estimates.

```r
loglik<-function(par, y){sum(dnorm(y, par[1], sqrt(par[2]), log=TRUE))}
mu<-seq(-2,2,length=100)
sigma2<-seq(0.0001,5,length=100)
L<-matrix(0,100,100)
P<-L
for(i in 1:100){
for(j in 1:100){
L[i,j]<-loglik(c(mu[i], sigma2[j]), Ndata$y)
}}
```

```r
# Figure (Lsurface-fig): likelihood surface for Pr(y | mu, sigma^2). The
# likelihood has been normalised so that the maximum likelihood has a value of one.
contour(mu,sigma2,exp(L-max(L)), xlab=expression(mu), ylab=expression(sigma^2), cex.lab=1.25, nlevels=15)
```

### Maximum Likelihood (ML)

The ML estimator is the combination of $\mu$ and $\sigma^{2}$ that make the data most likely. Although we could evaluate the density on a grid of parameter values (as we did to produce Figure \@ref(fig:Lsurface-fig)) in order to locate the maximum, for such a simple problem the ML estimator can be derived analytically. However, so we don't have to meet some nasty maths later, I'll introduce and use one of R's generic optimising routines that can be used to maximise the likelihood function (in practice, the log-likelihood is maximised to avoid numerical problems):


```r
loglik<-function(par, y){sum(dnorm(y, par[1], sqrt(par[2]), log=TRUE))}
MLest<-optim(c(mean=0,var=1), fn=loglik, y=Ndata$y, control = list(fnscale = -1,reltol=1e-16))$par
contour(mu,sigma2,exp(L-max(L)), xlab=expression(mu), ylab=expression(sigma^2), cex.lab=1.5, nlevels=15, col.lab="red", lwd=1.2, col="blue")
points(MLest[1], MLest[2], pch=16, cex=2)
arrows(1.5, 0.5, MLest[1],MLest[2], length=0.15, lwd=2)
text(1.5, 0.4, "ML", pos=4, cex=1.5)
```

The first argument to `optim` gives starting values for the optimisation algorithm, and the second argument (`fn`) is the function to be maximised. By default `optim` will try to minimise the function, hence the control option `fnscale = -1`, which multiplies the function by -1 so that minimising the scaled function maximises the log-likelihood. The algorithm has successfully found the mode:

```r
MLest
```

Alternatively we could also fit the model using `glm`:

```r
m1a.1<-glm(y~1, data=Ndata)
summary(m1a.1)
```

Here we see that although the estimate of the mean (intercept) is the same, the estimate of the variance (the dispersion parameter: `r formatC(summary(m1a.1)$dispersion, 3, format="f")`) is higher when fitting the model using `glm`. In fact the ML estimate is a factor of $\frac{n}{n-1}$ smaller:

```r
MLest["var"]*(5/4)
```
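As a quick check (a small sketch using the objects defined above, not part of the original text), the rescaled ML estimate should agree, up to the tolerance of the optimiser, with R's `var`, which divides the sum of squares by $n-1$:

```r
n <- nrow(Ndata)
c(ML       = sum((Ndata$y - mean(Ndata$y))^2)/n,  # divides by n: the ML estimate
  REML     = var(Ndata$y),                        # divides by n-1: matches glm's dispersion
  rescaled = unname(MLest["var"])*n/(n - 1))      # ML estimate scaled back up
```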

### Restricted Maximum Likelihood (REML)

To see why this happens, imagine we had observed only two values of $y$ (Figure \@ref(fig:muvar-fig)). The variance is defined as the average squared distance between a random variable and the true mean. However, the ML estimator of the variance is the average squared distance between a random variable and the ML estimate of the mean. Since the ML estimate of the mean is the average of the two numbers (the dashed line), the average squared distance will always be smaller than if the true mean were used, unless the ML estimate of the mean and the true mean coincide. This is why we divide by $n-1$ when estimating the variance from the sum of squares, and it is the motivation behind REML.

```r
y2<-sort(Ndata$y[c(3,5)])
pos.y<-seq(-3,3,0.1)
plot(dnorm(pos.y)~pos.y, type="l")
abline(v=0)

points(y2, dnorm(y2))
lines(c(y2[1],0), c(dnorm(y2[1]),dnorm(y2[1]))+0.01)
lines(c(y2[2],0), c(dnorm(y2[2]),dnorm(y2[2]))+0.01)
text(y2[1], dnorm(y2[1])+0.02, label=bquote(.(round(y2[1],3)) ^2 *"="* .(round((y2[1])^2,3))), pos=2)
text(y2[2], dnorm(y2[2])+0.02, label=bquote(.(round(y2[2],3)) ^2 *"="* .(round((y2[2])^2,3))), pos=4)

abline(v=mean(y2), lty=2)
lines(c(y2[1],mean(y2)), c(dnorm(y2[1],0),dnorm(y2[1],0))-0.01, lty=2)
lines(c(y2[2],mean(y2)), c(dnorm(y2[2],0),dnorm(y2[2],0))-0.01, lty=2)
text(y2[1], dnorm(y2[1])-0.02, label=bquote(.(round(y2[1]-mean(y2),3)) ^2 *"="* .(round((y2[1]-mean(y2))^2,3))), pos=2, col="darkgrey")
text(y2[2], dnorm(y2[2])-0.02, label=bquote(.(round(y2[2]-mean(y2),3)) ^2 *"="* .(round((y2[2]-mean(y2))^2,3))), pos=4, col="darkgrey")
```
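The size of this bias is easy to demonstrate by simulation. Below is a small sketch (not part of the original text) that repeatedly draws two standard normal deviates and compares the average squared deviation taken about the true mean with that taken about the sample mean:

```r
set.seed(1)  # arbitrary seed, purely for reproducibility
bias.check <- replicate(10000, {
  y2 <- rnorm(2)                                  # two draws from N(0, 1)
  c(about.true.mean   = mean((y2 - 0)^2),         # uses the true mean
    about.sample.mean = mean((y2 - mean(y2))^2))  # uses the ML estimate of the mean
})
rowMeans(bias.check)  # roughly 1 and 0.5: the ML estimator recovers (n-1)/n of the variance
```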

## Prior Distribution

`MCMCglmm` uses an inverse Wishart prior for the (co)variances and a normal prior for the fixed effects. In versions $>1.13$ parameter expanded models can be used, which enable prior specifications from the scaled non-central F-distribution \citep{Gelman.2006}. Here, we will focus on specifying a prior for a single fixed effect ($\mu$) and a single variance component using the inverse-Wishart to highlight some of the issues. I strongly recommend reading the section on parameter expanded priors (*secPX-p*) as these can be less informative than the inverse-Wishart in many situations.

The inverse Wishart parameterisation in `MCMCglmm` is non-standard. For a single variance component the inverse Wishart takes two scalar parameters, `V` and `nu`. The distribution tends to a point mass on `V` as the degree of belief parameter `nu` goes to infinity. The distribution tends to be right skewed when `nu` is not very large, with a mode of $\frac{V\nu}{\nu+2}$ but a mean of $\frac{V\nu}{\nu-2}$ (which is not defined for $\nu<2$). The inverse gamma is a special case of the inverse Wishart, although it is parameterised using `shape` and `scale`, where `nu = 2*shape` and `V = scale/shape` (or `shape = nu/2` and `scale = (nu*V)/2`). The R package `MCMCpack` provides a density function (`dinvgamma`) for the inverse gamma distribution.
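To make the mapping between the two parameterisations concrete, here is a small sketch (the choice `nu=6` is arbitrary, and only needs to be bigger than 2 so that the mean exists):

```r
V  <- 1
nu <- 6
shape <- nu/2        # conversions given above
scale <- (nu*V)/2

# theoretical mode and mean under the (V, nu) parameterisation
c(mode = V*nu/(nu + 2), mean = V*nu/(nu - 2))

# numerical checks: maximise the density, and average draws from the distribution
optimize(function(x) MCMCpack::dinvgamma(x, shape=shape, scale=scale),
         interval=c(0.01, 10), maximum=TRUE)$maximum
mean(1/rgamma(1e5, shape=shape, rate=scale))  # an inverse gamma variable is 1/gamma
```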

As before, we can evaluate and plot density functions in order to visualise what the distribution looks like. Figure \@ref(fig:dinvgamma-fig) plots the probability density functions holding `V` equal to one but with `nu` varying.

```r
# Figure (dinvgamma-fig): inverse Wishart densities with V=1 and varying degree of
# belief parameter (nu). With V=1 these distributions are equivalent to inverse
# gamma distributions with shape and scale parameters set to nu/2.
xv<-seq(1e-16, 5, length=1000)
plot(0, type="n", ylim=c(0,0.45), xlim=c(0,5))
nu<-c(1, 0.2, 0.02, 0.002)
V<-1
for(i in 1:4){
  dv<-MCMCpack::dinvgamma(xv, shape=nu[i]/2, scale=(nu[i]*V)/2)
  lines(dv~xv, type="l", col=paste("gray", 100-i*25, sep=""))
  text(xv[which.min(abs(log(dv)-log(0.1*xv))[-1])+1],
       dv[which.min(abs(log(dv)-log(0.1*xv))[-1])+1],
       paste("nu=", nu[i]), pos=4)
}
text(4, 0.4, "V=1", cex=1.3)
```

A probability distribution must integrate to one, because a variable must have some value. It therefore seems reasonable that when specifying a prior, care must be taken that this condition is met. In the example here, where `V` is a single variance, this condition is met if `V>0` and `nu>0`. If this condition is not met then the prior is said to be improper, and in WinBUGS (and possibly other software) improper priors cannot be specified. Although great care has to be taken when using improper priors, `MCMCglmm` does allow them as they have some useful properties, and some common improper priors are discussed in section *IP-sec*. However, for now we will use the prior specification `V=1` and `nu=0.002`, which is frequently used for variance components. For the mean we will use a diffuse normal prior centred around zero but with very large variance ($10^{8}$). If the variance is finite then the prior is always proper.

As before we can write a function for calculating the (log) prior probability:

```r
logprior<-function(par, priorR, priorB){
    dnorm(par[1], mean=priorB$mu, sd=sqrt(priorB$V), log=TRUE)+log(MCMCpack::dinvgamma(par[2],shape=priorR$nu/2,scale=(priorR$nu*priorR$V)/2))    
}
```

where `priorR` is a list with elements `V` and `nu` specifying the prior for the variance, and `priorB` is a list with elements `mu` and `V` specifying the prior for the mean. `MCMCglmm` takes these prior specifications as a list:

```r
prior<-list(R=list(V=1, nu=0.002), B=list(mu=0, V=1e+8))
```

## Posterior Distribution

To obtain a posterior density we need to multiply the likelihood by the prior probability for that set of parameters. We can write a function for doing this:

```r
loglikprior<-function(par, y, priorR, priorB){
loglik(par,y)+logprior(par,priorR,priorB)
}
```

and we can overlay the posterior densities on the likelihood surface we calculated before (Figure \@ref(fig:Lsurface-fig)).

```r
P<-L
for(i in 1:100){
for(j in 1:100){
P[i,j]<-loglikprior(c(mu[i], sigma2[j]), Ndata$y, prior$R, prior$B)
}}
```

```r
# Figure (Psurface-fig): the likelihood Pr(y | mu, sigma^2) in black, and the
# posterior Pr(mu, sigma^2 | y) in red, each normalised to have a maximum of one.
# Priors: mu ~ N(0, 1e8) and sigma^2 ~ IW(V=1, nu=0.002).
contour(mu,sigma2,exp(L-max(L)), xlab=expression(mu), ylab=expression(sigma^2), cex.lab=1.25, nlevels=15)
contour(mu,sigma2,exp(P-max(P)), add=TRUE, nlevels=15, col="red", drawlabels = FALSE)
```

The prior has some influence on the posterior mode of the variance, and we can use an optimisation algorithm again to locate the mode:  

```r
Best<-optim(c(mean=0,var=1), fn=loglikprior, y=Ndata$y, priorR=prior$R, priorB=prior$B, method="L-BFGS-B", lower=c(-1e+5, 1e-5), upper=c(1e+5,1e+5), control = list(fnscale = -1, factr=1e-16))$par
Best
```

The posterior mode for the mean is identical to the ML estimate, but the posterior mode for the variance is even lower than the ML estimate, which is itself known to be downwardly biased. The reason the ML estimate is downwardly biased is that it did not take into account the uncertainty in the mean. In a Bayesian analysis we can account for this uncertainty by evaluating the marginal distribution of $\sigma^{2}$, averaging over the uncertainty in the mean.

### Marginal Posterior Distribution

The marginal distribution is often of primary interest in statistical inference, because it represents our knowledge about a parameter given the data:

$$ Pr(\sigma^{2} | {\bf y}) \propto \int Pr(\mu, \sigma^{2} | {\bf y})d\mu $$

after averaging over any nuisance parameters, such as the mean in this case.
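In this toy example the integral can be approximated directly from the grid of log posterior densities (`P`) computed above, by summing the unnormalised posterior over the $\mu$ dimension. This is only a rough sketch (the grid is coarse and truncated), but it illustrates what marginalisation means:

```r
post <- exp(P - max(P))                       # unnormalised posterior on the grid
marg.sigma2 <- colSums(post)                  # sum (integrate) over mu for each value of sigma^2
marg.sigma2 <- marg.sigma2/sum(marg.sigma2)   # normalise so the probabilities sum to one
sigma2[which.max(marg.sigma2)]                # approximate marginal posterior mode
```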

Obtaining the marginal distribution analytically is usually impossible, and this is where MCMC approaches prove useful. We can fit this model in `MCMCglmm` pretty much in the same way as we did using `glm`:

```r
m1a.2<-MCMCglmm(y~1, data=Ndata, prior=prior, thin=1, verbose=FALSE)
```

The Markov chain is drawing random (but often correlated) samples from the joint posterior distribution (depicted by the red contours in Figure \@ref(fig:Psurface-fig)). The element of the output called `Sol` contains the distribution for the mean, and the element called `VCV` contains the distribution for the variance. We can produce a scatter plot:

```r
contour(mu,sigma2,exp(L-max(L)), xlab=expression(mu), ylab=expression(sigma^2), cex.lab=1.25, nlevels=15)
contour(mu,sigma2,exp(P-max(P)), add=TRUE, nlevels=15, col="red", drawlabels = FALSE)
points(cbind(m1a.2$Sol, m1a.2$VCV))
```

and we see that `MCMCglmm` is sampling the same distribution as the posterior distribution calculated on a grid of possible parameter values (Figure \@ref(fig:PsurfaceMCMC-fig)).

```r
# Figure (PsurfaceMCMC-fig): samples from the posterior Pr(mu, sigma^2 | y) using
# MCMC (black dots), with the posterior density evaluated on a grid of parameter
# values (red contours, normalised so that the posterior mode has a value of one).
contour(mu,sigma2,exp(P-max(P)), nlevels=15, col="red", drawlabels = FALSE)
points(cbind(m1a.2$Sol, m1a.2$VCV), cex=0.2)
contour(mu,sigma2,exp(P-max(P)), nlevels=15, col="red", add=TRUE, drawlabels = FALSE)
```

A very nice property of MCMC is that we can normalise the density so that it integrates to 1 (a true probability) rather than normalising it with respect to some other aspect of the distribution, such as the density at the ML estimator or the joint posterior mode as in Figures \@ref(fig:Lsurface-fig) and \@ref(fig:Psurface-fig). To make this clearer, imagine we wanted to know how much more probable the unit normal (i.e. with $\mu=0$ and $\sigma^{2}=1$) was than a normal distribution with the posterior modal parameters. We can calculate this by taking the ratio of the posterior densities at these two points:

```r
nb<-names(Best)
names(Best)<-NULL
exp(loglikprior(Best, Ndata$y, prior$R, prior$B)-loglikprior(c(0,1), Ndata$y, prior$R, prior$B))
names(Best)<-nb
```

Now, if we wanted to know the probability that the parameters lie in the region of parameter space we were plotting, i.e. in the square $\mu = (-2,2)$ and $\sigma^{2} = (0,5)$, then this would be more difficult. We would have to evaluate the density at a much larger range of parameter values than we had done, ensuring that we had covered all regions with positive probability. Because MCMC has sampled the distribution randomly, this probability will be equal to the expected probability that we have drawn an MCMC sample from the region. We can obtain an estimate of this by seeing what proportion of our actual samples lie in this square:

```r
prop.table(table(m1a.2$Sol>-2 & m1a.2$Sol<2 & m1a.2$VCV<5))
```

There is Monte Carlo error in the answer (`r formatC(prop.table(table(m1a.2$Sol>-2 & m1a.2$Sol<2 & m1a.2$VCV<5))[2], format="f",3)`) but if we collect a large number of samples then this can be minimised.
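As a rough sketch of how large that Monte Carlo error is (not part of the original text), we can treat "the sample falls in the square" as a binary variable and compute a standard error for the proportion, using the effective sample size from `coda` to allow for autocorrelation:

```r
in.square <- as.numeric(m1a.2$Sol > -2 & m1a.2$Sol < 2 & m1a.2$VCV < 5)
p.hat <- mean(in.square)
n.eff <- unname(coda::effectiveSize(coda::mcmc(in.square)))  # effective number of independent samples
c(estimate = p.hat, MC.se = sqrt(p.hat*(1 - p.hat)/n.eff))
```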

Using a similar logic we can obtain the marginal distribution of the variance by simply evaluating the draws in `VCV` ignoring (averaging over) the draws in `Sol`:

```r
hist(m1a.2$VCV[which(m1a.2$VCV<5)])
abline(v=Best["var"], col="red")
```

```r
# Figure: marginal posterior distribution of sigma^2, Pr(sigma^2 | y), using MCMC.
# The vertical line is the joint posterior mode, which differs slightly from the
# marginal posterior mode (the peak of the marginal distribution).
hist(m1a.2$VCV[which(m1a.2$VCV<5)], breaks=50, main=expression(paste("Posterior Distribution of ", sigma^2)), xlab=expression(sigma^2))
abline(v=Best["var"], col="red")
```

In this example (see Figure \@ref(fig:PsurfaceMCMC-fig)) the marginal mode and the joint mode are very similar, although this is not necessarily the case and can depend both on the data and the prior. Section *IP-sec* introduces improper priors that are non-informative with regard to the marginal distribution of a variance.

## MCMC

In order to be confident that `MCMCglmm` has successfully sampled the posterior distribution it will be necessary to have a basic understanding of MCMC methods. MCMC methods are often used when the joint posterior distribution cannot be derived analytically, which is nearly always the case. MCMC relies on the fact that although we cannot derive the complete posterior, we can calculate the height of the posterior distribution at a particular set of parameter values, as we did to obtain the contour plot in Figure \@ref(fig:Psurface-fig). However, rather than going systematically through every likely combination of $\mu$ and $\sigma^{2}$ and calculating the height of the distribution at regular distances, MCMC moves stochastically through parameter space, hence the name 'Monte Carlo'.

### Starting values

First we need to initialise the chain and specify a set of parameter values from which the chain can start moving through parameter space. Ideally we would like to pick a region of high probability, as we do not want to waste time wandering through regions of low probability: we are not so interested in determining the height of the distribution far outside of  Figure \@ref(fig:PsurfaceMCMC-fig) as it is virtually flat and close to zero (or at least we hope so!).  Although starting configurations can be set by the user using the `start` argument, in general the heuristic techniques used by MCMCglmm seem to work quite well. We will denote the parameter values of the starting configuration (time $t=0$) as $\mu_{t=0}$ and ${\sigma^{2}}_{t=0}$. There are several ways in which we can get the chain to move in parameter space, and MCMCglmm uses a combination of Gibbs sampling, slice sampling and Metropolis-Hastings updates. To illustrate, it will be easier to turn the contour plot of the posterior distribution into a perspective plot (Figure \@ref(fig:Psurface-persp-fig)).


```r
P.new<-L
mu.new<-seq(Best[1]-1, Best[1]+1, length=100)
sigma2.new<-seq(max(Best[2]-1,0.00001), Best[2]+1, length=100)
for(i in 1:100){
for(j in 1:100){
P.new[i,j]<-loglikprior(c(mu.new[i], sigma2.new[j]), Ndata$y, prior$R, prior$B)
}}
```

```r
# Figure (Psurface-persp-fig): perspective plot of the posterior Pr(mu, sigma^2 | y),
# equivalent to the contour plot in Figure Psurface-fig.
par(bg="white")
persp(mu.new,sigma2.new,exp(P.new-max(P.new)), theta = 20, ltheta=60, phi = 30, lphi=0.3, expand = 0.5, col ="lightblue", shade=0.3, xlab="", ylab="", zlab="")
text(-0.45,-0.11,"Pr")
text(-0.1,-0.35,expression(mu))
text(0.3,-0.2,expression(sigma^2))
```

### Metropolis-Hastings updates

After initialising the chain we need to decide where to go next, and this decision is based on two rules. First we have to generate a candidate destination, and then we need to decide whether to go there or stay where we are. There are many ways in which we could generate candidate parameter values, and `MCMCglmm` uses a well tested and simple method. A random set of coordinates is picked from a multivariate normal distribution that is centred on the initial coordinates $\mu_{t=0}$ and $\sigma^{2}_{t=0}$. We will denote this new set of parameter values as $\mu_{new}$ and $\sigma^{2}_{new}$. The question then remains whether to move to this new set of parameter values or remain at our current parameter values, now designated as old: $\mu_{old}=\mu_{t=0}$ and $\sigma^{2}_{old}=\sigma^{2}_{t=0}$. If the posterior probability for the new set of parameter values is greater, then the chain moves to this new set of parameters and the chain has successfully completed an iteration ($\mu_{t=1} = \mu_{new}$ and $\sigma^{2}_{t=1}=\sigma^{2}_{new}$). If the new set of parameter values has a lower posterior probability then the chain may move there, but not all the time. The probability that the chain moves to low lying areas is determined by the relative difference between the old and new posterior probabilities. If the posterior probability for $\mu_{new}$ and $\sigma^{2}_{new}$ is 5 times less than the posterior probability for $\mu_{old}$ and $\sigma^{2}_{old}$, then the chain would move to the new set of parameter values 1 in 5 times. If the move is successful then we set $\mu_{t=1} = \mu_{new}$ and $\sigma^{2}_{t=1}=\sigma^{2}_{new}$ as before, and if the move is unsuccessful then the chain stays where it is ($\mu_{t=1} = \mu_{old}$ and $\sigma^{2}_{t=1}=\sigma^{2}_{old}$). Using these rules we can record where the chain has travelled and generate an approximation of the posterior distribution: essentially a histogram of Figure \@ref(fig:Psurface-persp-fig).
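For this two-parameter example the recipe is easy to sketch in a few lines of R, using the `loglikprior` function defined earlier. This toy random-walk sampler is for illustration only and is not how `MCMCglmm` is implemented; the step sizes are arbitrary:

```r
mh.sample <- function(n.iter, start, step, y, priorR, priorB){
  chain <- matrix(NA, n.iter, 2, dimnames=list(NULL, c("mu", "sigma2")))
  cur <- start
  cur.lp <- loglikprior(cur, y, priorR, priorB)
  for(t in 1:n.iter){
    cand <- rnorm(2, mean=cur, sd=step)            # candidate centred on the current values
    if(cand[2] > 0){                               # the variance must remain positive
      cand.lp <- loglikprior(cand, y, priorR, priorB)
      # accept with probability min(1, ratio of posterior densities)
      if(log(runif(1)) < cand.lp - cur.lp){ cur <- cand; cur.lp <- cand.lp }
    }
    chain[t, ] <- cur                              # a rejected move repeats the old values
  }
  chain
}
mh.chain <- mh.sample(5000, start=c(0, 1), step=c(0.5, 0.5),
                      y=Ndata$y, priorR=prior$R, priorB=prior$B)
colMeans(mh.chain)  # rough posterior means for mu and sigma^2
```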

### Gibbs Sampling

Gibbs sampling is a special case of Metropolis-Hastings updating, and MCMCglmm uses Gibbs sampling to update most parameters.  In the Metropolis-Hastings example above, the Markov Chain was allowed to move in both directions of parameter space simultaneously.  An equally valid approach would have been to set up two Metropolis-Hastings schemes where the chain was first allowed to move along the $\mu$ axis, and then along the $\sigma^{2}$ axis. In Figure \@ref(fig:Psurface-persp2-fig) I have cut the posterior distribution of Figure \@ref(fig:Psurface-persp-fig) in half, and the edge of the surface facing left is the conditional distribution of $\mu$ given that $\sigma^{2}=1$:


$$ Pr(\mu |\sigma^{2}=1, {\bf y}). $$


```r
P.new2<-L
mu.new<-seq(Best[1]-1, Best[1]+1, length=100)
sigma2.new2<-seq(1, Best[2]+2, length=100)
for(i in 1:100){
for(j in 1:100){
P.new2[i,j]<-loglikprior(c(mu.new[i], sigma2.new2[j]), Ndata$y, prior$R, prior$B)
}}
```

```r
# Figure (Psurface-persp2-fig): the posterior Pr(mu, sigma^2 | y), but only for values
# of sigma^2 between 1 and 5 rather than 0 to 5 (compare Figure Psurface-persp-fig).
# The edge of the surface facing left is the conditional distribution of the mean when
# sigma^2 = 1, Pr(mu | y, sigma^2 = 1), which follows a normal distribution.
par(bg="white")
persp(mu.new,sigma2.new2,exp(P.new2-max(P.new2)), theta = 50, ltheta=90, phi = 20, lphi=80, expand = 0.5, col ="lightblue", shade=0.1, xlab="", ylab="", zlab="")
text(-0.402,0,"Pr")
text(-0.2,-0.3,expression(mu))
text(0.2,-0.3,expression(sigma^2))
```

In some cases, the equation that describes this conditional distribution can be derived even though the equation for the complete joint distribution of Figure \@ref(fig:Psurface-persp-fig) remains unknown. When the conditional distribution of $\mu$ is known we can use Gibbs sampling. Let's say the chain at a particular iteration is located at $\sigma^{2}=1$. If we updated $\mu$ using a Metropolis-Hastings algorithm we would generate a candidate value and evaluate its relative probability compared to the old value. This procedure would take place in the slice of posterior facing left in Figure \@ref(fig:Psurface-persp2-fig). However, because we know the actual equation for this slice we can just generate a new value of $\mu$ directly. This is Gibbs sampling. The slice of the posterior that we can see in Figure \@ref(fig:Psurface-persp2-fig) actually has a normal distribution. Because of the weak prior, this normal distribution has a mean close to the mean of ${\bf y}$ and a variance close to $\frac{\sigma^{2}}{n} = \frac{1}{n}$. Gibbs sampling can be much more efficient than Metropolis-Hastings updates, especially when high dimensional conditional distributions are known, as is typical in GLMMs. A technical description of the sampling schemes used by `MCMCglmm` is given in appendix *MCMC-app*, but it is perhaps not essential reading.
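For this model the full conditionals are simple enough that a complete two-block Gibbs sampler can be sketched directly. The sketch below (not part of the original text) treats the very diffuse normal prior on $\mu$ as flat, and uses the standard conjugate result that $\sigma^{2}$ given $\mu$ is inverse gamma under the inverse gamma (inverse Wishart) prior used above; it is for illustration only:

```r
gibbs.sample <- function(n.iter, y, V, nu){
  n <- length(y)
  chain <- matrix(NA, n.iter, 2, dimnames=list(NULL, c("mu", "sigma2")))
  sigma2 <- 1                                       # starting value
  for(t in 1:n.iter){
    mu     <- rnorm(1, mean(y), sqrt(sigma2/n))     # mu | sigma^2, y is normal
    sigma2 <- 1/rgamma(1, shape=(nu + n)/2,         # sigma^2 | mu, y is inverse gamma
                          rate =(nu*V + sum((y - mu)^2))/2)
    chain[t, ] <- c(mu, sigma2)
  }
  chain
}
gibbs.chain <- gibbs.sample(5000, Ndata$y, V=1, nu=0.002)
colMeans(gibbs.chain)  # every draw is accepted, unlike Metropolis-Hastings
```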

### Slice Sampling

If the distribution can be factored such that one factor is a distribution from which truncated random variables can be drawn, then the slice sampling methods of \citet{Damien.1999} can be used. The latent variables in univariate binary models can be updated in this way if `slice=TRUE` is specified in the call to MCMCglmm. In these models, slice sampling is only marginally more efficient than adaptive Metropolis-Hastings updates when the residual variance is fixed. However, for parameter expanded binary models where the residual variance is not fixed, the slice sampler can be much more efficient.

## MCMC Diagnostics

When fitting a model using MCMCglmm the parameter values through which the Markov chain has travelled are stored and returned.  The length of the chain (the number of iterations) can be specified using the `nitt` argument (the default is 13,000), and should be long enough so that the posterior approximation is valid.  If we had known the joint posterior distribution in Figure \@ref(fig:Psurface-persp-fig) we could have set up a Markov chain that sampled directly from the posterior.  If this had been the case, each successive value in the Markov chain would be independent of the previous value after conditioning on the data, ${\bf y}$, and a thousand iterations of the chain would have produced a histogram that resembled Figure \@ref(fig:Psurface-persp-fig) very closely.  However, generally we do not know the joint posterior distribution of the parameters, and for this reason the parameter values of the Markov chain at successive iterations are usually not independent and care needs to be taken regarding the validity of the approximation.  MCMCglmm returns the Markov chain as `mcmc` objects, which can be analysed using the `coda` package.  The function `autocorr` estimates the level of non-independence between successive samples in the chain:

```r
autocorr(m1a.2$Sol)
autocorr(m1a.2$VCV)
```

The correlation between successive samples is low for the mean (`r formatC(autocorr(m1a.2$Sol)[2], format="f", 3)`) but a bit high for the variance (`r formatC(autocorr(m1a.2$VCV)[2], format="f", 3)`). When autocorrelation is high the chain needs to be run for longer, and this can lead to storage problems for high dimensional problems. The argument `thin` can be passed to `MCMCglmm` specifying the intervals at which the Markov chain is stored. In model `m1a.2` we specified `thin=1`, meaning we stored every iteration (the default is `thin=10`). I usually aim to store 1,000-2,000 iterations and have the autocorrelation between successive stored iterations less than 0.1.
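A related summary is the effective sample size, which converts the autocorrelation into an equivalent number of independent samples (a quick sketch using the `coda` package):

```r
coda::effectiveSize(m1a.2$Sol)  # close to the number of stored samples for the mean
coda::effectiveSize(m1a.2$VCV)  # smaller for the variance because of the autocorrelation
```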

The approximation obtained from the Markov chain is conditional on the set of parameter values that were used to initialise the chain. In many cases the first iterations show a strong dependence on the starting parametrisation, but as the chain progresses this dependence may be lost. As the dependence on the starting parametrisation diminishes the chain is said to converge, and the argument `burnin` can be passed to `MCMCglmm` specifying the number of iterations which must pass before samples are stored. The default burn-in period is 3,000 iterations. Assessing convergence of the chain is notoriously difficult, but visual inspection and diagnostic tools such as `gelman.diag` often suffice.
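As a sketch of how `gelman.diag` could be used here (this is not in the original text), we can run a second chain from the same model specification and compare the two; ideally the chains would be started from dispersed values via the `start` argument, and the potential scale reduction factor should be close to 1:

```r
m1a.2b <- MCMCglmm(y~1, data=Ndata, prior=prior, thin=1, verbose=FALSE)  # a second chain
coda::gelman.diag(coda::mcmc.list(m1a.2$Sol, m1a.2b$Sol))  # diagnostic for the mean
coda::gelman.diag(coda::mcmc.list(m1a.2$VCV, m1a.2b$VCV))  # diagnostic for the variance
```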

```r
plot(m1a.2$Sol)
```

On the left of Figure \@ref(fig:time-series-fig) is a time series of the parameter as the MCMC iterates, and on the right is a posterior density estimate of the parameter (a smoothed histogram of the output). If the model has converged there should be no trend in the time series. The equivalent plot for the variance is a little hard to see on the original scale, but on the log scale the chain looks good (Figure \@ref(fig:time-series2-fig)):

```r
plot(log(m1a.2$VCV))
```

## Improper Priors

When improper priors are used there are two potential problems that may be encountered. The first is that if the data do not contain enough information the posterior distribution itself may be improper, and any results obtained from `MCMCglmm` will be meaningless. In addition, with proper priors there is a zero probability of a variance component being exactly zero, but this is not necessarily the case with improper priors. This can produce numerical problems (trying to divide through by zero) and can also result in a reducible chain. A reducible chain is one which gets 'stuck' at some parameter value and cannot escape. This is usually obvious from the `mcmc` plots, but `MCMCglmm` will often terminate before the analysis has finished with an error message of the form:

```
ill-conditioned G/R structure: use proper priors ...
```

However, improper priors do have some useful properties.

### Flat Improper Prior

The simplest improper prior is one that is proportional to some constant for all possible parameter values. This is known as a flat prior and the posterior density in such cases is equal to the likelihood:

$$ Pr(\mu, \sigma^{2} | {\bf y}) \propto Pr({\bf y} | \mu, \sigma^{2}) $$

It is known that although such a prior is non-informative for the mean, it is informative for the variance. We can specify a flat prior on the variance component by having `nu=0` (the value of `V` is irrelevant), and the default prior for the mean is so diffuse as to be essentially flat across the range $(-10^6, 10^6)$.

```r
prior.m1a.3<-list(R=list(V=1, nu=0))
m1a.3<-MCMCglmm(y~1, data=Ndata, thin=1, prior=prior.m1a.3, verbose=FALSE)
```

We can overlay the joint posterior distribution on the likelihood surface (Figure \@ref(fig:Psurface-flat-fig)) and see that the two things are in close agreement, up to Monte Carlo error.

```r
kda.3<-MASS::kde2d(m1a.3$Sol, log(m1a.3$VCV), lims=c(-2,2,log(0.1),log(5)), n=50)
kda.3$y<-exp(kda.3$y)
```

```r
# Figure (Psurface-flat-fig): the likelihood Pr(y | mu, sigma^2) in black, and an MCMC
# approximation of the posterior Pr(mu, sigma^2 | y) in red, each normalised to have a
# maximum of one. With flat priors the posterior distribution is equivalent to the likelihood.
contour(mu,sigma2,exp(L-max(L)), xlab=expression(mu), ylab=expression(sigma^2), cex.lab=1.25, nlevels=15)
contour(kda.3, add=TRUE, col="red", drawlabels = FALSE)
```

### Non-Informative Improper Prior

Although inverse-Wishart distributions with negative degree of belief parameters are not defined, the resulting posterior distribution can be defined if there is sufficient replication. Specifying `V=0` and `nu=-1` is equivalent to a uniform prior for the standard deviation on the interval $(0,\infty]$, and specifying `V=0` and `nu=-2` is non-informative for a variance component.

```r
prior.m1a.4<-list(R=list(V=1e-16, nu=-2))
m1a.4<-MCMCglmm(y~1, data=Ndata, thin=1, prior=prior.m1a.4, verbose=FALSE)
kda.4<-MASS::kde2d(m1a.4$Sol, log(m1a.4$VCV), lims=c(-2,2,log(0.1),log(5)), n=50)
kda.4$y<-exp(kda.4$y)
```

```r
# Figure: the likelihood Pr(y | mu, sigma^2) in black, and an MCMC approximation of the
# posterior Pr(mu, sigma^2 | y) in red, each normalised to have a maximum of one.
# A non-informative prior was used (mu ~ N(0, 1e8) and sigma^2 ~ IW(V=0, nu=-2)).
contour(mu,sigma2,exp(L-max(L)), xlab=expression(mu), ylab=expression(sigma^2), cex.lab=1.25, nlevels=15)
contour(kda.4, add=TRUE, col="red", drawlabels = FALSE)
```

The joint posterior mode does not coincide with either the ML or REML estimator (Figure \@ref(fig:Psurfaceb-fig)) but the marginal distribution of the variance component is equivalent to the REML estimator (see Figure \@ref(fig:Pmarg-NI-fig)):

```r
# Figure (Pmarg-NI-fig): MCMC estimate of the marginal posterior Pr(sigma^2 | y) under
# the non-informative prior (mu ~ N(0, 1e8) and sigma^2 ~ IW(V=0, nu=-2)). The REML
# estimate of the variance (red line) coincides with the marginal posterior mode.
hist(m1a.4$VCV[which(m1a.4$VCV<5)], breaks=30)
abline(v=summary(m1a.1)$dispersion, col="red")
```

