Markov chain Monte Carlo sampler for BPEC


Markov chain Monte Carlo for Bayesian Phylogeographic and Ecological Clustering, implemented in C. Given a dataset of DNA sequences (non-recombinant, typically mtDNA) and their respective geographical locations (longitude and latitude), the algorithm simultaneously draws inferences about the genealogy (in the form of a haplotype tree) and the population clustering (with an unknown number of clusters). In addition, the algorithm identifies locations with high posterior probability of being ancestral. In case where additional covariates are available (e.g. climate data), these may be added to the 2-dimensional data and inserted into the analysis.





The input DNA sequences.


A matrix where each row shows latitude, longitude (plus any additional covariates), plus all the haplotype numbers found at each location.


The maximum number of migration events (this means that the maximum number of clusters will be maxmig+1). The number you enter here is just an upper bound, so start with maxmig=6 and only increase it if you are really getting 7 clusters in return (which could mean that more clusters are appropriate). If, say, 4 clusters are needed, whether you use maxmig=6 or maxmig=10 (or similar), the number of clusters will collapse down to 4.


The number of iterations for the MCMC sampler, must be a multiple of 10. You will need quite a large number here, like 100,000. Two MCMC chains will run, after which convergence is checked. If convergence has not been reached, the output will say "NO CONVERGENCE" and you should increase the number of iterations.


This represents the parsimony relaxation parameter, with 0 being the minimum. Generally the higher ds, the more candidate trees are considered, but this comes at a computational cost. Start with ds=0 and increase to ds=1, etc, observing any changes.


How many posterior samples per chain to save for use post-processing. A value of at least PostSamples=1000 would provide a reasonable assessment of posterior uncertainty. PostSamples must not be greater than iter/10. Also, only up to 20 saved (thinned) samples are used in the BPEC.ContourPlot function.


The dimension, 2 for purely geographical data, +1 for each covariate (for example if environmental or phenotypic characteristics are also available).


This is the main command for BPEC, the details of which can be found in the provided references. In short, the model is such that, for a given haplotype tree, clusterings are the result of the migration of an individual, such that all the descendants of that individual belong to the new cluster (unless they subsequently migrate themselves). The number of migrating individuals is itself a parameter. Conditional on the clustering allocations for each sequence (note: not all samples of a haplotype need to belong to the same cluster), the geographical (and covariate) distribution for each cluster is assumed to be Gaussian. The distributions of longitude and latitude in each cluster are assumed to have an unknown covariance, whereas additional covariates are assumed to be independent.

Posterior samples are obtained through Markov chain Monte Carlo (MCMC) sampling. Two Markov chain Monte Carlo runs are carried out. In each, the sampler sweeps through the updates of all the parameters: the haplotype tree, the root of the tree, the number of clusters, the precise clustering (as a result of migrations) the locations and variances of the clusters. Metropolis-Hastings updates are performed for most of the parameters. In the case of the number of clusters, Metropolis-Hastings Reversible Jump updates are performed.

The space of possible clusterings is vast and highly multi-modal. The migration of an individual to a new cluster implies that all of their descendants will belong to the new cluster. This results in a combinatorial parameter space which is challenging to explore. A number of sophisticated tricks are used in order to overcome this challenge, alternating between local and global MCMC proposal moves. The biggest challenge is to converge to the region of high posterior probability in terms of the number of clusters and the cluster allocation. As such, the first 90 percent of the iterations are discarded as burn-in and only the final 10 percent are used as potential posterior samples.

BPEC.MCMC requires 2 input files in order to run:

Haplotypes.nex : The sequence file in NEXUS format. Sequence labels should either be integers, or contain unique integers which correspond to the labels in the CoordsLocsFile.txt. For example, '1', '1_label', 'label1_label' will all be treated as haplotype 1. NOTE: BPEC will currently ignore unknown nucleotides in the inference.

CoordsLocsFile.txt : For each observation, the coordinates (latitude and longitude, please use a +/- to indicate W or E), any other environmental or phenotypic covariates (the latitude and longitude MUST come first), plus the ID number of the haplotype (must match the number in the sequence NEXUS file). If more than one haplotype were found at a single location, these can be entered one after the other, eg:

  36.88	-5.42	 24	 25
  37.00	-3.98	245	251	243	142	143	244	246	247
  43.35  1.48	153		

so, in the first location (lat/long 36.88, -5.42) you have 2 sampled individuals with haplotypes 24,25, in the second location eight individuals etc. Sequences don't necessarily need to be collapsed onto haplotypes, the program should take care of it.



The output DNA sequences of distinct haplotypes, collapsed to effective nucleotide sites (both sampled and missing sequences which were inferred).


A matrix where each row has the coordinates of each location (lat/long), any environmental covariates, plus the haplotypes found at each location. All rows must have the same length, so sampling locations with fewer sampled sequences will be filled with NAs. Use the command provided in the example in order to create the input matrix


The dimension of the input (2 for geographical, +1 for every additional environmental/phenotypic covariate)


The number of sampling locations


The number of times each haplotype was observed


The number of input sequences


the effective length of the input sequences, given by the number of variable nucleotide sites which are informative. In other words, if two different nucleotide sites are variable in exactly the same haplotypes, then they effectively provide information of a single site.


A vector of the numerical labels of each haplotype.


The total number of haplotypes (observed and unobserved) in the final tree


Correspondence vector each of the processed observations to the original haplotype labels


Posterior means of the cluster centres


Posterior means of the cluster covariances


A set of posterior samples of the cluster centres


A set of posterior samples of the cluster covariances


The MAP adjacency matrix for the tree in vectorised format: this means that for two haplotypes i,j, the (i,j)th entry of the matrix is 1 if the haplotypes are connected in the network and 0 otherwise.


The set of edges (from and to haplotypes) of the Maximum A Posteriori haplotype tree (could be used in another program if needed)


Posterior probabilities of each edge being present, so that any edge which is not part of a loop will have posterior probability 1.


The posterior probability of 0...MaxMig migrations


The posterior probability that each haplotype was the root of the tree


Vector of posterior probabilities of each sampling location being the ancestral location


For each haplotype, posterior probabilities that it belongs to each cluster


Starting from the root (level 0) all the way to the tips, the discrete depth for the Maximum A Posteriori tree plot


Posterior means for each dimension of the centre of each cluster for each of the two MCMC chains.


Various tuning parameters used in the MCMC, this is only important for development


Posterior samples from the two MCMC chains for the cluster means, cluster covariance entries, as well as the root haplotype. Note that, since the number of clusters varies from iteration to iteration, some samples are simply draws from the prior (corresponding to empty clusters). This variable can be loaded directly into the coda package for convergence analysis.


Ioanna Manolopoulou


I. Manolopoulou and B.C. Emerson (2012). Phylogeographic ancestral inference using the coalescent model on haplotype trees. Journal of Computational Biology, 19(6), 745-755.

I. Manolopoulou, L. Legarreta, B.C. Emerson, S. Brooks, and S. Tavaré (2011). A Bayesian approach to phylogeographic clustering. Interface focus, rsfs20110054.


#to use the example dataset:
CoordsLocs = MacrocnemisCoordsLocs
RawSeqs = MacrocnemisRawSeqs

##to use your own dataset 
#RawSeqs ='Haplotypes.nex')
#CoordsLocs = read.table(
# 'CoordsLocsFile.txt',header=FALSE,fill=TRUE,col.names=1:max(count.fields('CoordsLocsFile.txt'))
#  )
## to set phenotypic/environmental covariate names, use (as appropriate)
# colnames(CoordsLocs)[1:dims] = c('lat','long','cov1','cov2','cov3')   
## where dims is the corresponding number of measurements available 
## (2 for latitude and longitude only, add one for each additional available measurement) 

#to run the MCMC sampler: 
MCMCout = BPEC.MCMC(RawSeqs,CoordsLocs,MaxMig=2,iter=50,ds=0,PostSamples=5,dims=8)
comments powered by Disqus