R/obtainDatedPosteriorTreesMrB.R
In paleotree: Paleontological and Phylogenetic Analyses of Evolution

Documented in obtainDatedPosteriorTreesMrB

#' Get the Sample of Posterior Trees from a Dated Phylogenetic Analysis
#' with MrBayes (Or a Summary Tree, such as the MCCT)
#' 
#' MrBayes is not great for getting samples of dated posterior
#' phylogenies, or for obtaining certain summary trees from
#' the posterior (specifically the MCCT and MAP, which are specific
#' trees in the posterior). This is because the tree
#' samples as returned are scaled relative to rate parameters in a
#' separate file. This function attempts to automate
#' the handling of multiple files (both \emph{.t} tree files and \emph{.p}
#' parameter files), as well as multiple files
#' associated with separate runs, to obtain samples of posterior
#' trees, or summary trees such as the MCCT or MAP.
#' These resulting trees are now scaled to units of time, but
#' not be placed correctly on an absolute time-scale
#' if all tips are extinct. See details of output below.

#' @details
#' This function is most useful for dealing with dating
#' analyses in MrBayes, particularly when tip-dating
#' a tree with fossil taxa, as the half-compatibility
#' and all-compatibility summary trees offered by the
#' '\code{sumt}' command in MrBayes can have issues properly
#' portraying summary trees from such datasets.

#' @param runFile A filename in the current directory,
#' or a path to a file that is either a \emph{.p} 
#' or \emph{.t} file from a MrBayes analysis. This filename
#' and path will be used for finding additional 
#' \emph{.t} and \emph{.p} files, via the \code{nRuns}
#' settings and assuming that files are in the
#' same directory \emph{and} these files are named under
#' typical MrBayes file naming conventions. (In other words,
#' if you have renamed your \emph{.p} or \emph{.t} files,
#' this function probably won't be able to find them.)

#' @param nRuns The number of runs in your analysis. This variable is used for figuring out what 
#' filenames will be searched for: if you specify that you have less runs than you
#' actually ran in reality, then some runs won't be examined in this function. Conversely,
#' specify too many, and this function will throw an error when it cannot find files it expects
#' but do not exist. The default for this argument
#' (\emph{two} runs) is based on the default number of runs in MrBayes.

#' @param burnin The fraction of trees sampled in the posterior discarded  and not returned
#' by this function directly, nor included in calculation of summary trees. Must be a numeric
#' value greater than 0 and less than 1.

#' @param getFixedTimes If \code{TRUE}, this function will also look for, scan, and parse an
#' associated NEXUS file. Ignoring any commented lines 
#' (i.e., anything between a set of rectangular brackets \code{[]} ), 
#' commands for fixing taxa will be identified, parsed and returned to the user, either as a message
#' printed to the R console if output is written to a file, or as a attribute named 'fixed ages'
#' if output as an R object (formatted as a two-column
#' table of OTU names and their respective fixed ages).
#' 
#' Please note: the code for \code{getFixedTimes = TRUE} contains a \code{while()}
#' loop in it for removing nested series of
#' square brackets (i.e. treated as comments in NEXUS files). Thus files with
#' ridiculously nested series of brackets may cause this code to take a while
#' to complete, or may even cause it to hang.

# weird out of place line - copy/paste error? (12-11-19)
# If the output is an R object, these objects with ??


#' @param getRootAges \code{FALSE} by default. 
#' If \code{TRUE}, and \code{getFixedTimes = TRUE}
#' as well as \code{file = NULL}
#' (such that trees will be assigned within the R memory
#' rather than saved to an external file), the
#' functions \code{setRootAge} and its wrapper function
#' \code{setRootAges} will be applied to the output
#' so that all output trees have \code{root.time}
#' elements for use with other functions in \code{paleotree}
#' as well as other packages.

#' @param originalNexusFile Filename (and possibly path too) to the original NEXUS file for this analysis.
#' Only tried if \code{getFixedTimes = TRUE}. If \code{NULL} (the default), then this function will
#' instead try to find a NEXUS file with the same name as implied by the filename used in other inputs. If
#' this file cannot be found, the function will fail. 

#' @param labelPostProb Logical. If \code{TRUE}, then nodes of the
#' output tree will be labeled with their respective posterior 
#' probabilities, as calculated based on the frequency of a clade
#' occurring across the post-burn-in posterior tree sample. If \code{FALSE},
#' this is skipped.
	
#' @param outputTrees Determines the output trees produced; for format of output, see section
#' on returned Value below. Must be of length one, and either \code{"all"},
#' which means all trees from the post-burn-in posterior will
#' returned, a number greater than zero, which will be the number of trees
#' randomly sampled from across the post-burn-in posterior and returned,
#' or a label for a type of summary tree selected from the posterior based on various
#' properties. The two most commonly seen such point-estimate-summaries are
#' the \emph{MCCT} tree, which stands for the 'maximum clade compatibility tree',
#' and the \emph{MAP} tree, which stands for the 'maximum a posteriori tree'. 
#' The MCCT is the single tree from the post-burn-in posterior which
#' contains the set of bifurcations (clades) with the highest product of posterior
#' probabilities (i.e. are found on the most trees in the post-burn-in posterior).
#' The MCCT tree is returned if the argument \code{outputTrees = "MCCT"} is used.
#' The MAP is the single tree from the post-burn-in posterior with the highest
#' posterior probabilty associated with it. Unfortunately, versions of
#' \code{paleotree} prior to version 3.2.1 did not use the posterior probability
#' to select the supposed 'MAP' tree. MrBayes provides two values
#' for each sampled tree and corresponding parameters: 
#' \code{LnPr}, the log prior probability of the current parameter 
#' proposals under the specified prior distributions, 
#' and \code{LnL}, the log likelihood of the cold chain, i.e. the log-likelihood
#' of the sampled parameter values, given the observed data and specified models.
#' Neither of these are the posterior probability. 
#' The true posterior probability (as given by Bayes Theorem) is 
#' the product of the likelihood and the prior probability, divided by
#' the likelihood of the model, the latter of which is very rarely known.
#' More commonly, the calculable portion of the posterior probability is
#' the product of the likelihood and the prior probability; or, here, easily
#' calculated as the log posterior probability, as the sum of the
#' log likelihood and log prior probability. Given confusion over application of 'MAP'
#' trees in previous version of paleotree, three options are available:
#' \code{"MAPosteriori"} for the Maximum A Posteriori tree (the MAP tree,
#' or the single tree in the posterior with the highest
#' posterior probability, as given by \code{LnPr + LnL}),
#' \code{"MAPriori"} for the Maximum A Priori tree (the tree in the
#' posterior sample with the highest prior probability, indep of
#' the data), and \code{"MaxLikelihood"}, the
#' tree with the highest model likelihood given the data, ignoring the prior probability.
#' The former option of \code{outputTrees = "MAP"} is deprecated,
#' as its previous implementation only examine \code{LnPr} and thus
#' returned the tree now referred to here as the \code{"MAPriori"} tree.
#' Interestingly, this bug had no effect when tip-dating methods is applied to 
#' datasets with no character matrix is provided (an empty matrix of '?' missing values is used)
#' in order to find dated phylogenies that maximize the fit to the dated tree prior, 
#' as the log likelihood for a tree with an empty matrix is \emph{always} zero, and thus the posterior
#' probability is always exactly identical to the prior probability. 
#' Overall, the Maximum A Posteriori tree is the "best" tree based
#' on the metric most directly considered by Bayesian analysis for
#' proposal acceptance, but the MCCT may be
#' the best tree for summarizing topological support.
#' In either case, point estimates of topology are often
#' problematic summaries for phylogenetic analyses.

#' @param file Filename (possibly with path) as a character string
#' leading to a file which will be overwritten with the output trees (or summary tree),
#' as a NEXUS file. If \code{NULL} (the default), the output will
#' instead be directly returned by this function.



#' @return
#' Depending on argument \code{file}, the output tree or trees is either
#' returned directly, or instead written out in NEXUS format via
#' ape's \code{write.NEXUS} function to an external file. The output
#' will consist either of multiple trees sampled from the post-burn-in posterior,
#' or will consist of a single phylogeny (a summary tree, either
#' the MCCT or the MAP - see the details for the argument \code{outputTrees}).
#' 
#' If the argument \code{setRootAges = TRUE} is not used,
#' users are warned that the resulting dated trees will
#' not have \code{$root.time} elements necessary
#' for comparison against an absolute time-scale. Wile the
#' trees may be scaled to units of absolute
#' time now, rather than with branch lengths expressed in
#' the rate of character change, the dates
#' estimated by some phylogenetics functions in 
#' R may give inaccurate estimates of when events
#' occur on the absolute time-scale if all tips are extinct.
#' This is because most functions for phylogenetics in R (and
#' elsewhere) will instead presume that the latest tip
#' will be at time 0 (the modern), which
#' may be wrong if you are using \code{paleotree} for
#' analyzing \emph{paleontological} datasets
#' consisting of entirely extinct taxa. This can be
#' solved by using argument \code{getFixedTimes = TRUE}
#' to obtain fixed tip ages, and then scaling the resulting output to absolute time using
#' the argument \code{setRootAges = TRUE}, which obtains
#' a \code{$root.time} element for each tree
#' using the functions \code{\link{setRootAge}} and 
#'\code{\link{setRootAges}} (for single and multiple phylogenies).
#' 


#' @seealso 
#' When the arguments \code{getFixedTimes = TRUE} and 
#' \code{setRootAges = TRUE} are used, the resulting output will be scaled to absolute time 
#' with the available fixed ages using functions \code{\link{setRootAge}}
#' and \code{\link{setRootAges}} (for single and multiple phylogenies).
#' This is only done if fixed ages are available and if the tree is not
#' being saved to an external file.
#' 
#' Maximum Clade Credibility trees are estimated using the function
#' \code{\link[phangorn]{maxCladeCred}} in package phangorn.
#' 
#' See function \code{link{tipDatingCompatabilitySummaryMrB}} for additional
#' ways of solely evaluating the topoligical information
#' in trees taken from MrBayes posterior samples.

#' @author 
#' David Bapst, with rescaling of raw output
#' trees via code originally written by Nicholas Crouch.

#' @examples
#' \dontrun{
#' 
#' MCCT <- obtainDatedPosteriorTreesMrB(
#'  	runFile = "C:\\myTipDatingAnalysis\\MrB_run_fossil_05-10-17.nex.run1.t",
#'  	nRuns = 2, 
#'  	burnin = 0.5,
#'  	outputTrees = "MCCT", 
#'  	file = NULL)
#' 
#' MAP <- obtainDatedPosteriorTreesMrB(
#'  	runFile = "C:\\myTipDatingAnalysis\\MrB_run_fossil_05-10-17.nex.run1.t",
#'  	nRuns = 2, 
#'  	burnin = 0.5, 
#'  	getFixedTimes = TRUE,
#'  	outputTrees = "MAPosteriori", 
#'  	file = NULL)
#' 
#' # get a root age from the fixed ages for tips
#' setRootAge(tree = MAP)
#' 
#' #pull a hundred trees randomly from the posterior
#' hundredRandomlySelectedTrees <- obtainDatedPosteriorTreesMrB(
#'  	runFile = "C:\\myTipDatingAnalysis\\MrB_run_fossil_05-10-17.nex.run1.t",
#'  	nRuns = 2, 
#'  	burnin = 0.5, 
#'  	getFixedTimes = TRUE,
#'  	getRootAges = TRUE,
#'  	outputTrees = 100, 
#'  	file = NULL)
#' 
#' 
#' }
#' 



#' @name obtainDatedPosteriorTreesMrB
#' @rdname obtainDatedPosteriorTreesMrB
#' @export
obtainDatedPosteriorTreesMrB <- function(
	runFile,nRuns = 2,burnin = 0.5,
	outputTrees,labelPostProb = FALSE,
	getFixedTimes = FALSE,
	getRootAges = FALSE,
	originalNexusFile = NULL,
	file = NULL){
	#
	#########################################################
	#checks
	if(length(outputTrees) != 1){
		stop("outputTrees must be of length 1")
		}
	if(all(
			outputTrees != c("all","MCCT","MAPosteriori","MAPriori","MaxLikelihood")
			)
		& !is.numeric(outputTrees)){
		########
		stop("outputTrees must be one of 'all', 'MCCT', or a numeric value indicating the number of trees to randomly sample from the posterior")
		}
	if(is.numeric(outputTrees)){	
		if(outputTrees<1){
			stop("If numeric, outputTrees must be greater than 0")
			}
		}
	if(burnin>1 | burnin<0){
		stop("burnin must be a value between 0 and 1")
		}
	if(getRootAges & !getFixedTimes){
		stop("Root ages cannot be set for output trees if fixed tip dates are not obtained")
		}
	if(getRootAges & !is.null(file)){
		stop("Root ages cannot be set when files are read to an external file format, please use file = NULL (the default)")
		}		
	#
	# Load tree, paramater and mcmc files produced by MrBayes
	#
	# # pick either a .p or .t file; can be either. Used for getting path. 
	# all needed .p and .t files must be in the same directory as this file
	#
	# take indicated file and get the basic run name
	runPath <- strsplit(runFile,split = "\\.run")[[1]]
	if(length(runPath) != 2){
		stop("Unable to parse the runPath correctly")
		}
	runPath <- runPath[[1]]
	# get list of tree files
	treeFiles <- lapply(1:nRuns,function(x){
		read.nexus(file = paste0(runPath,".run",x,".t"))}
		)
	# get list of .p files
	parFiles <- lapply(1:nRuns,function(x)
		read.table(file = paste0(runPath,".run",x,".p"),
			 header = T, skip = 1)
		)
	#############################
	# checks for length 
	for(i in 1:nRuns){
		if(length(treeFiles[[i]]) != nrow(parFiles[[i]])){
			stop("Parameter data and tree data not of the same length")
			}
		}
	####################################
	#	
	# @param getFixedTimes If \code{TRUE}, this function will also look for, scan, and parse an
	# associated NEXUS file. Ignoring any commented lines (ie. anything between "[   ]" ), commands
	# for fixing taxa will be identifiedd, parsed and returned to the user, either as a message
	# printed to the R console if output is read to a file, or as a attribute named 'fixed ages'
	# if output as an R object (formatted as a two-column table of OTU names and their respective fixed ages).
	# The search for the NEXUS file is controlled with argument \code{originalNexusFile}
	#
	# Please note: this has a while() loop in it for removing nested series of
	# square brackets (i.e. treated as comments in NEXUS files) then files with
	# extremely nested series of brackets may cause this code to take a while
	# to complete, or may even cause it to hang.
	#
	if(getFixedTimes){
		if(is.null(originalNexusFile)){
			# does runPath end with .nex already?
			endNex <- grepl("\\.nex$", runPath , ignore.case=TRUE)
			#
			# if the nexus file does not end with .nex, add it
			if(endNex){
				originalNexusFile <- runPath
			}else{
				originalNexusFile <- paste0(runPath,".nex")
				}			
			}
		fixedTable <- getMrBFixedAgesFromNexus(origNexusFile = originalNexusFile)
		if(nrow(fixedTable)==0 & getRootAges){
			stop("No fixed ages found for obtaining $root.time, cannot use argument getRootAges = TRUE")
			}
	}else{
		fixedTable <- NULL
		}
	####################
	#
	# Specify sample start based on burnin - here 50%
	sampleStart <- floor(length(treeFiles[[1]])*burnin)
	#
	# remove burnin from both
	treeFilesBurn <- lapply(treeFiles,function(x){
		startSamp <- floor(length(x)*burnin)
		x[startSamp:length(x)]
		}
	)
	parFilesBurn <- lapply(parFiles,function(x){
		startSamp <- floor(nrow(x)*burnin)
		x[startSamp:nrow(x),]
		}
	)
	# checks for length again
	for(i in 1:nRuns){
		if(length(treeFilesBurn[[i]]) != nrow(parFilesBurn[[i]])){
			stop("Parameter data and tree data not of the same length")
			}
		}
	###
	#
	# The branch lengths of the trees need to be scaled
		# by the estimated clock rate in the par files
	#
	# Function to rescale branch lengths of a phylogeny by clockrate
	rescaleMrBTree <- function(phy, rate){
		phy$edge.length <- phy$edge.length / rate$clockrate
		return(phy)
		}
	# now apply it
	rescaledTrees <- lapply(1:nRuns,function(x){ 
		lapply(1:length(treeFilesBurn[[x]]),function(y){
			rescaleMrBTree(treeFilesBurn[[x]][[y]],parFilesBurn[[x]][y,])
			})
		})
	#################################
	#
	# attach marginal likelihoods to each tree
	for(i in 1:nRuns){
		for(j in 1:length(rescaledTrees[[i]])){
			rescaledTrees[[i]][[j]]$LnPr <- parFilesBurn[[i]][j,]$LnPr
			rescaledTrees[[i]][[j]]$LnL <- parFilesBurn[[i]][j,]$LnL
			}
		}
	# concatanate trees from each run
	lumpTrees <- unlist(rescaledTrees,recursive = FALSE)
	# give class multiphylo
	attr(lumpTrees, "class") <- c("multiPhylo", class(lumpTrees))
	#
	##########################################
	#
	if(outputTrees == "all"){
		outTree <- lumpTrees
		if(labelPostProb){
			# assign posterior probabilities as node labels
			outTree <- lapply(outTree,getPosteriorProbabiities,postBurninTrees = lumpTrees)
			}
		}

	#
	if(outputTrees == "MAPriori"){
		# get Maximum A Priori tree
		LnPr <- sapply(lumpTrees,function(x) x$LnPr)
		whichMaxPrior <- which(LnPr == max(LnPr))
		if(length(whichMaxPrior)>1){
			warning(paste0(
				"Multiple trees found with identical maximum prior probability.\n",
				"May suggest overly flat priors. Returning the first found."
				))
			whichMaxPrior <- whichMaxPrior[1]
			}
		outTree <- lumpTrees[[whichMaxPrior]]
		if(labelPostProb){
			# assign posterior probabilities as node labels
			outTree <- getPosteriorProbabiities(tree = outTree,postBurninTrees = lumpTrees)
			}
		}
	#
	if(outputTrees == "MaxLikelihood"){
		# get Maximum Model Likelihood tree
		LnL <- sapply(lumpTrees,function(x) x$LnL)
		#print(LnL)
		#print(max(LnL))
		whichMaxLikelihood <- which(LnL == max(LnL))
		if(length(whichMaxLikelihood)>1){
			warning(paste0(
				"Multiple trees found with identical maximum likelihood values.\n",
				"May suggest overly flat likelihood surfaces. Returning the first found."
				))
			whichMaxLikelihood <- whichMaxLikelihood[1]
			}
		if(all(LnL == 0)){
			message(paste0(
				"All log likelihood values are 0, suggesting an empty character matrix was used.\n",
				"Posterior probability is thus exactly identical to prior probability."))
			}
		outTree <- lumpTrees[[whichMaxLikelihood]]
		if(labelPostProb){
			# assign posterior probabilities as node labels
			outTree <- getPosteriorProbabiities(tree = outTree,postBurninTrees = lumpTrees)
			}
		}		
	#
	if(outputTrees == "MAPosteriori"){
		# get Maximum A Posteriori tree
		#
		# get posterior probability
		LnL <- sapply(lumpTrees,function(x) x$LnL)
		LnPr <- sapply(lumpTrees,function(x) x$LnPr)
		# Posterior Prob = Prior Prob * Likelihood(data) / likelihood (model)
			# likelihood (model) is unknown (but is constant!)
		# the log posterior probability would be the sum of LnL and LnPr
		#print(LnPr)
		#print(LnL)
		LnPost <- LnPr + LnL
		#
		whichMaxPosterior <- which(LnPost == max(LnPost))
		if(length(whichMaxPosterior)>1){
			warning(paste0(
				"Multiple trees found with identical maximum posterior probability.\n",
				"May suggest overly flat priors or flat likelihood surfaces. Returning the first found."
				))
			whichMaxPosterior <- whichMaxPosterior[1]
			}
		if(all(LnL == 0)){
			message(paste0(
				"All log likelihood values are 0, suggesting an empty character matrix was used.\n",
				"Posterior probability is thus exactly identical to prior probability."))
			}
		outTree <- lumpTrees[[whichMaxPosterior]]
		if(labelPostProb){
			# assign posterior probabilities as node labels
			outTree <- getPosteriorProbabiities(tree = outTree,postBurninTrees = lumpTrees)
			}
		}		
	#
	if(outputTrees == "MCCT"){
		# turns out the MCCT isn't the tree with the highest likelihood
		outTree <- phangorn::maxCladeCred(lumpTrees)
		if(labelPostProb){
			# assign posterior probabilities as node labels
			outTree <- getPosteriorProbabiities(tree = outTree,postBurninTrees = lumpTrees)
			}
		}
	#
	if(is.numeric(outputTrees)){
		# randomly sample N trees from the posterior
		# first check
		if(outputTrees>length(lumpTrees)){
			stop(paste0("Numeric value for outputTrees (",outputTrees,
				") greater than the total number of post-burning trees (",
				length(lumpTrees),")"))
			}
		whichOutput <- sample(length(lumpTrees),outputTrees,replace = FALSE)
		outTree <- lumpTrees[whichOutput]
		if(labelPostProb){
			# assign posterior probabilities as node labels
			if(outputTrees>1){
				outTree <- lapply(outTree,getPosteriorProbabiities,postBurninTrees = lumpTrees)
			}else{
				outTree <- getPosteriorProbabiities(tree = outTree,postBurninTrees = lumpTrees)
				}
			}
		}
	########################################
	if(!is.null(file)){
		write.nexus(outTree, file = file)
		if(!is.null(fixedTable)){
			print("You may need to know which tip ages were fixed to a precise date")
			print("in order to assign absolute dates to the tree, as follows:")
			print(fixedTable)
			}
	}else{
		if(!is.null(fixedTable)){
			attr(outTree,"fixedTable") <- fixedTable
			#message("For use in R with paleotree and other packages, you may want to set the root age")
			if(is(outTree,"multiPhylo")){
				#message("Simply use function setRootAges() next")
				outTree<-setRootAges(outTree)
			}else{
				#message("Simply use function setRootAge() next")
				outTree<-setRootAge(outTree)
				}
			}
		return(outTree)
		}
	}

getPosteriorProbabiities <- function(tree,postBurninTrees){
	cladeFreq <- prop.clades(tree,postBurninTrees)
	postProbs <- cladeFreq/length(postBurninTrees)
	tree$node.label <- postProbs
	return(tree)
	}