R/getEventData.R

Defines functions getEventData

Documented in getEventData

##' @title Create \code{bammdata} object from MCMC output
##'
##' @description \code{getEventData} Reads shift configuration data (the
##'     "event data" output) from a \code{BAMM} analysis and creates a
##'     \code{bammdata} object. The \code{bammdata} object is fundamental
##'     for extracting information about macroevolutionary rate variation
##'     through time and among lineages.
##'
##' @param phy An object of class \code{phylo} - specifically, the
##'     time-calibrated tree that was analyzed with \code{BAMM}.
##'     Alternatively, a character string specifying the path to a
##'     newick-formatted tree.
##' @param eventdata A character string specifying the path to a \code{BAMM}
##'     event-data file. Alternatively, an object of class \code{data.frame}
##'     that includes the event data from a \code{BAMM} run.
##' @param burnin A numeric indicating the fraction of posterior samples to
##'     discard as burn-in.
##' @param nsamples An integer indicating the number of posterior samples to
##'     include in the \code{bammdata} object. May be \code{NULL}.
##' @param verbose A logical. If \code{TRUE} progess is outputted to the
##'     console. Defaults to \code{FALSE}.
##' @param type A character string. Either "diversification" or "trait"
##'     depending on your \code{BAMM} analysis.
##'
##' @details In the \code{BAMM} framework, an "event" defines a
##'     macroevolutionary process of diversification or trait evolution. Every
##'     sample from the posterior includes at least one process, defined by
##'     such an "event". If a given sample includes just a single event, then
##'     the dynamics of diversification or trait evolution can be described
##'     entirely by a single time-constant or time-varying process that begins
##'     at the root of the tree. Any sample from the posterior distribution
##'     may include a complex mixture of distinct processes. To represent
##'     temporal heterogeneity in macroevolutionary rates, \code{BAMM} models
##'     a rate \eqn{R}, e.g. speciation, as a function that changes
##'     exponentially with time:
##'
##'     \eqn{R(t) = R(0)*exp(b*t)}.
##'
##'     Here \eqn{R(0)} is the initial rate and \eqn{b} is a parameter
##'     determining how quickly that rate grows or decays with time. 
##'
##'     The \code{eventdata} file (or data frame) is a record of events and
##'     associated parameters that were sampled with \code{BAMM} during
##'     simulation of the posterior with reversible jump MCMC. This complex,
##'     information-rich file is processed into a \code{bammdata} object,
##'     which serves as the core data object for numerous downstream analyses.
##'     From a \code{bammdata} object, you can summarize rate variation
##'     through time, among clades, extract locations of rate shifts,
##'     summarize clade-specific rates of speciation and extinction, and more.
##'
##'     In general, the user does not need to be concerned with the details of
##'     a \code{bammdata} object. The object is used as input by a number of
##'     \code{BAMMtools} functions. 
##'
##'     The parameter \code{nsamples} can be used to reduce the total amount
##'     of data included in the raw eventdata output from a \code{BAMM} run.
##'     The final \code{bammdata} object will consist of all data for
##'     \code{nsamples} from the posterior. These \code{nsamples} are equally
##'     spaced after discarding some \code{burnin} fraction as "burn-in". If
##'     \code{nsamples} is set to \code{NULL}, the \code{bammdata} object will
##'     include all samples in the posterior after discarding the
##'     \code{burnin} fraction.
##'
##' @return A list with many components:
##' \itemize{
##'     \item edge: See documentation for class \code{phylo} in package ape.
##'     \item Nnode: See documentation for class \code{phylo} in package
##'         ape.
##'     \item tip.label: See documentation for class \code{phylo} in package
##'         ape.
##'     \item edge.length: See documentation for class \code{phylo} in
##'         package ape.
##'     \item begin: The beginning time of each branch in absolute time (the
##'         root is set to time zero)
##'     \item end: The ending time of each branch in absolute time.
##'     \item numberEvents: An integer vector with the number of events
##'         contained in \code{phy} for each posterior sample. The length of
##'         this vector is equal to the number of posterior samples in the
##'         \code{bammdata} object.
##'     \item eventData: A list of dataframes. Each element is a single
##'         posterior sample. Each row in a dataframe holds the data for a
##'         single event. Data associated with an event are: \code{node} - a
##'         node number. This identifies the branch where the event
##'         originates. \code{time} - this is the absolute time on that branch
##'         where the event originates (with the root at time 0). \code{lam1}
##'         - an initial rate of speciation or trait evolution. \code{lam2} -
##'         a decay/growth parameter. \code{mu1} - an initial rate of
##'         extinction. \code{mu2} - a decay/growth parameter. \code{index} -
##'         a unique integer associated with the event. See 'Details'.
##'     \item eventVectors: A list of integer vectors. Each element is a
##'         single posterior sample. For each branch in \code{phy} the index
##'         of the event that occurs along that branch. Branches are ordered
##'         increasing here and elsewhere.
##'     \item eventBranchSegs: A list of matrices. Each element is a single
##'         posterior sample. Each matrix has four columns: \code{Column 1}
##'         identifies a node in \code{phy}. \code{Column 2} identifies the
##'         beginning time of the branch or segment of the branch that
##'         subtends the node in \code{Column 1}. \code{Column 3} identifies
##'         the ending time of the branch or segment of the branch that
##'         subtends the node in \code{Column 1}. \code{Column 4} identifies
##'         the index of the event that occurs along the branch or segment of
##'         the branch that subtends the node in \code{Column 1}.
##'     \item tipStates: A list of integer vectors. Each element is a single
##'         posterior sample. For each tip the index of the event that occurs
##'         along the branch subtending the tip. Tips are ordered increasing
##'         here and elsewhere.
##'     \item tipLambda: A list of numeric vectors. Each element is a single
##'         posterior sample. For each tip the rate of speciation or trait
##'         evolution at the end of the terminal branch subtending that tip.
##'     \item tipMu: A list of numeric vectors. Each element is a single
##'         posterior sample. For each tip the rate of extinction at the end
##'         of the terminal branch subtending that tip. Meaningless if working
##'         with \code{BAMM} trait results.
##'     \item meanTipLambda: For each tip the mean of the marginal posterior
##'         density of the rate of speciation or trait evolution at the end of
##'         the terminal branch subtending that tip.
##'     \item meanTipMu: For each tip the mean of the marginal posterior
##'         density of the rate of extinction at the end of the terminal
##'         branch subtending that tip. Meaningless if working with
##'         \code{BAMM} trait results.
##'     \item type: A character string. Either "diversification" or "trait"
##'         depending on your \code{BAMM} analysis.
##'     \item downseq: An integer vector holding the nodes of \code{phy}. The
##'         order corresponds to the order in which nodes are visited by a
##'         pre-order tree traversal.
##'     \item lastvisit: An integer vector giving the index of the last node
##'         visited by the node in the corresponding position in
##'         \code{downseq}. \code{downseq} and \code{lastvisit} can be used to
##'         quickly retrieve the descendants of any node. e.g. the descendants
##'         of node 89 can be found by
##'         \code{downseq[which(downseq==89):which(downseq==lastvisit[89])}.
##' }
##'
##' @note Currently the function does not check for duplicate tip labels in
##'     \code{phy}, which may cause the function to choke.
##'
##' @author Dan Rabosky, Mike Grundler
##'
##' @seealso \code{\link{summary.bammdata}}, \code{\link{plot.bammdata}},
##'     \code{\link{dtRates}}.
##' 
##' @references \url{http://bamm-project.org/}
##'
##' @examples
##' data(primates, events.primates)
##' xx <- getEventData(primates, events.primates, burnin=0.25, nsamples=500,
##'                    type = 'trait')
##' 
##' # compute mean phenotypic rate for primate body size evolution:
##' brates <- getCladeRates(xx)
##' mean(brates$beta)
##' 
##' # Plot rates:
##' plot(xx)
##' @keywords models
##' @export
getEventData <- function(phy, eventdata, burnin=0, nsamples = NULL, verbose=FALSE, type = 'diversification')
{	
	if (type != 'diversification' & type != 'trait') {
		stop("Invalid 'type' specification. Should be 'diversification' or 'trait'");
	}
	
	if (inherits(phy, 'character')) {
		phy <- read.tree(phy);
	}
	
	phy$node.label <- NULL;
	
	# check that tree is binary and rooted.
	if (!is.binary.phylo(phy)) {
		stop('phy is not completely bifurcating.')
	}
	if (!is.rooted.phylo(phy)) {
		stop('phy is not rooted.')
	}
	
	if (any(is.null(c(phy$begin, phy$end)))) {
		phy <- getStartStopTimes(phy)
	}
	
# Getting branching times direct from
#		the begin and end components of phy
#		should be able to now handle non-ultrametric trees.
	
	maxbt <- max(phy$end)
	nodes <- (length(phy$tip.label) + 1):(2*length(phy$tip.label) - 1)
	bt <- numeric(length(nodes))
	names(bt) <- nodes
	for (i in 1:length(bt)){
		tt <- phy$begin[phy$edge[,1] == nodes[i]][1]
		bt[i] <- maxbt - tt
	}
	
	
	########
	
	if (inherits(eventdata, 'data.frame')) {
		cat("Processing event data from data.frame\n");
		uniquegens <- sort(unique(eventdata[,1]));
	} 
	else if (inherits(eventdata, 'character')) {
		cat("Reading event datafile: ", eventdata, "\n\t\t...........");
		eventdata <- read.csv(eventdata, header=TRUE, stringsAsFactors=FALSE);
 		uniquegens <- sort(unique(eventdata[,1]));
 		cat("\nRead a total of", length(uniquegens), "samples from posterior\n");				
	} 
	else {
		err.string = c('eventdata arg invalid\n\nType is ', class(eventdata), '\n', sep='');
		stop(err.string);
	}
 	
 	samplestart <- uniquegens[floor(burnin*length(uniquegens))];
 	if (!length(samplestart)) {
 		samplestart <- 0;
 	}
 	uniquegens <- uniquegens[uniquegens >= samplestart];
 
 	if (is.null(nsamples)) {
 		nsamples <- length(uniquegens);
 	} 
 	else if (nsamples > length(uniquegens)) {
 		nsamples <- length(uniquegens);
 	}
	
	eventVectors <- vector("list",nsamples);
	eventData <- vector("list",nsamples);
	tipStates <- vector("list",nsamples);
	eventBranchSegs <- vector("list",nsamples);
	
	tipLambda <- vector("list",nsamples);
	tipMu <- vector("list",nsamples);
	
	goodsamples <- uniquegens[seq.int(1, length(uniquegens), length.out=nsamples)];
 	 
 	cat('\nDiscarded as burnin: GENERATIONS < ', goodsamples[1]);
 	cat("\nAnalyzing ", length(goodsamples), " samples from posterior\n");

 	numberEvents <- length(goodsamples); # vector to hold number of events per sample
 
 	cat('\nSetting recursive sequence on tree...\n');
 	phy <- getRecursiveSequence(phy);
 	cat('\nDone with recursive sequence\n\n');
 
	######### Get ancestors for each pair of taxa
	if (verbose) {
		cat("Start preprocessing MRCA pairs....\n");
	}	
		
	x2 <- eventdata[eventdata$generation %in% goodsamples, ];
		
	uniquePairSet <- matrix(NA, nrow=nrow(x2), ncol=2);	
	uniquePairNode <- numeric(nrow(x2));
	
	uniquePairSet[,1] <- as.integer(match(x2$leftchild, phy$tip.label));
	uniquePairSet[,2] <- as.integer(match(x2$rightchild, phy$tip.label, nomatch = 0L));
	uniquePairNode <- getmrca(phy, uniquePairSet[,1],uniquePairSet[,2]);
	
	if (verbose) {
		cat("Done preprocessing MRCA pairs....\n");
	}	

	####### Done with risky sstuff
	
	meanTipMu <- numeric(length(phy$tip.label));
	
 	meanTipLambda <- numeric(length(phy$tip.label)); 
 
 	stoptime <- maxbt;
 	
 	for (i in 1:length(goodsamples)) {
  		tmpEvents <- x2[x2[,1] == goodsamples[i], ];
		
		if (verbose) cat('Processing event: ', i, '\n');		
 
 		tm <- as.numeric(tmpEvents[,4]); # abs time of event
 		lam1 <- as.numeric(tmpEvents[,5]); # lambda parameter 1
 		lam2 <- as.numeric(tmpEvents[,6]); # lambda parameter 2
 		if (type == 'diversification') {	
			mu1 <- try(as.numeric(tmpEvents[, 7]),silent=TRUE); # mu parameter 1
			if (inherits(mu1,"try-error")) {
				stop("Unidentified column in event data file. Maybe try setting argument 'type = trait'");
			}
 			mu2 <- as.numeric(tmpEvents[, 8]); #mu parameter 2 
		} 
		else { #for bamm trait data we set the mu columns to zero because those params don't exist	
			mu1 <- rep(0, nrow(tmpEvents)); 
 			mu2 <- rep(0, nrow(tmpEvents)); 
		}	
		
 		# Get subtending node for each event:
 		nodeVec <- uniquePairNode[x2[,1] == goodsamples[i]];
 		
 		if (sum(nodeVec == 0)  > 0) {
			stop('Failed to assign event to node\n');
		}
		
		# make a dataframe:
		dftemp <- data.frame(node=nodeVec, time=tm, lam1=lam1, lam2=lam2, mu1=mu1, mu2=mu2, stringsAsFactors=FALSE);
		
		dftemp <- dftemp[order(dftemp$time), ];
		dftemp$index <- 1:nrow(dftemp);
		rownames(dftemp) <- NULL;
		
		statevec <- rep(1, nrow(phy$edge));

		if (nrow(dftemp) > 1) {
			for (k in 2:nrow(dftemp)) {
				s1 <- which(phy$downseq == dftemp$node[k]);
				s2 <- which(phy$downseq == phy$lastvisit[dftemp$node[k]]);
				descSet <- phy$downseq[s1:s2];
				isDescendantNode <- phy$edge[,2] %in% descSet;				
				statevec[isDescendantNode] <- k;
			}				
		}

 		tmpEventSegMat <- matrix(0, nrow=(max(phy$edge) + nrow(dftemp) - 2), ncol=4);
 		
		#non.root <- c(1:length(phy$tip.label), (length(phy$tip.label)+2):max(phy$edge));
		non.root <- c(1:length(phy$tip.label), unique(phy$edge[,1]));
		non.root <- non.root[-match(length(phy$tip.label)+1, non.root)];
		pos <- 1;	
		
		is_noEventBranch = !(phy$edge[,2] %in% dftemp$node);
		
		if (sum(is_noEventBranch) > 0) {
			
			tmpEventSegMat[1:sum(is_noEventBranch), 1] <- phy$edge[,2][is_noEventBranch];
			tmpEventSegMat[1:sum(is_noEventBranch), 2] <- phy$begin[is_noEventBranch];
	 		tmpEventSegMat[1:sum(is_noEventBranch), 3] <- phy$end[is_noEventBranch];
	 		tmpEventSegMat[1:sum(is_noEventBranch), 4] <- statevec[is_noEventBranch];
	 			
		} else {
			tempEventSegMat <- cbind(phy$edge[, 2], phy$begin, phy$end, statevec);
		}
 		
		eventnodeset <- intersect(non.root, dftemp$node);
		pos <- 1 + sum(is_noEventBranch);
		for (k in eventnodeset) {
			events.on.branch <- dftemp[dftemp$node == k, ];
			events.on.branch <- events.on.branch[order(events.on.branch$time), ];
				
			fBranch <- phy$edge[,2] == k;
 			start.times <- c(phy$begin[fBranch], events.on.branch$time);
			stop.times <- c(events.on.branch$time, phy$end[fBranch]);
			parent <- phy$edge[,1][phy$edge[,2] == k];
			if (parent == (length(phy$tip.label) + 1)) {
				# Parent is root:
				proc.set <- c(1, events.on.branch$index);	
			} else {
				proc.set <- c(statevec[phy$edge[,2] == parent], events.on.branch$index);			
			}
				
 			zzindex = pos:(pos + nrow(events.on.branch));	
				
			tmpEventSegMat[zzindex, 1] <- rep(k, length(zzindex));
			tmpEventSegMat[zzindex, 2] <- start.times;
			tmpEventSegMat[zzindex, 3] <- stop.times;
			tmpEventSegMat[zzindex, 4] <- proc.set;		
			pos <- pos + 1 + nrow(events.on.branch);
		}
		
 		tmpEventSegMat <- tmpEventSegMat[order(tmpEventSegMat[,1]), ];
 	
 		eventBranchSegs[[i]] <- tmpEventSegMat;

		tipstates <- numeric(length(phy$tip.label));
		tipstates <- statevec[phy$edge[,2] <= phy$Nnode + 1];
		tipstates <- tipstates[order(phy$edge[phy$edge[,2] <= phy$Nnode + 1, 2])];
		
 		### Compute tip rates:
		
		#tiplam <- dftemp$lam1[tipstates] * exp(dftemp$lam2[tipstates] * (stoptime - dftemp$time[tipstates]));
		tiplam <- exponentialRate(stoptime - dftemp$time[tipstates], dftemp$lam1[tipstates], dftemp$lam2[tipstates]);
		tipmu <- dftemp$mu1[tipstates];
		
		meanTipMu <- meanTipMu + tipmu/nsamples;
		meanTipLambda <- meanTipLambda + tiplam/nsamples;
		
		### List assignments and metadata across all events:
		eventData[[i]] <- dftemp;	
		eventVectors[[i]] <- statevec;
		numberEvents[i] <- nrow(dftemp);
		tipStates[[i]] <- tipstates;
		
		tipLambda[[i]] <- tiplam;
		tipMu[[i]] <- tipmu;	
 	}
 	
	phy$numberEvents <- numberEvents;
	phy$eventData <- eventData;
	phy$eventVectors <- eventVectors;
	phy$tipStates <- tipStates;
	phy$tipLambda <- tipLambda;
	phy$meanTipLambda <- meanTipLambda;
	phy$eventBranchSegs <- eventBranchSegs; 	
	phy$tipMu <- tipMu;
	phy$meanTipMu = meanTipMu;
	if (type == 'diversification') {
		phy$type <- 'diversification';
	} 
	else {
		phy$type <- 'trait';	
	}
 	
	# adds new class: 'bammdata'
	class(phy) <- 'bammdata';
	return(phy);
}

Try the BAMMtools package in your browser

Any scripts or data that you put into this service are public.

BAMMtools documentation built on Sept. 13, 2024, 1:10 a.m.