paleotree: Paleontological and Phylogenetic Analyses of Evolution

Documented in occData2timeList

#' Converting Occurrences Data to a \code{timeList} Data Object
#' 
#' This function converts occurrence data, given as a list where each element
#' is a different taxon's occurrence table (containing minimum and maximum ages
#' for each occurrence), to the \code{timeList} format, consisting of a list composed
#' of a matrix of lower and upper age bounds for intervals, and a second matrix
#' recording the interval in which taxa first and last occur in the given dataset.

#' @details
#' This function should translate taxon-sorted occurrence data, which could be Paleobiology Database
#' datasets sorted by \code{\link{taxonSortPBDBocc}} or any data object where occurrence data
#' (i.e. age bounds for each occurrence) for different taxa is separated into different elements
#' of a named list. 
#' 
#' \subsection{The Usage of the Argument \code{intervalType}}{

#' 
#' The argument \code{intervalType} controls the algorithm used for obtain first and last interval bounds for
#' each taxon, of which there are several options for \code{intervalType} to select from:
#' \describe{

#'  \item{\code{"dateRange"}}{The default option. The bounds on the first appearances
#' are the span between the oldest upper and lower bounds
#' of the occurrences, and the bounds on the last appearances are the span between the youngest
#' upper and lower bounds across all occurrences. This is guaranteed to provide the smallest
#' bounds on the first and last appearances, and was originally suggested to the author by J. Marcot.}

#' \item{\code{"occRange"}}{This option returns the smallest bounds among (a) the oldest occurrences for the
#' first appearance (i.e. all occurrences with their lowest bound at the oldest lower age bound), and (b) the
#' youngest occurrences for the last appearance (i.e. all occurrences with their uppermost bound
#' at the youngest upper age bound).}

#' \item{\code{"zoneOverlap"}}{This option is an attempt to mimic the stratigraphic range algorithm used by PBDB Classic
#' which "finds the oldest base that is older than at least part of all the intervals and the
#' youngest that is younger than at least part of all the intervals" (personal communication, J. Alroy). 
#' This is a somewhat more complex case as we are trying to obtain a \code{timeList} object.
#' So, for calculating the bounds of the first interval a taxon occurs in, the \code{zoneOverlap}
#' algorithm looks for all occurrences that overlap with the age range of the earliest-most occurrence
#' and (1) obtains their earliest boundary ages and returns the latest-most earliest age boundary among
#' these overlapping occurrences and (2) obtains their latest boundary ages and returns the earliest-most
#' latest age boundary among these overlapping occurrences. Similarly, for calculating the bound of the
#' last interval a taxon occurs in, the \code{zoneOverlap} algorithm looks for all occurrences that overlap
#' with the age range of the latest-most occurrence and (1) obtains their earliest boundary ages and returns
#' the latest-most earliest age boundary among these overlapping occurrences and (2) obtains their latest
#' boundary ages and returns the earliest-most latest age boundary among these overlapping occurrences. 
#' 
#' On theoretical grounds, one could probably describe the zone-of-overlap algorithm as minimizing
#' taxonomic age ranges by assuming that all overlapping occurrences at the start and end of a taxon's
#' range probably describe a very similar first and last appearance (FADs and LADs), and thus picks the
#' occurrence with bounds that extends the taxonomic range the least. However, this does come with a downside
#' that if these occurrences are not essentially repeated attempts to capture the same FAD or LAD, then the
#' zone-of-overlap algorithm is not an accurate depiction of the uncertainty in the ages. The true biological
#' range of a taxon might be well outside the bounds obtained using the zone-of-overlap algorithm. A more
#' conservative approach is the \code{"dateRange"} algorithm which finds the smallest possible bounds on the
#' endpoints of a taxon's range without ignoring uncertainty from any particular set of occurrences.} }
#' 
#' }
#' 

#' @param occList A list where every element is a table of occurrence data for a different taxon,
#' such as that returned by \code{\link{taxonSortPBDBocc}}. The occurrence data can be either a 
#' two-column matrix composed of the lower and upper age bounds on each taxon occurrence, or has
#' two named variables which match any of the field names given by the PBDB API under either
#' the \code{'pbdb'} vocab or \code{'com'} (compact) vocab for early and late age bounds.

#' @param intervalType Must be either \code{"dateRange"} (the default), \code{"occRange"} or
#' \code{"zoneOverlap"}. Please see details below.

#' @return
#' Returns a standard \code{timeList} data object, as used by
#' many other \code{paleotree} functions, like
#' \code{\link{bin_timePaleoPhy}}, \code{\link{bin_cal3TimePaleoPhy}}
#' and \code{\link{taxicDivDisc}}

#' @seealso
#' Occurrence data as commonly used with \code{paleotree} functions can
#' be obtained with \code{link{getPBDBocc}}, and sorted into taxa by 
#' \code{\link{taxonSortPBDBocc}}, and further explored with this function and
#' \code{\link{plotOccData}}. Also, see the example graptolite dataset
#' at \code{\link{graptPBDB}}

#' @author 
#' David W. Bapst, with the \code{"dateRange"} algorithm suggested by Jon Marcot.

#' @examples
#' data(graptPBDB)
#' 
#' graptOccSpecies <- taxonSortPBDBocc(
#'    data = graptOccPBDB,
#'    rank = "species",
#'    onlyFormal = FALSE)
#' graptTimeSpecies <- occData2timeList(occList = graptOccSpecies)
#' 
#' head(graptTimeSpecies[[1]])
#' head(graptTimeSpecies[[2]])
#' 
#' graptOccGenus <- taxonSortPBDBocc(
#'    data = graptOccPBDB,
#'    rank = "genus",
#'    onlyFormal = FALSE
#'    )
#' graptTimeGenus <- occData2timeList(occList = graptOccGenus)
#' 
#' layout(1:2)
#' taxicDivDisc(graptTimeSpecies)
#' taxicDivDisc(graptTimeGenus)
#' 
#' # the default interval calculation is "dateRange"
#' # let's compare to the other option, "occRange"
#'    # but now for graptolite *species*
#' 
#' graptOccRange <- occData2timeList(
#'    occList = graptOccSpecies, 
#'    intervalType = "occRange"
#'    )
#' 
#' #we would expect no change in the diversity curve
#'    #because there are only changes in th
#'        #earliest bound for the FAD
#'        #latest bound for the LAD
#' #so if we are depicting ranges within maximal bounds
#'    #dateRanges has no effect
#' layout(1:2)
#' taxicDivDisc(graptTimeSpecies)
#' taxicDivDisc(graptOccRange)
#' #yep, identical!
#' 
#' #so how much uncertainty was gained by using dateRange?
#' 
#' # write a function for getting uncertainty in first and last
#'    # appearance dates from a timeList object
#' sumAgeUncert <- function(timeList){
#'    fourDate <- timeList2fourDate(timeList)
#'    perOcc <- (fourDate[,1] - fourDate[,2]) +
#'        (fourDate[,3] - fourDate[,4])
#'    sum(perOcc)
#'    }
#' 
#' #total amount of uncertainty in occRange dataset
#' sumAgeUncert(graptOccRange)
#' #total amount of uncertainty in dateRange dataset
#' sumAgeUncert(graptTimeSpecies)
#' #the difference
#' sumAgeUncert(graptOccRange) - sumAgeUncert(graptTimeSpecies)
#' #as a proportion
#' 1 - (sumAgeUncert(graptTimeSpecies) / sumAgeUncert(graptOccRange))
#' 
#' #a different way of doing it
#' dateChange <- timeList2fourDate(graptTimeSpecies) - 
#'     timeList2fourDate(graptOccRange)
#' apply(dateChange, 2, sum)
#' #total amount of uncertainty removed by dateRange algorithm
#' sum(abs(dateChange))
#' 
#' layout(1)
#' 

#' @name occData2timeList
#' @rdname occData2timeList
#' @export
occData2timeList <- function(occList, intervalType = "dateRange"){
		#intervalType = "dateRange"
	#the following is all original, though inspired by paleobioDB code
	#check intervalType
	if(!any(intervalType == c("occRange", "dateRange", "zoneOverlap"))){
		stop("intervalType must be one of 'dateRange' or 'occRange' or 'zoneOverlap'")}
	#get just occurrence data with pullOccListData
	taxaInt <- pullOccListData(occList)
	if(intervalType == "occRange"){
		#get earliest and latest intervals the taxa appear in
		taxaMin <- lapply(taxaInt,function(x) x[x[,1] == max(x[,1]),,drop = FALSE])
		taxaMax <- lapply(taxaInt,function(x) x[x[,2] == min(x[,2]),,drop = FALSE])
		#get smallest intervals
		taxaMin <- t(sapply(taxaMin,function(x) if(nrow(x)>1){
			x[which((-apply(x,1,diff)) == min(-apply(x,1,diff)))[1],]
			}else{x}))
		taxaMax <- t(sapply(taxaMax,function(x) if(nrow(x)>1){
			x[which((-apply(x,1,diff)) == min(-apply(x,1,diff)))[1],]
			}else{x}))
		#transpose and remove weird list attribute
		#taxaMin <- t(taxaMin)
		#taxaMax <- t(taxaMax)
		taxaMin <- cbind(unlist(taxaMin[,1]),unlist(taxaMin[,2]))
		taxaMax <- cbind(unlist(taxaMax[,1]),unlist(taxaMax[,2]))
		}
	if(intervalType == "dateRange"){
		#this algorithm suggested by Jon Marcot
			#should be smallest possible interval for FAD and LAD
		#get the earliest early age and earliest latest age as taxaMin (FAD)
		taxaMin <- t(sapply(taxaInt,function(x) apply(x,2,max)))
		#get the earliest early age and earliest latest age as taxaMax (LAD)
		taxaMax <- t(sapply(taxaInt,function(x) apply(x,2,min)))
		}
	if(intervalType == "zoneOverlap"){
		#emulates behavior of PBDB Classic by John Alroy
		#get earliest and latest intervals the taxa appear in
			#these will be set of occurrences that overlap with min/max occs
		taxaMin <- lapply(taxaInt,function(x) x[x[,1] == max(x[,1]),,drop = FALSE])
		taxaMax <- lapply(taxaInt,function(x) x[x[,2] == min(x[,2]),,drop = FALSE])
		#find the latest early age and earliest latest age for occs that overlap
		taxaMin <- t(sapply(taxaMin,function(x) c(min(x[,1]),max(x[,2]))))
		taxaMax <- t(sapply(taxaMax,function(x) c(min(x[,1]),max(x[,2]))))		
		}	
	fourDate <- cbind(taxaMin,taxaMax)
	timeList <- fourDate2timeList(fourDate)
	rownames(timeList$taxonTimes) <- names(occList)
	return(timeList)
	}

	
# hidden function	
pullOccListData <- function(occList){
	#need checks
	#is occList a list
	if(!is.list(occList)){stop("occList is not a list?")}
	#are all elements of occList matrices
	if(!(all(sapply(occList,inherits,what = "data.frame")) | all(sapply(occList,inherits,what = "matrix")))){
		stop("All elements of occList must be all of type data.frame or type matrix")}
	#pull first list entry as an example to check with
	exOcc <- occList[[1]]
	#test if all occList entries have same number of columns
	if(!all(sapply(occList,function(x) ncol(x) == ncol(exOcc)))){
		stop("Not all occList entries have same number of columns?")}
	#will assume all data given in this manner either has columns named "" and ""
		#or has only two columns
	ageSelector <- NULL
	if(any(colnames(exOcc) == "early_age") & any(colnames(exOcc) == "late_age")){
		ageSelector <- c("early_age","late_age")
		}
	if(any(colnames(exOcc) == "eag") & any(colnames(exOcc) == "lag")){
		ageSelector <- c("eag","lag")
		}
	if(any(colnames(exOcc) == "max_ma") & any(colnames(exOcc) == "min_ma")){
		ageSelector <- c("max_ma","min_ma")
		}		
	if(is.null(ageSelector)){
		if(ncol(exOcc) != 2){
			ageSelector <- 1:2
		}else{
			stop("Data is not a two-column matrix of ages *and* data does not appear to be a data frame with named age columns (from the PBDB)")
			}
		}
	#get intervals in which taxa appear
	taxaOcc <- lapply(occList,function(x) x[,ageSelector])
	return(taxaOcc)
	}