sampleRanges: Sampling Taxon Ranges
In paleotree: Paleontological and Phylogenetic Analyses of Evolution

sampleRanges

R Documentation

Sampling Taxon Ranges

Description

A function for simulating the effect of incomplete sampling of the fossil record.

Usage

sampleRanges(
  taxaData,
  r,
  alpha = 1,
  beta = 1,
  rTimeRatio = 1,
  modern.samp.prob = 1,
  min.taxa = 2,
  ranges.only = TRUE,
  minInt = 0.01,
  merge.cryptic = TRUE,
  randLiveHat = TRUE,
  alt.method = FALSE,
  plot = FALSE
)

Arguments

`taxaData`	A two-column matrix of per-taxon ranges. The five-column matrix output of `simFossilRecord`, post transformation with `fossilRecord2fossilTaxa` can also be supplied, which will be common in simulation usages.
`r`	Instantaneous average sampling rate per lineage time units; given as a vector of `length = 1` or `length` equal to the number of taxa.
`alpha`	Alpha parameter of beta distribution; given as a vector of `length = 1` or `length` equal to the number of taxa.
`beta`	Beta parameter of beta distribution; given as a vector of `length = 1` or `length` equal to the number of taxa.
`rTimeRatio`	Ratio of most recent sampling rate over earliest sampling rate; given as a vector of `length = 1` or `length` equal to the number of taxa.
`modern.samp.prob`	Probability of sampling living taxa at the present day (time = 0), see below.
`min.taxa`	Minimum number of taxa sampled. The default is 2.
`ranges.only`	If `TRUE`, gives taxon first and last occurrences only. If `FALSE`, gives the time of all sampling events as a list.
`minInt`	Minimum interval size used for simulating complex models. See details.
`merge.cryptic`	If `TRUE`, sampling events for cryptic species will be merged into one taxon.
`randLiveHat`	If `TRUE`, taxa still alive at modern day have the end-point of their 'hat' chosen from a uniform distribution.
`alt.method`	If `TRUE`, use the alternative method of discretizing time even if a simple model of sampling is being simulated.
`plot`	If `TRUE`, plots the sampling models for each taxon against time.

Details

This function implements a range of sampling models in continuous time. Be default, sampling is simulated under the simplest model, where sampling occurs as a Poisson process under a instantaneous sampling rate (r) which is homogeneous through time and across lineages (Foote, 1997). Under this model, the waiting times to sampling events are exponentially distributed, with an average waiting time of 1/r. This useful property allows sampling to be rapidly simulated for many taxa under this simple model in sampleRanges, by repeatedly drawing waiting times between sampling events from an exponential distribution. This is the model that is run when alpha, beta and rTimeRatio.

In addition to this simple model, sampleRanges also can consider a range of additional models, including the "hatP" and "incP" options of Liow et al. (2010). To describe the behavior of these models, users alter the default values for alpha, beta and rTimeRatio. These parameters, and r, can either be a single value which describes the behavior of the entire dataset or as a vector, of same length as the number of taxa, which describes the per-taxon value. When any rTimeRatio, alpha or beta value is not equal to one, then the sampling rate will vary across the duration of a taxon's temporal range. In general, setting alpha and beta equal to a value above 2 will produce a "hat" or bell-shaped curve, where sampling rates peak at the midpoint of taxon ranges, while setting them unequal will produce asymmetric bell curves according to the beta function (Liow et al., 2010; Liow et al. set alpha = 4 and beta = 4). rTimeRatio is the ratio of the sampling rate of the latest (most recent) time divided by the earliest (oldest) time.

The input r values will be interpreted differently based on whether one r value or per-taxon values were used. If one value was input, then it is assumed that r represent the grand mean r for the entire dataset for purposes of time-varying r, such that if rTimeRatio is not equal to 1, taxa near the end and start of the dataset will have very different per-taxon mean sampling rate. If per-taxon values of r were input, then each r is consider the per-taxon mean sampling rate. These will not be changed, but any within-lineage variation is distributed so that the mean is still the input per-taxon value. This also changes the interpretation of rTimeRatio, such that when a single r value and rTimeRatio is given, it is assumed the ratio describes the change in sampling rates from the start of the dataset to the end, while if multiple values are given for either r or rTimeRatio will instead see the value as describing the ratio at the first and last times of each taxon. For the pure hat model, this interpretation of r as a grand mean sampling means that taxa will have a sampling rate of 2 * r at the mid-peak of their range, which will have considerable implications for taxonomic incompleteness.

The particular distinctions about these parameter values are important: all models simulated in sampleRanges are structured to be effectively nested inside a most general model with parameters r, alpha, beta and rTimeRatio.

Note that the modeling of sampling in this function is independent and secondary of the actual simulation of the ranges, which are (generally) produced by the models of simFossilRecord with argument r (sampling rate) not set. Thus, 'hat-shaped range distributions' are only contained within single morphotaxa – they do not cross multiple morphotaxa in the case of anagenesis. Cryptic taxa each have their own hat and do not share a single hat; by default the ranges of cryptic taxa are merged to produce the range of a single observed morphotaxon.

'Hats' are constrained to start and end with a taxon's range, representing the rise and fall of taxa in terms of abundance and geographic range (Liow et al., 2010). However, for still-living taxa at the modern day, it is unknown how much longer they may be alive (for memory-less Poisson models, there is no age-dependent extinction). The treatment of these taxa with regards to their 'hat' (i. e. the beta distribution) is controlled by the argument randLivehat. When randLiveHat = FALSE, the beta distribution is fit so that the last appearance of still-alive taxa at the modern day is treated as a last appearance for calculating the hat. When TRUE, the default option, the still-alive taxa are considered to have gotten some distance between 0 and 1 through the beta distribution, as of the modern day. This point of progression is stochastically selected for each taxon by pulling a number from a uniform distribution, and used for calculating the hat.

Because sampling rate varies over morphotaxon ranges under any of these more complex models, sampling events cannot be quickly simulated as waiting times pulled from an exponential distribution. Instead, the taxon durations are discretized into a large number of small time intervals of length minInt (see above; minInt should be small enough that only one sampling event could feasibly happen per interval). The probability of an event occurring within each interval is calculated and used to stochastically simulate sampling events. For each interval, a number between 0 and 1 is randomly pulled from a uniform distribution and to the per-interval sampling probability to test if a sampling event occurred (if the random number is less than the probability, a sampling event is recorded). In general, this method is slower but otherwise comparable to the quicker waiting times method. See the examples below for a small test of this.

As with many functions in the paleotree library, absolute time is always decreasing, i.e. the present day is zero.

If min.taxa is set to zero, the simulation may produce output in which no taxa were ever sampled.

If modern.samp.prob is set to 1.0 (the default), then living taxa will always be sampled at least at the present day (if there are any living taxa). If the probability is less than 1, they will be sampled with that probability at the modern day.

By default, this function will merge sampling events from morphologically cryptic taxa, listing them as occurrences for the earliest member of that group. To change this behavior, set merge.cryptic = FALSE.

Conditioning on sampling some minimum number of taxa may create strange simulation results for some analyses, such as simulation analyses of birth-death processes. Set min.taxa = 0 to remove this conditioning.

Value

If ranges.only is TRUE, then the output is a two-column per-taxon matrix of first and last appearances in absolute time. NAs mean the respective taxon was never sampled in the simulation.

If ranges.only = FALSE (the default), the output is a list, where each element is a vector of sampling events the timing of sampling events, each corresponding to a different taxon in the input. Elements that are NA are unsampled taxa.

Author(s)

David W. Bapst

References

Foote, M. 1997 Estimating Taxonomic Durations and Preservation Probability. Paleobiology 23(3):278–300.

Liow, L. H., T. B. Quental, and C. R. Marshall. 2010 When Can Decreasing Diversification Rates Be Detected with Molecular Phylogenies and the Fossil Record? Systematic Biology 59(6):646–659.

Examples

set.seed(444)
record <- simFossilRecord(
             p = 0.1, 
             q = 0.1, 
             nruns = 1,
	            nTotalTaxa = c(30,40), 
	            nExtant = 0
	            )
taxa <- fossilRecord2fossilTaxa(record)

# let's see what the 'true' diversity curve looks like in this case
layout(1:2)
taxicDivCont(taxa)
# simulate a fossil record with imperfect sampling with sampleRanges
rangesCont <- sampleRanges(taxa,r = 0.5)
# plot the diversity curve based on the sampled ranges
taxicDivCont(rangesCont)

# compare the true history to what we might observe!

#let's try more complicated models!

# a pull-to-the-recent model with x5 increase over time
   # similar to Liow et al.'s  incP
layout(1:2)
rangesCont1 <- sampleRanges(taxa,
                            r = 0.5,
                            rTimeRatio = 5,
                            plot = TRUE
                            )
taxicDivCont(rangesCont1)

# a hat-shaped model
layout(1:2)
rangesCont1 <- sampleRanges(taxa,
                            r = 0.5,
                            alpha = 4,
                            beta = 4,
                            plot = TRUE
                            )
taxicDivCont(rangesCont1)

# a combination of these
layout(1:2)
rangesCont1 <- sampleRanges(taxa,
                            r = 0.5,
                            alpha = 4,
                            beta = 4,
                            rTimeRatio = 5,
                            plot = TRUE
                            )
taxicDivCont(rangesCont1)

# testing with cryptic speciation
layout(1)
recordCrypt <- simFossilRecord(p = 0.1, q = 0.1, 
                               prop.cryptic = 0.5, 
                               nruns = 1,
                               nTotalTaxa = c(20,30), 
                               nExtant = 0
                               )
taxaCrypt <- fossilRecord2fossilTaxa(recordCrypt)
rangesCrypt <- sampleRanges(taxaCrypt,r = 0.5)
taxicDivCont(rangesCrypt)

#an example of hat-shaped models (beta distributions) when there are live taxa
set.seed(444)
recordLive <- simFossilRecord(p = 0.1, 
                              q = 0.05, 
                              nruns = 1,
                              nTotalTaxa = c(5,100),
                              nExtant = c(10,100)
                              )
taxaLive <- fossilRecord2fossilTaxa(recordLive)
#with end-points of live taxa at random points in the hat
rangesLive <- sampleRanges(taxaLive,
                           r = 0.1,
                           alpha = 4,
                           beta = 4,
                           randLiveHat = TRUE,
                           plot = TRUE
                           )
#with all taxa end-points at end-point of hat
rangesLive <- sampleRanges(taxaLive,
                           r = 0.1,
                           alpha = 4,
                           beta = 4,
                           randLiveHat = FALSE,
                           plot = TRUE
                           )



#simulate a model where sampling rate evolves under brownian motion
tree <- taxa2phylo(taxa,obs = taxa[,3])
sampRateBM <- rTraitCont(tree)
sampRateBM <- sampRateBM-min(sampRateBM)
layout(1:2)
rangesCont1 <- sampleRanges(taxa,r = sampRateBM,plot = TRUE)
taxicDivCont(rangesCont1)

#evolving sampling rate, hat model and pull of the recent
layout(1:2)
rangesCont1 <- sampleRanges(taxa,
                            r = sampRateBM,
                            alpha = 4,
                            beta = 4,
                            rTimeRatio = 5,
                            plot = TRUE
                            )
taxicDivCont(rangesCont1)
layout(1)

#the simpler model is simulated by pulling waiting times from an exponential
#more complicated models are simulated by discretizing time into small intervals
#are these two methods comparable?

#let's look at the number of taxa sampled under both methods

summary(replicate(100,sum(!is.na(
               sampleRanges(taxa,
                            r = 0.5,
                            alt.method = FALSE
                            )
               [,1]))))

summary(replicate(100,sum(!is.na(
               sampleRanges(taxa,
                            r = 0.5,
                            alt.method = TRUE
                            )
               [,1]))))
               
#they look pretty similar!

paleotree documentation built on Aug. 22, 2022, 9:09 a.m.