# Models of Sampling and Extinction for Taxonomic Duration Datasets

### Description

These functions construct likelihood models of the observed frequency
of taxon durations, given either in discrete (`make_durationFreqDisc`

)
or continuous time (`make_durationFreqCont`

). These models can then be
constrained using functions available in this package and/or analyzed
with commonly used optimizing functions.

### Usage

1 2 3 4 | ```
make_durationFreqCont(timeData, groups = NULL, drop.extant = TRUE,
threshold = 0.01, tol = 1e-04)
make_durationFreqDisc(timeList, groups = NULL, drop.extant = TRUE)
``` |

### Arguments

`timeData` |
Two-column matrix of per-taxon first and last occurrence given in continuous time, relative to the modern (i.e. older dates are also the 'larger' dates). Unsampled taxa (e.g. from a simulation of sampling in the fossil record, listed as NAs the supplied matrix) are automatically dropped from the matrix and from groups simultaneously. |

`groups` |
Either NULL (the default) or matrix with the number of rows equal to the number of taxa and the number of columns equal to the number of 'systems' of categories for taxa. Taxonomic membership in different groups is indicated by numeric values. For example, a dataset could have a 'groups' matrix composed of a column representing thin and thick shelled taxa, coded 1 and 2 respectively, while the second column indicates whether taxa live in coastal, shelfal or deep marine settings, coded 1-3 respectively. Different combinations of groups will be treated as having independent sampling and extinction parameters in the default analysis, for example, thinly-shelled deep marine species will have separate parameters from thinly-shelled coastal species. Grouping systems could also represent temporal heterogeneity, for example, categorizing Paleozoic versus Mesozoic taxa. If groups are NULL (the default), all taxa are assumed to be of the same group with the same parameters. Unsampled taxa (e.g. from a simulation of sampling in the fossil record, listed as NAs in timeData or timeList) are automatically dropped from groupings and the time dataset (either timeData or timeList) and from groups simultaneously. |

`drop.extant` |
Drops all extant taxa from a dataset, w |

`threshold` |
The smallest allowable duration (i.e. the measured difference in the first and last occurrence dates for a given taxon). Durations below this size will be treated as "one-hit" sampling events. |

`tol` |
Tolerance level for determining whether a taxon from a continuous-time analysis is extant or not. Taxa which occur at a date less than tol are treated as occurring at the modern day (i.e. being functionally identical as occurring at 0 time). |

`timeList` |
A 2 column matrix with the first and last occurrences of taxa given in relative time intervals (i.e. ordered from first to last). If a list of length two is given for timeData, such as would be expected if the output of binTimeData was directly input, the second element is used. See details. Unsampled taxa (e.g. from a simulation of sampling in the fossil record, listed as NAs in the second matrix) are automatically dropped from the timeList and from groups simultaneously. Living taxa observed in the modern day are expected to be listed as last observed in a special interval (0,0), i.e. begins and ends at 0 time. This interval is always automatically removed prior to the calculation intermediary data for fitting likelihood functions. |

### Details

These functions effectively replace two older functions in paleotree, now removed,
`getSampRateCont`

and `getSampProbDisc`

. The
functions here do not offer the floating time interval options of
their older siblings, but do allow for greater flexibility in defining
constrains on parameter values. Differences in time intervals, or any
other conceivable discrete differences in parameters, can be modeled
using the generic `groups`

argument in these functions.

These functions use likelihood functions presented by Foote (1997). These analyses are ideally applied to data from single stratigraphic section but potentially are applicable to regional or global datasets, although the behavior of those datasets is less well understood.

As with many functions in the paleotree library, absolute time is always decreasing, i.e. the present day is zero and older dates are 'larger'. On the contrary, relative time is in intervals with non-zero integers that increase sequentially beginning with 1, from earliest to oldest.

For `make_durationFreqDisc`

, the intervals in timeList should be
non-overlapping sequential intervals of roughly equal length. These
should be in relative time as described above, so the earliest interval
should be 1 and the numbering should increase as the intervals go up with
age. If both previous statements are true, then differences in interval
numbers will represent the same rough difference in the absolute timing
of those intervals. For example, a dataset where all taxa are listed from
a set of sequential intervals of similar length, such as North American
Mammal assemblage zones, microfossil faunal zones or graptolite biozones
can be given as long as they are correctly numbered in sequential order
in the input. As a counter example, a dataset which includes taxa resolved
only to intervals as wide as the whole Jurassic and taxa resolved to
biozones within the Jurassic should not be included in the same input.
Drop taxa from less poorly resolved intervals from such datasets if you
want to apply this function, as long as this retains a large enough sample
of taxa listed from the sequential set of intervals.

Please check that the optimizer function you select actually converges. The likelihood surface can be very flat in some cases, particularly for small datasets (<100 taxa). If the optimizer does not converge, consider increasing iterations or changing the starting values.

### Value

A function of class "paleotreeFunc", which takes a vector equal to the number
of parameters and returns the *negative* log likelihood (for use with optim and
similar optimizing functions, which attempt to minimize support values). See the
functions listed at `modelMethods`

for manipulating and examining
such functions and `constrainParPaleo`

for constraining parameters.

Parameters in the output functions are named 'q' for the instantaneous per-capita extinction rate, 'r' for the instantaneous per-capita sampling rate and 'R' for the per-interval taxonomic sampling probability. Groupings follow the parameter names, separated by periods; by default, the parameters will be placed in at least group 1 (of a grouping with a single group), such that make_durationFreqCont by default creates a function with parameters named 'q.1' and 'r.1', while make_durationFreqDisc creates a function with parameters named 'q.1' and 'R.1'.

Note that the 'q' parameters estimated by `make_durationFreqDisc`

is scaled to
per lineage intervals and not to per lineage time-units. If intervals are the same length, this
can be easily corrected by multiplying 1 by the interval length. It is unclear
how to treat uneven intervals and I urge workers to consider multiple strategies.

For translating these sampling probabilities and sampling rates, see
`SamplingConv`

.

### Author(s)

David W. Bapst

### References

Foote, M. 1997 Estimating Taxonomic Durations and Preservation
Probability. *Paleobiology* **23**(3):278–300.

Foote, M., and D. M. Raup. 1996 Fossil preservation and the stratigraphic
ranges of taxa. *Paleobiology* **22**(2):121–140.

### See Also

See `freqRat`

, `sRate2sProb`

,
`qsRate2Comp`

`sProb2sRate`

and `qsProb2Comp`

.

### Examples

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 | ```
#let's simulate some taxon ranges from an imperfectly sampled fossil record
set.seed(444)
record<-simFossilRecord(p=0.1, q=0.1, nruns=1,
nTotalTaxa=c(30,40), nExtant=0)
taxa<-fossilRecord2fossilTaxa(record)
rangesCont <- sampleRanges(taxa,r=0.5)
#bin the ranges into discrete time intervals
rangesDisc <- binTimeData(rangesCont,int.length=1)
#note that we made interval lengths = 1:
# thus q (per int) = q (per time) for make_durationFreqDisc
## Not run:
#old ways of doing it (defunct as of paleotree version 2.6)
getSampRateCont(rangesCont)
getSampProbDisc(rangesDisc)
## End(Not run)
#new ways of doing it
# we can constrain our functions
# we can use parInit, parLower and parUpper to control parameter bounds
#as opposed to getSampRateCont, we can do:
likFun<-make_durationFreqCont(rangesCont)
optim(parInit(likFun),likFun,lower=parLower(likFun),upper=parUpper(likFun),
method="L-BFGS-B",control=list(maxit=1000000))
#as opposed to getSampProbDisc, we can do:
likFun<-make_durationFreqDisc(rangesDisc)
optim(parInit(likFun),likFun,lower=parLower(likFun),upper=parUpper(likFun),
method="L-BFGS-B",control=list(maxit=1000000))
#these give the same answers (as we'd expect them to!)
#with newer functions we can constrain our functions easily
# what if we knew the extinction rate = 0.1 a priori?
likFun<-make_durationFreqCont(rangesCont)
likFun<-constrainParPaleo(likFun,q.1~0.1)
optim(parInit(likFun),likFun,lower=parLower(likFun),upper=parUpper(likFun),
method="L-BFGS-B",control=list(maxit=1000000))
#actually decreases our sampling rate estimate
# gets further away from true generating value, r = 0.5 (geesh!)
# but this *is* a small dataset...
``` |