sample.clade: General rate species sampling

Description Usage Arguments Value Author(s) Examples

View source: R/sample.clade.R

Description

Generates a data.frame containing either true occurrence times or time ranges for each of the desired species using a Poisson process. Allows for the Poisson rate to be (1) a constant, (2) a function of time, (3) a function of time and an environmental variable, or (4) a vector of numbers. Also allows as an optional parameter a distribution representing the expected occurrence number over a species duration (in which case average rate must be constant). Allows for further flexibility in (non-age dependent) rates by a shift times vector and environmental matrix parameters. Optionally takes a vector of time bins representing geologic periods, so that if the user wishes occurrence times can be presented as a range instead of true points. Finally, allows for an optional argument - the maximum of the age-dependent distribution - that can make the simulation faster, and for extra arguments the age-dependent preservation function may take. See sample.species - absolute time-dependent sampling - and sample.adpp - age-dependent sampling - for more information.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
sample.clade(
  sim,
  rr,
  tMax,
  S = NULL,
  envRR = NULL,
  rShifts = NULL,
  returnTrue = TRUE,
  bins = NULL,
  dFun = NULL,
  dFunMax = NULL,
  ...
)

Arguments

sim

A sim object, usually an output of bd.sim.

rr

Sampling rate (per species per million years) over time. It can be a numeric describing a constant rate, a function(t) describing the variation in sampling over time t, a function(t, env) describing the variation in sampling over time following both time AND an environmental variable (please see envRR for details) or a vector containing rates that correspond to each rate between sampling rate shift times times (please see rShifts). Note that rr should should always be greater than or equal to zero.

tMax

The maximum simulation time, used by rexp.var. A sampling time greater than tMax would mean the occurrence is sampled after the present, so for consistency we require this argument. This is also required to ensure time follows the correct direction both in the Poisson process and in the return.

S

A vector of species numbers to be sampled. Could be only a subset of the species if the user wishes. The default is all species in sim. Species not included in S will not be sampled by the function.

envRR

A matrix containing time points and values of an environmental variable, like temperature, for each time point. This will be used to create a sampling rate, so rr must be a function of time and said variable if envRR is not NULL. Note paleobuddy has two environmental data frames, temp and co2. See RPANDA for more examples.

rShifts

Vector of rate shifts. First element must be the starting time for the simulation (0 or tMax). It must have the same length as pp. c(0, x, tMax) is equivalent to c(tMax, tMax - x, 0) for the purposes of make.rate.

Note: using this method for step-function rates is currently slower than using ifelse.

returnTrue

If set to TRUE, the returned data frame will contain true times of sampling. If set to FALSE, we call binner and the the returned data frame will contain ranges of sampling times based on bins.

If returnTrue is false, sample.clade returns the occurrence times as ranges. In this way, we simulate the granularity in real world fossil records. If returnTrue is true, this is ignored. If bins is not supplied and returnTrue == FALSE, the default is seq(tMax, 0, 0.1).

bins

A vector of time intervals corresponding to geological time ranges.

dFun

A density function representing the age-dependent preservation model. It must be a density function, and consequently integrate to 1 (though this condition is not verified by the function). Additionally, the function must have the following properties:

  • Returns a vector of preservation densities for each time in a given vector t in geological time.

  • Be parametrized in absolute geological time (i.e. should be relative to absolute geological time, in Mya, not the lineage's age).

  • Should be limited between s (i.e. the lineage's speciation/origination in geological time) and e (i.e. the lineage's extinction in geological time), with s > e.

  • Include the arguments t, s, e and sp. The argument sp is used to pass species-specific parameters (see examples), allowing for dFun to be species-inhomogeneous.

dFunMax

A function that calculates the maximum (density) value of dFun using its arguments, excluding t. It can also be a number representing the maximum density.

Note: if it not provided, it will be approximated numerically, leading to longer running times.

...

Additional parameters related to dFun and dFunMax.

Value

A data.frame containing species names/numbers, whether each species is extant, and either the true occurrence times of species or a range of occurrence times based on bins.

Author(s)

Matheus Januario and Bruno do Rosario Petrucci.

Examples

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
###
# we can start with a constant case

# simulate a group
sim <- bd.sim(n0 = 1, pp = 0.1, qq = 0.1, tMax = 10)

# in case first simulation is short-lived
while ((sim$TS[1] - ifelse(is.na(sim$TE[1]), 0, sim$TE[1])) < 10) {
  sim <- bd.sim(n0 = 1, pp = 0.1, qq = 0.1, tMax = 10)
}

# we will need to get exact durations for some examples, so
sim$TE[sim$EXTANT] <- 0
# this is necessary since the default is to have NA for extant species

# sampling rate
r <- 2

# the resolution of the fossil dataset:
bins <- seq(from = 10, to = 0,
            by = -0.1)
# note that we will provide a very high resolution to test the function

# find the occurrence data frame
dt <- sample.clade(sim, r, tMax = 10, bins = bins, returnTrue = FALSE)

# extract species identity
ids <- unique(dt$Species)

# approximate sampling time (since it is a range)
mids <- (dt$MaxT - dt$MinT) / 2 + dt$MinT

# for each species
for (i in 1:length(ids)) {
  # get the species number
  sp <- unique(as.numeric(gsub("spp_", "", ids[i])))
  
  # check the histogram
  hist(mids[dt$Species == ids[i]],
       main = paste0("spp = ", sp, "; duration ~ ",
                     round(sim$TS[sp] - sim$TE[sp], digits = 2), "my"),
       xlab = "Time (My)",
       xlim = c(sim$TS[i], sim$TE[i]))
}

###
# sampling can be any function of time in the non-age dependent case, of course

# simulate a group
sim <- bd.sim(n0 = 1, pp = 0.1, qq = 0.1, tMax = 10)

# in case first simulation is short-lived
while ((sim$TS[1] - ifelse(is.na(sim$TE[1]), 0, sim$TE[1])) < 10) {
  sim <- bd.sim(n0 = 1, pp = 0.1, qq = 0.1, tMax = 10)
}

# we will need to get exact durations for some examples, so
sim$TE[sim$EXTANT] <- 0
# this is necessary since the default is to have NA for extant species

# sampling rate
r <- function(t) {
  return(3 - 0.15*t)
}

# the resolution of the fossil dataset:
bins <- seq(from = 10, to = 0,
            by = -0.1)
# note that we will provide a very high resolution to test the function

# find the occurrence data frame
dt <- sample.clade(sim, r, tMax = 10, bins = bins, returnTrue = FALSE)

# extract species identity
ids <- unique(dt$Species)

# approximate sampling time (since it is a range)
mids <- (dt$MaxT - dt$MinT) / 2 + dt$MinT

# for each species
for (i in 1:length(ids)) {
  # get the species number
  sp <- unique(as.numeric(gsub("spp_", "", ids[i])))
  
  # check the histogram
  hist(mids[dt$Species == ids[i]],
       main = paste0("spp = ", sp, "; duration ~ ",
                     round(sim$TS[sp] - sim$TE[sp], digits = 2), "my"),
       xlab = "Time (My)",
       xlim = c(sim$TS[i], sim$TE[i]))
}

###
# now we can try a step function rate

# simulate a group
sim <- bd.sim(n0 = 1, pp = 0.1, qq = 0.1, tMax = 10)

# in case first simulation is short-lived
while ((sim$TS[1] - ifelse(is.na(sim$TE[1]), 0, sim$TE[1])) < 10) {
  sim <- bd.sim(n0 = 1, pp = 0.1, qq = 0.1, tMax = 10)
}

# we will need to get exact durations for some examples, so
sim$TE[sim$EXTANT] <- 0
# this is necessary since the default is to have NA for extant species

# we will use the less efficient method of creating a step function
# one could instead use ifelse()

# rates vector
rList <- c(1, 2, 0.5)

# rate shifts vector
rShifts <- c(0, 4, 8)

# make it a function so we can plot it
r <- make.rate(rList, 10, fShifts=rShifts)

# the resolution of the fossil dataset:
bins <- seq(from = 10, to = 0,
            by = -0.1)
# note that we will provide a very high resolution to test the function

# find the occurrence data frame
dt <- sample.clade(sim, rList, rShifts = rShifts, 
                   tMax = 10, bins = bins, returnTrue = FALSE)

# extract species identity
ids <- unique(dt$Species)

# approximate sampling time (since it is a range)
mids <- (dt$MaxT - dt$MinT) / 2 + dt$MinT

# for each species
for (i in 1:length(ids)) {
  # get the species number
  sp <- unique(as.numeric(gsub("spp_", "", ids[i])))
  
  # check the histogram
  hist(mids[dt$Species == ids[i]],
       main = paste0("spp = ", sp, "; duration ~ ",
                     round(sim$TS[sp] - sim$TE[sp], digits = 2), "my"),
       xlab = "Time (My)",
       xlim = c(sim$TS[i], sim$TE[i]))
}

###
# finally, sample.clade also accepts an environmental variable

# get temperature data
data(temp)

# simulate a group
sim <- bd.sim(n0 = 1, pp = 0.1, qq = 0.1, tMax = 10)

# in case first simulation is short-lived
while ((sim$TS[1] - ifelse(is.na(sim$TE[1]), 0, sim$TE[1])) < 10) {
  sim <- bd.sim(n0 = 1, pp = 0.1, qq = 0.1, tMax = 10)
}

# we will need to get exact durations for some examples, so
sim$TE[sim$EXTANT] <- 0
# this is necessary since the default is to have NA for extant species

# make temperature the environmental dependency of r
envR <- temp

# we can then make sampling dependent on the temperature
r <- function(t, env) {
  return(0.5*env)
}

# make it a function so we can plot it
rr <- make.rate(r, envF = envR)

# let us check that r is high enough to see a pattern
plot(1:10, rr(1:10), type = 'l', main = "Sampling rate",
     xlab = "My", ylab = "r")

# the resolution of the fossil dataset:
bins <- seq(from = 10, to = 0,
            by = -0.1)
# note that we will provide a very high resolution to test the function

# find the occurrence data frame
dt <- sample.clade(sim, r, tMax = 10, envRR = envR,
                   bins = bins, returnTrue = FALSE)

# extract species identity
ids <- unique(dt$Species)

# approximate sampling time (since it is a range)
mids <- (dt$MaxT - dt$MinT) / 2 + dt$MinT

# for each species
for (i in 1:length(ids)) {
  # get the species number
  sp <- unique(as.numeric(gsub("spp_", "", ids[i])))

  # check the histogram
  hist(mids[dt$Species == ids[i]],
       main = paste0("spp = ", sp, "; duration ~ ",
                     round(sim$TS[sp] - sim$TE[sp], digits = 2), "my"),
       xlab = "Time (My)",
       xlim = c(sim$TS[i], sim$TE[i]))
}

# we will now do some tests with age-dependent rates. For more details,
# check sample.adpp.

###
# simulate a group
sim <- bd.sim(n0 = 1, pp = 0.1, qq = 0.1, tMax = 10)

# in case first simulation is short-lived
while ((sim$TS[1] - ifelse(is.na(sim$TE[1]), 0, sim$TE[1])) < 10) {
  sim <- bd.sim(n0 = 1, pp = 0.1, qq = 0.1, tMax = 10)
}

# we will need to get exact durations for some examples, so
sim$TE[sim$EXTANT] <- 0
# this is necessary since the default is to have NA for extant species

# here we will use the PERT function. It is described in:
# Silvestro et al 2014

# preservation function
dPERT <- function(t, s, e, sp, a = 3, b = 3, log = FALSE) {
  
  # check if it is a valid PERT
  if (e >= s) {
    message("There is no PERT with e >= s")
    return(rep(NaN, times = length(t)))
  }
  
  # find the valid and invalid times
  id1 <- which(t <= e | t >= s)
  id2 <- which(!(t <= e | t >= s))
  t <- t[id2]
  
  # initialize result vector
  res <- vector()
  
  # if user wants a log function
  if (log) {
    # invalid times get -Inf
    res[id1] <- -Inf
    
    # valid times calculated with log
    res[id2] <- log(((s - t) ^ 2)*((-e + t) ^ 2)/((s - e) ^ 5*beta(a,b)))
  }
  # otherwise
  else{
    res[id1] <- 0
    
    res[id2] <- ((s - t) ^ 2)*((-e + t) ^ 2)/((s - e) ^ 5*beta(a,b))
  }
  
  return(res)
}

# function to calculate max of the PERT
dPERTmax <- function(s, e, sp) {
  return(((s - e) / 2) + e)
}

# the resolution of the fossil dataset:
bins <- seq(from = 10, to = 0,
            by = -0.1)
# note that we will provide a very high resolution to test the function

dt <- sample.clade(sim, rr = 3, tMax = 10, bins = bins,
                   dFun = dPERT, dFunMax = dPERTmax, returnTrue = FALSE)

# extract species identity
ids <- unique(dt$Species)

# approximate sampling time (since it is a range)
mids <- (dt$MaxT - dt$MinT) / 2 + dt$MinT

# for each species
for (i in 1:length(ids)) {
  # get the species number
  sp <- unique(as.numeric(gsub("spp_", "", ids[i])))
  
  # check the histogram
  hist(mids[dt$Species == ids[[i]]],
       main = paste0("spp = ", sp, "; duration ~ ",
                     round(sim$TS[sp] - sim$TE[sp], digits = 2), "my"),
       xlab = "Time (My)", probability = TRUE)
  
  # expected curve
  curve(dPERT(x, s = sim$TS[sp], e = sim$TE[sp], sp = sp), from = sim$TE[sp],
        to = sim$TS[sp], add = TRUE, col = "red", n = 100)
}
# we provide curves for comparison here, but remember the low sample sizes 
# (and bins) may affect the quality of the fit. See vignettes for more 
# thorough testing

###
# now, a hat-shaped increase through the duration of a species dependent on two
# parameters

# simulate a group
sim <- bd.sim(n0 = 1, pp = 0.1, qq = 0.1, tMax = 10)

# in case first simulation is short-lived
while ((sim$TS[1] - ifelse(is.na(sim$TE[1]), 0, sim$TE[1])) < 10) {
  sim <- bd.sim(n0 = 1, pp = 0.1, qq = 0.1, tMax = 10)
}

# we will need to get exact durations for some examples, so
sim$TE[sim$EXTANT] <- 0
# this is necessary since the default is to have NA for extant species

# preservation function in respect to age, with the "mode" of the triangle
# being exactly at the last quarter of the duration of EACH lineage.
dTRImod2<-function(t, s, e, sp) {
  
  # make sure it is a valid TRI
  if (e >= s) {
    message("There is no TRI with e >= s")
    return(rep(NaN, times = length(t)))
  }
  
  # here is the difference from the function in example 3 and 4
  md <- par[sp] + par1[sp]
  
  # check that md is valid
  if (md < e | md > s) {
    message("There is no TRI with md outside [s, e] interval")
    return(rep(NaN, times = length(t)))
  }
  
  id1 <- which(t >= e & t < md)
  id2 <- which(t == md)
  id3 <- which(t > md & t <= s)
  id4 <- which(!(1:length(t) %in% c(id1,id2,id3)))
  
  res <- vector()
  
  res[id1] <- (2*(t[id1] - e)) / ((s - e)*(md - e))
  res[id2] <- 2 / (s - e)
  res[id3] <- (2*(s - t[id3])) / ((s - e)*(s - md))
  res[id4] <- 0
  
  return(res)
  #for more details in this function, see example 3 and 4
}

# maximum function
dTRImaxmod2 <- function(s, e, sp) {
  return(2 / (s - e))
}

# a random point inside each lineage's duration
par <- runif (n = length(sim$TE), min = sim$TE, max = sim$TS)

# a distance between "par" and the lineage's duration middle
par1 <- (((sim$TS - sim$TE) / 2) + sim$TE) - par

# the resolution of the fossil dataset:
bins <- seq(from = 10, to = 0,
            by = -0.1)
# note that we will provide a very high resolution to test the function

dt <- sample.clade(sim, rr = 4, tMax = 10, bins = bins,
                   dFun = dTRImod2, dFunMax = dTRImaxmod2, returnTrue = FALSE)

# extract species identity
ids <- unique(dt$Species)

# approximate sampling time (since it is a range)
mids <- (dt$MaxT - dt$MinT) / 2 + dt$MinT

# for each species
for (i in 1:length(ids)) {
  # get the species number
  sp <- unique(as.numeric(gsub("spp_", "", ids[i])))
  
  # check the histogram
  hist(mids[dt$Species == ids[[i]]],
       main = paste0("spp = ", sp, "; duration ~ ",
                     round(sim$TS[sp] - sim$TE[sp], digits = 2), "my"),
       xlab = "Time (My)", probability = TRUE)
  
  # expected curve
  curve(dTRImod2(x, e=sim$TE[sp], s=sim$TS[sp], sp = sp),from = sim$TE[sp],
        to = sim$TS[sp], add=TRUE, col="red", n = 100)
}
# we provide curves for comparison here, but remember the low sample sizes 
# (and bins) may affect the quality of the fit. See vignettes for more 
# thorough testing

brpetrucci/paleobuddy documentation built on Aug. 8, 2020, 2:03 a.m.