```r
knitr::opts_chunk$set(echo = TRUE)
```
There are a number of fantastic packages for analysing data with integral projection models (IPMs) in R, such as IPMpack and ipmr. PaGAn also has its own implementation of integral projection models. The PaGAn implementation has been designed with a focus on flexibility and on the ability to embed the outputs from these models into further analysis. Under the hood PaGAn uses Bayesian techniques (powered by NIMBLE), which means that full posterior estimates are available for the coefficients of the vital rate functions, for the kernels, and for population-level summaries.
```r
# This chunk will be removed once the package is completed - in the meantime it
# loads the required libraries for the analysis and imports the relevant source
# files from the project GitHub pages
library(nimble)
library(parallel)
library(coda)
library(DHARMa)
library(ggplot2)
source("https://raw.githubusercontent.com/joechip90/PaGAn/master/R/mcmcInternals.R")
source("https://raw.githubusercontent.com/joechip90/PaGAn/master/R/glmNIMBLE.R")
source("https://raw.githubusercontent.com/joechip90/PaGAn/master/R/ipmNIMBLE.R")
```
For this example we will use the same data that is used in the great tutorial on IPM modelling in R by Cory Merow. This is a dataset on the plant *Dracocephalum austriacum*. The dataset is a single data frame: adults that survived from 2001 to 2002 have their sizes in the two years given in the columns 'size' and 'sizeNext' respectively, and the number of seeds that they produced is given in the column 'fec.seed'. These surviving adults have the value '1' in the survival status column 'surv'. Adults that did not survive only have an entry for their 2001 size and have the value '0' in the 'surv' column. Finally, new recruits are only given a size in 2002 (the 'sizeNext' column) and have no entries in the 'surv', 'size', and 'fec.seed' columns.
```r
# Download the data
exData <- read.csv("https://cmerow.github.io/RDataScience/21_assets/Exercises/Intro_to_IPMs_Exercises_Data.csv")
# Show a snippet of the data
knitr::kable(head(exData))
```
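Before fitting anything it can be useful to confirm that the data really do follow this structure. The short sketch below relies only on the column layout described above and simply tallies the survivors, the plants that died, and the new recruits.

```r
# Tally the three groups of records implied by the NA structure of the data
nSurvivors <- sum(exData$surv == 1, na.rm = TRUE)               # Alive in both 2001 and 2002
nDeaths <- sum(exData$surv == 0, na.rm = TRUE)                  # Present in 2001 but did not survive
nRecruits <- sum(is.na(exData$size) & !is.na(exData$sizeNext))  # New recruits first seen in 2002
c(survivors = nSurvivors, deaths = nDeaths, recruits = nRecruits)
```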
The first stage of running an IPM analysis is to fit the vital rate regressions. In PaGAn the user can define as many vital rate regressions as their analysis needs, and can give each of them any name they like. Following Cory's example, we start with the survival regression. This vital rate is a logistic regression with survival status as the response and size as the only predictor variable: the data are assumed to have a binomial error distribution and a logit link function is used.
```r
# Define the model and the error family
surv.modelFormula <- cbind(surv, 1 - surv) ~ size
surv.errorFamily <- binomial(link = "logit")
# Sub-set the data frame so only the rows with relevant data are included
surv.inputData <- exData[!is.na(exData$surv) & !is.na(exData$size), ]
```
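For readers more familiar with R's standard modelling tools, the sketch below fits the same survival regression with base R's glm function. It is shown purely as a frequentist point of comparison and is not part of the PaGAn workflow.

```r
# Frequentist analogue of the survival vital rate, shown for comparison only
survGlm <- glm(surv.modelFormula, family = surv.errorFamily, data = surv.inputData)
summary(survGlm)
```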
Next up is the sub-model for growth. Again, following Cory's example we use a Gaussian regression with the size in 2002 as the response variable and the size in 2001 as the predictor variable.
```r
# Define the model (an error family is not needed because a Gaussian family is assumed if none is specified)
growth.modelFormula <- sizeNext ~ size
# Sub-set the data frame so only the rows with relevant data are included
growth.inputData <- exData[!is.na(exData$sizeNext) & !is.na(exData$size), ]
```
Finally, we provide a sub-model for fecundity. Here the number of seeds is the response variable and depends on the size of the plant in 2001. The errors are assumed to be Poisson distributed and a log link function is used.
```r
# Define the model and the error family
fec.modelFormula <- fec.seed ~ size
fec.errorFamily <- poisson(link = "log")
# Sub-set the data frame so only the rows with relevant data are included
fec.inputData <- exData[!is.na(exData$fec.seed) & !is.na(exData$size), ]
```
Because the model fitting mechanism in PaGAn uses Markov chain Monte Carlo (MCMC), we can also specify some settings for the fitting algorithm. This is done using a list with the following possible entries:

* numRuns The number of iterations that each chain will run (after the 'burn-in' phase has ended)
* numChains The number of chains to run
* numBurnIn The number of samples at the start of each chain to discard
* thinDensity Thinning parameter for the regression coefficients and error distribution parameters of the model. This should usually be kept at 1 unless you need to run the chains for so long that you run into memory problems before convergence
* predictThinDensity Thinning parameter for the predictions from the model. Where the dataset is very large, the number of samples for each of the prediction densities can cause memory problems. This parameter allows separate control over thinning of the posterior prediction densities whilst maintaining denser sampling of the regression coefficients
```r
# Define the MCMC parameters
mcmcParams <- list(
  numRuns = 1000,
  numChains = 4,
  numBurnIn = 500,
  thinDensity = 1,
  predictThinDensity = 1
)
```
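The settings above use fairly short chains. If, for your own data, the chains needed to run for much longer before convergence, a thinned configuration along the lines of the hypothetical example below (the exact numbers are purely illustrative, not a recommendation) would keep the memory footprint manageable.

```r
# A hypothetical, more heavily thinned configuration for long runs (illustrative only)
longRunParams <- list(
  numRuns = 20000,          # Iterations per chain after burn-in
  numChains = 4,            # Number of independent chains
  numBurnIn = 5000,         # Iterations discarded at the start of each chain
  thinDensity = 10,         # Store only every 10th sample of the coefficients
  predictThinDensity = 50   # Thin the posterior prediction densities more aggressively
)
```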
In PaGAn the vital rate models are all fitted simultaneously using the 'ipmNIMBLE' function. Each vital rate sub-model must be given a name, and arguments are passed to a sub-model using the naming convention '[sub-model name].[argument]', where '[sub-model name]' is replaced by the name of your vital-rate sub-model and '[argument]' is replaced by the argument to pass to that sub-model. For example, 'surv.modelFormula' refers to the 'modelFormula' argument of the 'surv' sub-model. You can name your vital-rate sub-models in any way you like as long as you only use alphanumeric characters and do not start the name with a digit. You can provide as many sub-models as you like: in this example we use three but you could have more or fewer as necessary for your application. The 'ipmNIMBLE' function understands the following sub-model arguments:

* modelFormula The formula for the vital rate regression, using the standard model notation used in R's glm function
* errorFamily Specification of the family of the error distribution, using R's family objects. If this is not provided then a Gaussian error model is assumed
* inputData The data to use for fitting the vital rate. This is optional and global objects will be used instead if this argument is not provided
* regCoeffs An advanced setting that allows for regularization of the coefficients. This defaults to "none" but can be "ridge" or "lasso" if those types of regression are required
In addition to the sub-model arguments, 'ipmNIMBLE' has the following optional global arguments:

* mcmcParams A list object containing the parameters for the MCMC
* inputData A data frame that will be used as the input data if the 'inputData' argument is not supplied for a sub-model
* numCores Defaults to 1 but can be set higher if you wish to use multiple cores to fit the model (sometimes useful for large datasets or where there are many vital rate functions). However, using multicore analysis on Windows systems can result in the status information not being displayed to the console
In this example we therefore call the function using the following arguments:
```r
# Run the IPM
ipmOut <- ipmNIMBLE(
  mcmcParams = mcmcParams,
  # Define the survival sub-model
  surv.modelFormula = surv.modelFormula,
  surv.errorFamily = surv.errorFamily,
  surv.inputData = surv.inputData,
  # Define the growth sub-model
  growth.modelFormula = growth.modelFormula,
  growth.inputData = growth.inputData,
  # Define the fecundity sub-model
  fec.modelFormula = fec.modelFormula,
  fec.errorFamily = fec.errorFamily,
  fec.inputData = fec.inputData
)
```
After the vital rate models have been fitted we can extract some of the model outputs. The resultant 'ipmOut' object is a list with a named element for each of the vital rate sub-models. Each of the sub-models has a long list of outputs that can be explored. For example, for the survival sub-model we can explore the following outputs:
```r
# Retrieve the WAIC for the sub-model
ipmOut$surv$WAIC
# Retrieve the Bayesian R-squared for the sub-model
setNames(
  c(ipmOut$surv$rSquared$mean, ipmOut$surv$rSquared$quant0.025, ipmOut$surv$rSquared$quant0.975),
  c("expected", "lower95credible", "upper95credible"))
# Plot the DHARMa residual plots for the sub-model
plot(ipmOut$surv$DHARMaResiduals)
# Plot the marginal posterior distributions of the regression coefficients associated with the sub-model
ipmOut$surv$parameterFigure
# Plot the trace plots of the MCMC samples of the regression coefficients for the sub-model
plot(ipmOut$surv$parameterSamples)
# Plot the Gelman-Rubin convergence diagnostic shrink plots for the sub-model
gelman.plot(ipmOut$surv$parameterSamples)
```
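The use of 'gelman.plot' above suggests that 'parameterSamples' is a coda 'mcmc.list' object; assuming that is the case, numerical convergence summaries can also be computed directly with coda.

```r
# Numerical posterior summaries of the regression coefficients
summary(ipmOut$surv$parameterSamples)
# Effective sample sizes and the Gelman-Rubin diagnostic as numbers rather than plots
effectiveSize(ipmOut$surv$parameterSamples)
gelman.diag(ipmOut$surv$parameterSamples)
```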
The next step in any IPM analysis is to create kernels for the life history characteristics that we are interested in. In PaGAn we define kernels by creating R functions that combine the predictions of the vital rate functions in such a way as to calculate the life history trait of interest. If your function has any arguments of the form '[sub-model name].[aspect]', where '[sub-model name]' is the name of one of your vital rate sub-models, then the relevant 'aspect' of that vital rate is provided to the function. The following aspects can currently be requested:

* response A prediction from the vital rate sub-model (on the response scale) according to an input data set (see later)
* linpred A prediction from the vital rate sub-model (on the linear predictor scale) according to an input data set (see later)
* sd The standard deviation of the error distribution of the sub-model
* var The variance of the error distribution of the sub-model
* prec The precision of the error distribution of the sub-model
* scale The scale of the error distribution of the sub-model (can only be used for error distributions that have a scale parameter)
We next define a data set for which we want to make predictions. In this example, we want to predict the probabilities of arriving at a size in 2002 given a size in 2001, so we can use the 'seq' function in R to make a sequence of sizes to use as the 2001 values. This will be used as input data for the vital-rate model predictions ('response' and 'linpred') described earlier. Our models here all use only 'size' as a predictor variable, so we only need size in the prediction data set. However, if our vital rate models had other covariates (such as temperature or precipitation) then we would need to provide those in our prediction data set too (see the sketch after the next code chunk).
```r
# Create a prediction data set to evaluate the kernels on
testData <- data.frame(size = seq(0.0, 10.0, length.out = 100))
# Calculate the distance between each prediction point
cellWidth <- diff(testData$size[1:2])
```
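As a purely hypothetical illustration of the point about covariates, if the vital rate formulas had also included a predictor called 'temp' (which is not in this dataset), the prediction data frame would need to supply a value for it as well; here a single fixed temperature is recycled across all of the prediction sizes.

```r
# Hypothetical prediction data set for vital rate models that also use a 'temp' covariate
testDataWithCovariate <- data.frame(
  size = seq(0.0, 10.0, length.out = 100),  # Sizes in 2001, as before
  temp = 15.0                               # A single fixed temperature value (illustrative)
)
```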
For example, the growth kernel of Cory's example would be defined by the following kernel function:
```r
# Function to evaluate the growth kernel
growthKern <- function(x, growth.response, growth.sd, cellWidth) {
  dnorm(x, growth.response, growth.sd) * cellWidth
}
```
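The same kernel could equally be written in terms of the 'var' aspect described above rather than 'sd'; assuming 'var' returns the variance of the Gaussian error distribution, the two forms give identical results.

```r
# Equivalent growth kernel written using the error variance instead of the standard deviation
growthKernVar <- function(x, growth.response, growth.var, cellWidth) {
  dnorm(x, growth.response, sqrt(growth.var)) * cellWidth
}
```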
The first argument of the kernel function must be the value for which the probability density is being estimated (here the 2002 size after growth). We can then use the 'ipmKernelEvaluation' function to evaluate this kernel over the prediction data set.
```r
# Evaluate the growth kernel
growthMat <- ipmKernelEvaluation(ipmOut, testData, testData$size, growthKern, cellWidth = cellWidth)
# Plot the expected values of the growth kernel matrix
image(testData$size, testData$size, t(growthMat$expected), xlab = "size(t)", ylab = "size(t + 1)")
# Pull out a specific sample from the MCMC samples (here the 10th)
image(testData$size, testData$size, t(growthMat$samples[, , 10]), xlab = "size(t)", ylab = "size(t + 1)")
```
You can see from this output that the 'ipmKernelEvaluation' function produces a list with two elements: 'expected', which is the expected kernel matrix, and 'samples', which is the kernel matrix evaluated for each of the samples from the MCMC. Because this analysis is Bayesian we do not have just one kernel matrix but a whole posterior distribution of kernel matrices. The user can also pass other arguments to the kernel function by providing them as named arguments to 'ipmKernelEvaluation', as the recruitment kernel below demonstrates.
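Because the whole posterior distribution of the kernel is returned, uncertainty summaries can be computed directly from the 'samples' element. The sketch below assumes only that the third dimension of 'samples' indexes the MCMC samples (as the indexing in the previous chunk suggests) and maps the pointwise width of the 95% credible interval for the growth kernel.

```r
# Pointwise 95% credible bounds for each entry of the growth kernel
growthLower <- apply(growthMat$samples, c(1, 2), quantile, probs = 0.025)
growthUpper <- apply(growthMat$samples, c(1, 2), quantile, probs = 0.975)
# Map the width of the credible interval across the kernel
image(testData$size, testData$size, t(growthUpper - growthLower),
  xlab = "size(t)", ylab = "size(t + 1)")
```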
```r
# Evaluate the growth/survival kernel
growthSurvKern <- function(x, growth.response, growth.sd, surv.response, cellWidth) {
  growthKern(x, growth.response, growth.sd, cellWidth) * surv.response
}
growthSurvMat <- ipmKernelEvaluation(ipmOut, testData, testData$size, growthSurvKern, cellWidth = cellWidth)
image(testData$size, testData$size, t(growthSurvMat$expected), xlab = "size(t)", ylab = "size(t + 1)")
# Evaluate the recruitment kernel
establishProb <- sum(is.na(exData$size)) / sum(exData$fec.seed, na.rm = TRUE)
recsizeMean <- mean(exData$sizeNext[is.na(exData$size)])
recsizeSD <- sd(exData$sizeNext[is.na(exData$size)])
recKern <- function(x, fec.response, establishProb, recsizeMean, recsizeSD, cellWidth) {
  establishProb * fec.response * dnorm(x, recsizeMean, recsizeSD) * cellWidth
}
recMat <- ipmKernelEvaluation(ipmOut, testData, testData$size, recKern,
  establishProb = establishProb, recsizeMean = recsizeMean, recsizeSD = recsizeSD, cellWidth = cellWidth)
image(testData$size, testData$size, t(recMat$expected), xlab = "size(t)", ylab = "size(t + 1)")
# Evaluate the full kernel
fullKern <- function(x, growth.response, growth.sd, surv.response, fec.response,
                     establishProb, recsizeMean, recsizeSD, cellWidth) {
  growthSurvKern(x, growth.response, growth.sd, surv.response, cellWidth) +
    recKern(x, fec.response, establishProb, recsizeMean, recsizeSD, cellWidth)
}
fullMat <- ipmKernelEvaluation(ipmOut, testData, testData$size, fullKern,
  establishProb = establishProb, recsizeMean = recsizeMean, recsizeSD = recsizeSD, cellWidth = cellWidth)
image(testData$size, testData$size, t(fullMat$expected), xlab = "size(t)", ylab = "size(t + 1)")
# Analyse the full kernel
fullMatAnalysis <- ipmKernelAnalysis(fullMat)
ggplot(data = data.frame(growthRate = fullMatAnalysis$assymPopGrowth)) +
  geom_histogram(aes(x = growthRate), fill = "#69b3a2", color = "#e9ecef") +
  geom_vline(xintercept = fullMatAnalysis$assymPopGrowthSummary[4], linetype = "dashed") +
  geom_vline(xintercept = fullMatAnalysis$assymPopGrowthSummary[5], linetype = "dashed") +
  geom_vline(xintercept = fullMatAnalysis$assymPopGrowthSummary[1]) +
  theme_classic() + ylab("Count") + xlab("Asymptotic Growth Rate")
ggplot(data = cbind(data.frame(size = testData$size), fullMatAnalysis$stableDistSummary)) +
  geom_ribbon(aes(x = size, ymin = lower95cred, ymax = upper95cred), fill = "#69b3a2", color = "#e9ecef") +
  geom_line(aes(x = size, y = mean)) +
  theme_classic() + ylab("") + xlab("Size")
```
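Since 'assymPopGrowth' appears to be a vector of posterior samples of the asymptotic growth rate (that is how it is used in the histogram above), other posterior summaries are straightforward to compute; for example, assuming that interpretation, the sketch below gives the posterior probability that the population is declining.

```r
# Posterior summary of the asymptotic growth rate as numbers rather than a figure
fullMatAnalysis$assymPopGrowthSummary
# Posterior probability that the asymptotic growth rate is below 1 (a declining population)
mean(fullMatAnalysis$assymPopGrowth < 1.0)
```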