1. Overview of Worked Example {-}

a) Goals and Background {-}

Goals: The goal of this worked example is for you to become familiar with creating and analyzing landscape genetic hypotheses using maximum likelihood population effect models (MLPE; Van Strien et al. 2012) and information theory. With this approach, we focus on evaluating evidence for the influence of the (intervening) landscape on link-based metrics (i.e., genetic distance). Incorporating node-based effects (e.g., pond size) will be covered in the gravity modeling lab.

We will re-analyze a dataset from Goldberg and Waits (2010) using a maximum likelihood population effect model (MLPE).

Here's a map of the study area. The lines represent those pair-wise links between sampling locations that we will be considering here (graph model: Delaunay triangulation).

[Figure: Map of study area]

b) Data set {-}

In previous labs, you created genetic distance matrices, so here we start with those data already prepared, as are the landscape data. For MLPE, two vectors identifying the nodes at the ends of each link are required. These are already included in the data set ('pop1' and 'pop2').

Each row in the file 'CSF_network.csv' represents a link between a pair of sampling locations. The columns contain the link identifiers ('pop1', 'pop2'), the response variable 'logDc.km', and the landscape variables analyzed below ('solarinso', 'forest', 'ag', 'shrub', 'grass', 'dev', 'slope', 'soils').

c) Required packages {-}

Install corMLPE from GitHub if it is not yet installed (this requires the remotes package).

if(!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
if(!requireNamespace("corMLPE", quietly = TRUE)) 
  remotes::install_github("nspope/corMLPE")
library(LandGenCourse)
library(ggplot2)
#library(nlme)
#library(usdm)
#library(corMLPE)

d) Import data {-}

First, read in the dataset. It is named CSF (Columbia spotted frog) to differentiate it from the RALU dataset you have used in previous labs (same species, very different landscapes).

CSFdata <- read.csv(system.file("extdata", "CSF_network.csv", 
                                        package = "LandGenCourse"))
head(CSFdata) 

The variable 'solarinso' is on a very different scale than the other predictors, which can create problems with model fitting for some methods. To avoid this, we scale it.

Note: we could call the scale function without specifying the arguments 'center' or 'scale', as both are 'TRUE' by default. The argument 'center=TRUE' centers the variable by subtracting its mean, and 'scale=TRUE' rescales it to unit variance by dividing each value by the standard deviation. The resulting variable has a mean of zero and a standard deviation (and variance) of one.

CSFdata$solarinsoz <- scale(CSFdata$solarinso, center=TRUE, scale=TRUE)
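
As a quick check (a minimal sketch), we can verify that scale did exactly what the note above describes:

with(CSFdata, all.equal(as.numeric(solarinsoz), 
                        (solarinso - mean(solarinso))/sd(solarinso)))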

Take a look at the column names and make sure you and R agree on what everything is called, and on the data types (e.g., factor vs. character).

str(CSFdata)

The response Y in our MLPE models will be 'logDc.km', the genetic distance per km of geographic distance, log transformed to meet normality assumptions. This was used in the paper as a way to include isolation by distance (IBD) as the base assumption without increasing the number of model parameters, k. The landscape variables were likewise quantified so that they would not necessarily increase with increasing distance between sites (e.g., proportion of forest between sites).
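
For orientation only, here is a sketch of how such a response could be constructed; 'Dc' and 'dist.km' are hypothetical names, not columns in CSFdata:

# logDc.km <- log(Dc / dist.km)  # genetic distance per km, log transformed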

This is a pruned network: the links connect only those pairs of locations that are neighbors in a Delaunay triangulation.

For this exercise, you will test 5 alternative hypotheses. At the end, you will be invited to create your own hypotheses from these variables and test support for them.

2. Fitting candidate models {-}

a) Alternative hypotheses {-}

In information theory, we evaluate evidence for a candidate set of hypotheses that we develop from the literature and theory. Although it is mathematically possible (and often practiced) to fit all possible combinations of input variables, that approach is not consistent with the methodology (see Burnham and Anderson 2002, Chapter 8, the reading for this unit, for more detail). Here, your "a priori" set of hypotheses (as in the Week 12 Conceptual Exercise) is as follows:

  1. Full model: solarinso, forest, ag, shrub, dev
  2. Landcover model: ag, shrub, forest, and dev
  3. Human footprint model: ag, dev
  4. Energy conservation model: slope, shrub, dev
  5. Historical model: soils, slope, solarinso

b) Check for multicollinearity {-}

Before we add predictor variables to a model, we should make sure that these variables do not have issues with multicollinearity, i.e., that they are not highly correlated among themselves. To this end, we calculate variance inflation factors, VIF.

Make a dataframe of just the variables to test (note, you cannot use factors here):

CSF.df <- with(CSFdata, data.frame(solarinsoz, forest, dev, shrub, ag))
usdm::vif(CSF.df)

If you get an error that there is no package called 'usdm', use install.packages("usdm").
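
To see what usdm::vif is doing: for each predictor, it regresses that variable on all the others and computes VIF = 1/(1 - R^2). Here is a minimal sketch for 'forest':

r2 <- summary(lm(forest ~ solarinsoz + dev + shrub + ag, data=CSF.df))$r.squared
1/(1 - r2)   # should match the VIF reported for 'forest' above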

VIF values below 10 (or 3, or 4, depending on who you ask) are generally taken to indicate that collinearity is not strong enough to affect model outcomes. So we are good to go here. What would happen if we added grass to this?

usdm::vif(cbind(CSF.df, grass=CSFdata$grass))

Question 1: Why would adding in the last land cover cause a high amount of collinearity? Consider how these data were calculated.

The variables 'forest', 'dev', 'shrub', 'ag' and 'grass' together cover almost all land cover types in the study area. Each is measured as percent area, and together they make up 100%, or close to 100%, of the land cover along each link. Thus, if we know that 'forest', 'dev', 'shrub' and 'ag' together cover 80% along a link, it is a good bet that 'grass' covers the remaining 20%.
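
We can check this directly by summing the five land cover variables for each link (a quick sketch, assuming they are percentages on a 0-100 scale, so the row sums should be close to 100):

summary(rowSums(CSFdata[, c("forest", "dev", "shrub", "ag", "grass")]))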

With such sets of compositional data, we need to omit one variable to avoid multicollinearity (where a variable is a near-linear combination of the other variables), even though the pairwise correlations may be low. For example, here the variable 'grass' shows relatively low pairwise correlations with the other variables:

cor(CSF.df, CSFdata$grass)

c) Fit all five models {-}

The R package corMLPE makes it easy to fit an MLPE model to our dataset. We use the function gls from package nlme to fit a "generalized least squares" model to the data.

This is similar to a spatial regression model we've seen in Week 7. The main difference is how we define the correlation structure of the errors: instead of using a function of the spatial coordinates, we use a function of the population indicators: correlation=corMLPE(form=~pop1+pop2).

Note: for linear mixed models, see Week 6 videos. For gls models, see Week 7 videos.

mod1 <- nlme::gls(logDc.km ~ solarinsoz + forest + ag + shrub + dev, 
                  correlation=corMLPE::corMLPE(form=~pop1+pop2), 
                  data=CSFdata, method="REML")

This is the closest we have to a full model in our candidate set (although our models are not completely nested), so we will take a look at its residuals.

plot(mod1, abline=c(0,0)) 

Residuals are centered around zero and do not show large groupings or patterns, although the variation increases at larger fitted values (heteroscedasticity: the plot thickens). Here, this means that the model has a more difficult time predicting larger genetic distances per km. We will keep that in mind as we move on.
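
One way to look at this more closely (a minimal sketch): plot the absolute residuals against the fitted values; an upward trend indicates heteroscedasticity.

plot(fitted(mod1), abs(residuals(mod1)), 
     xlab="Fitted values", ylab="|Residuals|")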

Next we will check to make sure the residuals are normally distributed. This is an important assumption for models fitted with (Restricted) Maximum Likelihood.

We'll check this with two plots:

par(mfrow=c(1,2))
hist(residuals(mod1)) 
qqnorm(residuals(mod1))
par(mfrow=c(1,1))

These residuals look okay, so we will move on (but if this were a publication, we would take a closer look at the tails of the distribution).

Recall that mod1 had the following formula: logDc.km ~ solarinsoz + forest + ag + shrub + dev.

Here we'll use the function update to change the list of predictors in the model (i.e., the right-hand side of the formula, starting with the tilde symbol ~).

mod2 <- update(mod1, ~ forest + ag + shrub + dev)
mod3 <- update(mod1, ~ ag + dev)
mod4 <- update(mod1, ~ slope + shrub + dev)
mod5 <- update(mod1, ~ soils + slope + solarinsoz)

d) Compare evidence for models {-}

Our goal here is to compare five models that all have the same random effects (the population effects modeled by the correlation structure) but differ in their fixed effects. Recall from the Week 6 videos: models that differ in their fixed effects should be compared with maximum likelihood (ML) estimates, not with REML estimates.

We can use the function update again, this time to change the fitting method from method="REML" to method="ML".

mod1noREML <- update(mod1, method="ML")
mod2noREML <- update(mod2, method="ML")
mod3noREML <- update(mod3, method="ML")
mod4noREML <- update(mod4, method="ML")
mod5noREML <- update(mod5, method="ML")

In information theory, we use information criteria (AIC, BIC) to evaluate the relative distance of each hypothesis from truth. For more information, see Burnham and Anderson (2002).

Note: package MuMIn will do this for you, but for this exercise it is useful to see all the pieces of what goes into your evaluation.
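
For example, we can reproduce R's AIC by hand from the log-likelihood, using AIC = -2*log(L) + 2*k:

ll <- logLik(mod1noREML)
-2*as.numeric(ll) + 2*attr(ll, "df")  # AIC computed by hand
AIC(mod1noREML)                       # should match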

Models <- list(Full=mod1noREML, Landcover=mod2noREML, HumanFootprint=mod3noREML, 
               EnergyConservation=mod4noREML, Historical=mod5noREML)
CSF.IC <- data.frame(AIC = sapply(Models, AIC),
                     BIC = sapply(Models, BIC)) 
CSF.IC

We now have some results, great!

Now we will work with these a bit. First, because we do not have an infinite number of samples, we will convert AIC to AICc (which adds a small-sample correction to AIC).

To do this, extract the number of estimated parameters, k, from each model and add it to the table.

CSF.IC <- data.frame(CSF.IC, k = sapply(Models, function(ls) attr(logLik(ls), "df")))
CSF.IC

Question 2: How does k relate to the number of parameters in each model?

Disclaimer: for link-based (pairwise) data, it is not obvious what the effective sample size is, because the links are not independent observations. Below we use the number of links, N = nrow(CSFdata); this is why the AICc calculations in this section are labeled experimental.

Calculate AICc and add it to the dataframe (experimental, see disclaimer above):

N <- nrow(CSFdata)  # number of links (unique pairs)
CSF.IC$AICc <- CSF.IC$AIC + 2*CSF.IC$k*(CSF.IC$k+1)/(N-CSF.IC$k-1)
CSF.IC

e) Calculate evidence weights {-}

Next we calculate evidence weights for each model based on AICc and BIC. If all assumptions are met, these can be interpreted as the probability that the model is the closest to truth in the candidate model set.

Calculate model weights for AICc (experimental, see disclaimer above):

AICcmin <- min(CSF.IC$AICc)
RL <- exp(-0.5*(CSF.IC$AICc - AICcmin))
sumRL <- sum(RL)
CSF.IC$AICcew <- RL/sumRL

Calculate model weights for BIC (experimental, see disclaimer above):

BICmin <- min(CSF.IC$BIC)
RL.B <- exp(-0.5*(CSF.IC$BIC - BICmin))
sumRL.B <- sum(RL.B)
CSF.IC$BICew <- RL.B/sumRL.B
round(CSF.IC,3)

Question 3: What did using AICc (rather than AIC) do to inference from these results?

In the multi-model inference approach, we can look across all the models in the candidate set and their evidence weights to better understand the system. Here we see that models 1 and 5 have very little evidence of being close to the truth, that models 2 and 4 have more, and that model 3 has the most evidence of being the closest to truth, although how much more evidence there is for model 3 over models 2 and 4 depends on which criterion you are looking at. BIC imposes a larger penalty on additional parameters, so it makes a sharper distinction between these models. Let's review what these models are:

  1. Full model: solarinso, forest, ag, shrub, dev
  2. Landcover model: ag, shrub, forest, and dev
  3. Human footprint model: ag, dev
  4. Energy conservation model: slope, shrub, dev
  5. Historical model: soils, slope, solarinso

f) Confidence intervals for predictors {-}

The summary function for a model fitted with "ML" provides a model summary that includes a table of t-tests for each fixed effect, which is a frequentist approach (see Week 6 Worked Example). Here we stay within the information-theoretic framework and use an alternative to significance testing to identify the most important predictors.
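
For reference, here is what that t-test table looks like for one of the ML fits (we will not base our inference on it):

summary(mod3noREML)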

According to Row et al. (2017), we can use the confidence interval from the variables in the models to identify those that are contributing to landscape resistance. Looking at models 2 and 3, we have some evidence that shrub and forest cover matter, but it's not as strong as the evidence for ag and development. Let's look at the parameter estimates to understand more. Note that for parameter estimation we use the REML estimates.

ModelsREML <- list(Full=mod1, Landcover=mod2, HumanFootprint=mod3,
                   EnergyConservation=mod4, Historical=mod5)

Now we'll estimate 95% confidence intervals for these parameter estimates:

confint(ModelsREML$Landcover, level = 0.95, method = "Wald")
confint(ModelsREML$HumanFootprint, level = 0.95, method = "Wald")
confint(ModelsREML$EnergyConservation, level = 0.95, method = "Wald")

In essence, we use the confidence intervals to assess whether we can differentiate the effect of each variable from zero (i.e., statistical significance). Variables whose confidence intervals overlap zero do not have as much evidence of being important to gene flow, based on the simulations in Row et al. (2017). So in this case, there is strong evidence that agriculture influences the rate of gene flow. Note that because the response is a genetic distance, negative coefficients indicate more gene flow.

Question 4: What evidence is there that shrub and forest matter to gene flow from these analyses? What about slope? Would you include them in a conclusion about landscape resistance in this system?

Question 5: What evidence is there that development influences gene flow from these analyses? Would you include it in a conclusion about landscape resistance in this system?

Question 6: We excluded grass from our models because of multicollinearity. How could we investigate the importance of this variable?

3. Your turn to test additional hypotheses! {-}

Create your own (small) set of hypotheses to rank for evidence using the dataset provided. Modify the code above to complete the following steps:

  1. Define your hypotheses.
  2. Calculate the VIF table for the full model.
  3. Make a residual plot for the full model.
  4. Create a table of AICc and BIC weights.
  5. Create confidence intervals from the models with high evidence weights.

What can you infer from your analysis about what influences gene flow of this species?

4. Nested model (NMLPE) for hierarchical sampling designs {-}

Jaffe et al. (2019) proposed nested MLPE (NMLPE) to account for an additional level of hierarchical sampling. In their example, they had sampled multiple sites (clusters), and within each site, multiple colonies (populations).

When they fitted an MLPE model that did not account for the hierarchical level of sites, there was unaccounted autocorrelation in the residuals. They tested for this by sorting the pairwise distances by site and then calculating an autocorrelation function (ACF) as a function of lag distance along rows (i.e., the lag represented how many rows apart two values were in the dataset). The following pseudo-code is based on code in the supplementary material of Jaffe et al. (2019).

Sort the data frame with the pairwise data by site, then by population (pseudo-code). Note: this is only needed for the ACF analysis.

# df <- df[order(df$site, df$pop1, df$pop2),]

MLPE model (pseudo-code):

# m1 <- gls(Genetic.distance ~ Predictor.distance, 
#           correlation = corMLPE(form = ~ pop1+pop2), data = df)
# acf(resid(m1, type='normalized')) ## Autocorrelation function

The package corMLPE contains another function, corNMLPE2, where we can specify clusters as an additional hierarchical level of non-independence in the data. With this correlation structure, the ACF no longer showed any autocorrelation.

NMLPE model (pseudo-code):

# m2 <- gls(Genetic.distance ~ Predictor.distance, 
#           correlation = corNMLPE2(form = ~ pop1+pop2, clusters = site), data = df)
# acf(resid(m2, type='normalized')) ## Autocorrelation function

Note that this is a method to account for hierarchical sampling designs. It does not explicitly account for spatial autocorrelation - indeed, no spatial information is included, unless Euclidean distance is included as a predictor in the model.

5. References {-}

LandGenCourse::detachAllPackages()

