In hhwagner1/LandGenCourse: Interface for course "Landscape Genetic Data Analysis with R"

1. Overview of Worked Example {-}

a) Goals {-}

This worked example shows:

How to test regression residuals for spatial autocorrelation.
How to fit a model with spatially autocorrelated errors (GLS).
How to fit a spatial simultaneous autoregressive error model (SAR).
How to perform spatial filtering with Moran eigenvector maps (MEM).
How to fit a spatially varying coefficients model (SVC).

b) Data set {-}

Here we analyze population-level data of the wildflower Dianthus carthusianorum (common name: Carthusian pink) in 65 calcareous grassland patches in the Franconian Jura, Germany (Rico et al. 2013):

Dianthus: 'sf' object with population-level data (patch characteristics, grazing regime, genetic diversity, 15 alternative connectivity indices Si) for sampling locations, included in package 'LandGenCourse'. To load the data, type (without quotes): 'data(Dianthus)'. For a definition of the variables, type: '?Dianthus'.

c) Required R packages {-}

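Note: both 'library' and 'require' attach a package, and neither reloads a package that is already attached. They differ in how they handle a package that is not installed: 'library' stops with an error, whereas 'require' returns FALSE with a warning. Either will work here.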

library(LandGenCourse)
#library(here)
#library(spdep)
library(nlme)
#library(lattice)
#library(MuMIn)
#library(gridExtra)
library(dplyr)
library(spatialreg)
library(ggplot2)
library(tmap)
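library(sf)  # attached because st_coordinates, st_drop_geometry and st_as_sf are called without the 'sf::' prefix below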
source(system.file("extdata", "panel.cor.r", 
                            package = "LandGenCourse"))

Package 'spmoran' not automatically installed with 'LandGenCourse':

if(!require(spmoran)) install.packages("spmoran", repos='http://cran.us.r-project.org')
#require(spmoran)

2. Explore data set {-}

We will model allelic richness 'A' as a function of the following predictors:

IBD: connectivity index Si ('Eu_pj') based on Euclidean distance between source and focal patch. This represents a hypothesis of isolation by distance (IBD).
IBR: connectivity index Si ('Sheint_pj') based on the number of continuously or intermittently grazed patches between source and focal patch. This represents a hypothesis of isolation by resistance (IBR). Specifically, this model assumes connectivity via sheep-mediated seed dispersal, where seeds are likely to be transported from patch to patch within the same grazing system (shepherding route). Seeds are assumed to disperse most likely to the next patch (in either direction) along the grazing route, and less likely to more remote patches along the route.
PatchSize: Logarithm of calcareous grassland patch size in ha.

Bonus Materials: The connectivity indices Si were calculated for each focal patch i, integrating over all other patches j where the species was present (potential source patches) using Hanski's incidence function. See the Week 7 Bonus Material for how this was done!
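As a rough illustration of the idea (not the code used to derive the indices in the data set), a connectivity index of this type can be sketched in base R as a distance-weighted sum over source patches, Si = sum over j != i of exp(-alpha * dij) * pj. The coordinates, the presence vector 'p.toy' and the value of 'alpha' below are made up for illustration.

```r
# Hypothetical patch coordinates (km) and species presence (1) / absence (0)
xy.toy <- cbind(x = c(0, 1, 2, 5), y = c(0, 0, 1, 5))
p.toy  <- c(1, 1, 0, 1)
alpha  <- 0.5                    # assumed distance-decay parameter

d <- as.matrix(dist(xy.toy))     # pairwise Euclidean distances d_ij
K <- exp(-alpha * d)             # negative exponential dispersal kernel
diag(K) <- 0                     # a focal patch is not its own source
Si <- as.vector(K %*% p.toy)     # sum over potential source patches j
Si
```

Replacing 'p.toy' by patch area or population size of the source patches would give the 'Aj' and 'Nj' variants of the index under the same assumptions.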

a) Import data {-}

data(Dianthus)

Allelic richness 'A' was not calculated for populations with < 5 individuals. Here we extract only the patches with 'A' values, and the variables needed, and store them in a data frame 'Dianthus.df'.

b) Create a map {-}

With tmap (see Week 3 Worked Example), mapping the points is easy. Here, we use color to indicate allelic richness.

Note that tmap internally converts the CRS to lat/long to plot the points on a basemap from the internet. See here for a list of available basemaps: https://leaflet-extras.github.io/leaflet-providers/preview/

tmap_mode("view")
tm_basemap(c("Esri.WorldTopoMap", "Esri.WorldStreetMap", "Esri.WorldShadedRelief")) +
tm_shape(Dianthus) + tm_sf(col="A")

Toggle between the basemaps to visualize the topographic relief and forest cover.

As you can see from the shaded relief, most sites lie on the steep slopes between an upper and a lower Jurassic plateau. A few sites lie at the forest edge on the upper plateau, typically in areas where the soil is too shallow to allow crop farming. Within the study area, all known sites were sampled. Additional sites are expected to be found mainly in the valley system in the Southwest.

c) Explore correlations {-}

When fitting linear models, it is always a good idea to look at the correlations first.

Dianthus.df <- data.frame(A=Dianthus$A, IBD=Dianthus$Eu_pj, 
                          IBR=Dianthus$Sheint_pj,
                          PatchSize=log(Dianthus$Ha),
                          System=Dianthus$System,
                          Longitude=Dianthus$Longitude,
                          Latitude=Dianthus$Latitude,
                          st_coordinates(Dianthus))

# Define 'System' for ungrazed patches
Dianthus.df$System=as.character(Dianthus$System)
Dianthus.df$System[is.na(Dianthus.df$System)] <- "Ungrazed"
Dianthus.df$System <- factor(Dianthus.df$System, 
                             levels=c("Ungrazed", "East", "South", "West"))

# Remove patches with missing values for A
Dianthus.df <- Dianthus.df[!is.na(Dianthus.df$A),]
dim(Dianthus.df)

Create a scatterplot matrix. The variables are plotted against each other and labeled along the diagonal. You will find histograms on the diagonal, scatterplots and a smooth line in the lower triangle, and the linear correlation coefficient r (with p-value) in the upper triangle. Stronger correlations are indicated with a larger font. You may ignore any warnings about graphical parameters.

graphics::pairs(Dianthus.df[,-c(5:7)], lower.panel=panel.smooth, 
      upper.panel=panel.cor, diag.panel=panel.hist)

Questions:

How strong is the linear relationship between 'Eu_pj' and 'A'? What does this suggest about the hypothesis of IBD?
How strong is the linear relationship between 'Sheint_pj' and 'A'? What does this suggest about the hypothesis of sheep-mediated gene flow (IBR)?
Which variable seems to be a better predictor of allelic richness: patch size 'Ha' or the logarithm of patch size, 'PatchSize'?
Is 'PatchSize' (the logarithm of patch size) correlated with 'IBD' or 'IBR'?
Are any of the variables correlated with the spatial coordinates X and Y?

Do the three grazing systems, and the ungrazed patches, differ in allelic richness A? Also, let's check the association between patch size and population size. Here we create boxplots that show the individual values as dots. We add a horizontal jitter to avoid overlapping points.

Boxplot1 <- ggplot(Dianthus, aes(x=System, y=A)) + 
  geom_boxplot() + xlab("Grazing system") + ylab("Allelic richness (A)") +
  geom_jitter(shape=1, position=position_jitter(0.1), col="blue")

Boxplot2 <- ggplot(Dianthus, aes(x=factor(pop09), y=log(Ha))) + 
  geom_boxplot() + xlab("Population size class") + ylab("PatchSize (log(Ha))") +
  geom_jitter(shape=1, position=position_jitter(0.1), col="blue")

gridExtra::grid.arrange(Boxplot1, Boxplot2, nrow=1)

Even though the population size categories were very broad, there appears to be a strong relationship between population size (category) and (the logarithm of) patch size.

Despite this relationship, connectivity models Si that only considered Dianthus carthusianorum presence/absence ('pj') in source patches 'j' were better supported than those Si models that took into account source patch area ('Aj') or population size ('Nj').

We can check this by calculating the correlation of allelic richness 'A' with each of the 15 connectivity models 'Si' in the data set.

round(matrix(cor(Dianthus$A, st_drop_geometry(Dianthus)[,15:29], 
                 use="pairwise.complete.obs"), 5, 3, byrow=TRUE, 
           dimnames=list(c("Eu", "Shecte", "Sheint", "Shenu", "Forest"), 
                         c("pj", "Aj", "Nj"))),3)

Correlations with 'A' are highest for the two 'IBR' models that assume seed dispersal over a limited number of patches along shepherding routes ('Shecte' and 'Sheint'). These two models include only continuously grazed, or both continuously and intermittently grazed patches, respectively.
Correlations for models that take into account population size ('Nj') are only slightly lower, whereas those that use patch size ('Aj') as a proxy for the size of the seed emigrant pool had lower correlations.

3. Test regression residuals for spatial autocorrelation {-}

a) Fit regression models {-}

Here we fit three multiple regression models to explain variation in allelic richness:

mod.lm.IBD: IBD model of connectivity 'Eu_pj'.
mod.lm.IBR: IBR model of shepherding connectivity 'Sheint_pj'.
mod.lm.PatchSize: log patch size and IBR model.

mod.lm.IBD <- lm(A ~ IBD, data = Dianthus.df)
summary(mod.lm.IBD)

This model does not fit the data at all!

mod.lm.IBR <- lm(A ~ IBR, data = Dianthus.df)
summary(mod.lm.IBR)

This model fits much better. Let's check the residuals plots.

par(mfrow=c(2,2), mar=c(4,4,2,1))
plot(mod.lm.IBR)
par(mfrow=c(1,1))

The residuals show some deviation from a normal distribution. Specifically, the lowest values are lower than expected.

mod.lm.PatchSize <- lm(A ~ PatchSize + IBR, data = Dianthus.df)
summary(mod.lm.PatchSize)

This combined model explains more variation in allelic richness than the IBR model alone. Moreover, after adding PatchSize, the IBR term is no longer statistically significant!

Has the distribution of residuals improved as well?

par(mfrow=c(2,2), mar=c(4,4,2,1))
plot(mod.lm.PatchSize)
par(mfrow=c(1,1))

Not really!

b) Test for spatial autocorrelation (Moran's I) {-}

Before we interpret the models, let's check whether the assumption of independent residuals is violated by spatial autocorrelation in the residuals.

To calculate and test Moran's I, we first need to define neighbours and spatial weights. Here we use a Gabriel graph to define neighbours.

We define weights in three ways (see Week 5 video and tutorial for explanation of code):

listw.gab: 1 = neighbour, 0 = not a neighbour.
listw.d1: inverse distance weights: neighbour j with weight 1/dij
listw.d2: inverse squared distance weights: neighbour j with weight 1/dij^2

In each case, we row-standardize the weights with the option 'style = "W"'.

Note: when using 'graph2nb', make sure to use the argument 'sym=TRUE'. This means that if A is a neighbour of B, B is also a neighbour of A. The default is 'sym=FALSE', which may result in some sites not having any neighbours assigned (though this would not be evident from the figure!).

xy <- data.matrix(Dianthus.df[,c("X", "Y")])
nb.gab <- spdep::graph2nb(spdep::gabrielneigh(xy), sym=TRUE)
par(mar=c(0,0,0,0))
plot(nb.gab, xy)
listw.gab <- spdep::nb2listw(nb.gab)

dlist <- spdep::nbdists(nb.gab, xy)   # distances d_ij between neighbours
# Compute both sets of weights from the original distances in 'dlist'
dlist.d1 <- lapply(dlist, function(x) 1/x)
listw.d1 <- spdep::nb2listw(nb.gab, style = "W", glist=dlist.d1)
dlist.d2 <- lapply(dlist, function(x) 1/x^2)
listw.d2 <- spdep::nb2listw(nb.gab, style = "W", glist=dlist.d2)

Now we can quantify and test Moran's I for each variable to test for spatial autocorrelation in response and predictor variables. For now, we'll take the simple weights 'listw.gab'.

Allelic richness A:

spdep::moran.test(Dianthus.df$A, listw.gab)

IBD:

spdep::moran.test(Dianthus.df$IBD, listw.gab)

IBR:

spdep::moran.test(Dianthus.df$IBR, listw.gab)

PatchSize:

spdep::moran.test(Dianthus.df$PatchSize, listw.gab)
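To see what is being calculated, here is a minimal base R sketch of the Moran's I statistic itself, I = (n / S0) * sum(wij * zi * zj) / sum(zi^2), where z are the centered values and S0 is the sum of all weights. The toy values, the chain of five sites and the helper 'moran_I' are hypothetical and only serve as illustration; the significance test performed by 'moran.test' is not reproduced here.

```r
# Toy data: 5 sites along a line with a smooth gradient (hypothetical values)
x.toy <- c(1, 2, 3, 4, 5)

# Binary neighbour matrix: each site neighbours the adjacent site(s)
W <- matrix(0, 5, 5)
for (i in 1:4) {
  W[i, i + 1] <- 1
  W[i + 1, i] <- 1
}
W <- W / rowSums(W)              # row-standardize, as with style = "W"

# Hypothetical helper: Moran's I for values x and a weights matrix W
moran_I <- function(x, W) {
  z <- x - mean(x)               # centered values
  n <- length(x)
  (n / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

I.obs <- moran_I(x.toy, W)       # observed Moran's I
I.exp <- -1 / (length(x.toy) - 1)  # expected value without autocorrelation
c(observed = I.obs, expected = I.exp)
```

Neighbouring sites have similar values in this toy gradient, so the observed value lies well above the expected value of -1/(n - 1).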

Questions:

Which variables showed statistically significant spatial autocorrelation?
Which variables showed the strongest autocorrelation? Is this surprising?

Next, let's test each model for autocorrelation in the residuals:

IBD:

spdep::lm.morantest(mod.lm.IBD, listw.gab)

IBR:

spdep::lm.morantest(mod.lm.IBR, listw.gab)

PatchSize:

spdep::lm.morantest(mod.lm.PatchSize, listw.gab)

Quite a bit of the spatial autocorrelation in allelic richness can be explained by the spatial structure in the predictors IBR and PatchSize. There is still statistically significant spatial autocorrelation in the residuals, though it is not strong any more.

4. Fit models with spatially correlated error (GLS) with package 'nlme' {-}

One way to account for spatial autocorrelation in the residuals is to fit a Generalized Least Squares model (GLS) with a spatially autocorrelated error structure. See also: http://rfunctions.blogspot.ca/2017/06/how-to-identify-and-remove-spatial.html

a) Plot empirical variogram {-}

The error structure in a GLS is defined in a geostatistical framework, based on a variogram and as a function of distance between observations. Hence we start with plotting an empirical variogram of the residuals, with a smooth line. Here we specify 'resType = "normalized"', which means that the variogram will be calculated from the normalized residuals of the model.

The expected value of the semivariance will be 1. Hence it would make sense to add a horizontal line at 1. However, this is cumbersome with the trellis graphics (using package 'lattice') used by 'nlme'.

model.lm <- nlme::gls(A ~ IBR + PatchSize, data = Dianthus.df, method="REML")
semivario <- nlme::Variogram(model.lm, form = ~X  + Y, resType = "normalized")

If you want to create your own figure, e.g. with 'ggplot2', you can access the values stored in the data frame 'semivario' to plot the points, and add a smooth line yourself. Then we can add a horizontal line with 'geom_hline'.

ggplot(data=semivario, aes(x=dist, y=variog)) + geom_point() + geom_smooth(se=FALSE) +
  geom_hline(yintercept=1) + ylim(c(0,1.3)) + xlab("Distance") + ylab("Semivariance")
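To see why the semivariance of standardized residuals is expected to be 1, here is a base R sketch of an empirical semivariogram, using the classical estimator (half the mean squared difference between pairs, binned by distance). The simulated coordinates and residuals are hypothetical and have no spatial structure; the binning is an arbitrary choice and not necessarily the one used by 'nlme::Variogram'.

```r
set.seed(1)
n <- 50
xy.sim  <- cbind(X = runif(n), Y = runif(n))   # hypothetical coordinates
res.sim <- as.numeric(scale(rnorm(n)))         # residuals with mean 0, variance 1

d <- as.vector(dist(xy.sim))                   # distance for each pair of sites
g <- as.vector(dist(res.sim))^2 / 2            # half squared difference per pair

# Average distance and semivariance within 10 distance classes
bins <- cut(d, breaks = seq(0, max(d), length.out = 11), include.lowest = TRUE)
semivar.sim <- data.frame(dist    = as.vector(tapply(d, bins, mean)),
                          variog  = as.vector(tapply(g, bins, mean)),
                          n.pairs = as.vector(table(bins)))
semivar.sim

mean(g)   # averaged over all pairs, the semivariance equals the variance (here 1)
```

Without spatial autocorrelation, the binned values scatter around 1 at all distances. With positive autocorrelation, values at short distances would fall below 1.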

Question:

What do you conclude from this empirical variogram?
Estimate the range of the variogram from the intersection of the smooth line with the horizontal line.
Estimate the nugget effect from the intercept at Distance = 0.

b) Fit variogram models {-}

We can ask R to fit different types of variogram models to this empirical variogram. The model family (e.g., exponential, gaussian, spherical) determines the general shape of the curve that will be fitted. With 'nugget=T', we indicate that a nugget effect should be fitted.

Note: Here we want to compare mixed models with the same fixed effects but different random effect structures defined by correlation. For this, we use REML. If we wanted to compare models with the same random effects but different fixed effects (as in Week 6), we should use maximum likelihood.

With function lme4::lmer, we can set REML=TRUE for REML and REML=FALSE for ML.
Here with nlme::gls, we set method="REML" for REML and method="ML" for ML.
For update to work here, you'll need to load the library nlme (or use nlme:::update.lme, with three colons, to access a function 'update' for 'lme' objects that is not 'exported' from the package).

model.lm <- nlme::gls(A ~ IBR + PatchSize, data = Dianthus.df, method="REML")

mod.corExp <- update(model.lm, correlation = nlme::corExp(form = ~ X + Y, nugget=T))
mod.corGaus <- update(model.lm, correlation = nlme::corGaus(form = ~ X + Y, nugget=T))
mod.corSpher <- update(model.lm, correlation = nlme::corSpher(form = ~ X + Y, nugget=T))
mod.corRatio <- update(model.lm, correlation = nlme::corRatio(form = ~ X + Y, nugget=T))
#mod.corLin <- update(model.lm, correlation = nlme::corLin(form = ~ X + Y, nugget=T))
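The shapes of these model families can be sketched in base R. The functions below assume the parameterization described for 'nlme' correlation structures (semivariance scaled to a sill of 1, with the nugget as the limit for distances approaching 0); the function names and the parameter values are hypothetical and chosen for illustration only.

```r
# Semivariogram model with nugget: nugget + (1 - nugget) * gamma(h, range)
variog.exp  <- function(h, range, nugget = 0) {
  nugget + (1 - nugget) * (1 - exp(-h / range))       # exponential
}
variog.gaus <- function(h, range, nugget = 0) {
  nugget + (1 - nugget) * (1 - exp(-(h / range)^2))   # Gaussian
}

h <- seq(0, 10, length.out = 101)              # distances (arbitrary units)
g.exp  <- variog.exp(h,  range = 2, nugget = 0.2)
g.gaus <- variog.gaus(h, range = 2, nugget = 0.2)

# Compare the two curves at a few distances
data.frame(dist = h, exponential = g.exp, gaussian = g.gaus)[c(1, 11, 21, 51, 101), ]
```

Both curves start at the nugget and approach 1, but the Gaussian model stays flat near the origin and then rises more steeply than the exponential model.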

c) Select best-fitting model {-}

Now we compare all models for which we did not get an error message:

MuMIn::model.sel(model.lm, mod.corExp, mod.corGaus, mod.corSpher, mod.corRatio)

The list sorts the models, with the best model on top. The last column 'weight' contains the model weight, which indicates how much support there is for each model, given all other models in the set (see Week 12). Here, the exponential model fitted best, though the ratio model and the model without a spatially correlated error structure fitted the data almost equally well. The top three models have delta values within 2 (in fact, close to 0).
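For reference, the 'delta' and 'weight' columns can be derived from the AICc values alone: delta is the difference from the lowest AICc, and the Akaike weight is exp(-delta/2), rescaled to sum to 1 across the model set. Here is a sketch with made-up AICc values (not the values from the models above):

```r
# Hypothetical AICc values for four candidate models
aicc <- c(mod.A = 100.0, mod.B = 100.4, mod.C = 101.0, mod.D = 106.0)

delta  <- aicc - min(aicc)                         # difference to the best model
weight <- exp(-delta / 2) / sum(exp(-delta / 2))   # Akaike weights

round(data.frame(AICc = aicc, delta = delta, weight = weight), 3)
```

Models with similar AICc values share the weight almost equally, which is why small delta values indicate that no single model is clearly preferred.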

We refit the best model with maximum likelihood to test the fixed effects.

mod.corExp.ML <- nlme::gls( A ~ PatchSize + IBR, data = Dianthus.df, method="ML",
                            correlation = nlme::corExp(form = ~ X + Y, nugget=T))
car::Anova(mod.corExp.ML)

The fitted model with the exponential error structure shows a significant effect for PatchSize but not for the IBR term.

We don't get an R-squared value directly, but we can calculate a pseudo R-squared from a regression of the response 'A' on the fitted values (using the model fitted with REML). Let's compare it to the R-squared from the lm model.

summary(lm(A ~ fitted(mod.corExp), data = Dianthus.df))$r.squared
summary(mod.lm.PatchSize)$r.squared

The pseudo R-squared is almost identical to the R-squared of the non-spatial lm model.
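Note that this pseudo R-squared is simply the squared correlation between the observed and the fitted values, whatever model produced the fitted values. A small sketch with simulated data (all names and values are hypothetical):

```r
set.seed(1)
x.sim <- rnorm(30)
y.sim <- 1 + 0.5 * x.sim + rnorm(30)     # simulated response
y.hat <- 0.8 + 0.6 * x.sim               # "fitted values" from some model

r2.lm  <- summary(lm(y.sim ~ y.hat))$r.squared   # regression on fitted values
r2.cor <- cor(y.sim, y.hat)^2                    # squared correlation
c(r2.lm = r2.lm, r2.cor = r2.cor)
```

The two numbers agree, so this measure captures how well the fitted values track the response, but it does not penalize bias or model complexity.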

Let's check the residual plots:

predictmeans::residplot(mod.corExp)

The normal probability plot still looks about the same.

Note that the function residplot recognized that we have a gls model and added a plot of the auto-correlation function, ACF. Here we have a value of 1 for the distance lag 0, which is the comparison of each value with itself. All other values are low. Recall that the autocorrelation is inversely related to the semivariogram (which does not report the value for lag = 0):

semivario <- nlme::Variogram(mod.corExp, form = ~ X + Y, 
                             resType = "normalized")
plot(semivario, smooth = TRUE)

The variogram of the residuals (after accounting for spatial autocorrelation as modeled by the variogram model) does look better!

d) Plot fitted variogram model {-}

How can we plot the fitted variogram? Let's first store it in an object 'Fitted.variog', then plot it. Note that the fitted variogram itself has two classes, "Variogram" and "data.frame". The plot created by plot(Fitted.variog) is a "trellis" object.

Fitted.variog <- nlme::Variogram(mod.corExp)
class(Fitted.variog)
class(plot(Fitted.variog))
plot(Fitted.variog)

That was easy. However, trellis plots are difficult to tweak, and we may want to create our own plot with ggplot2. For this, we need to access the fitted variogram values (i.e. the exponential model curve values), which is a bit more involved.

The object 'Fitted.variog' is a data frame (S3) with additional attributes. This raises a challenge, because $ applied to a data frame returns its columns, so the additional attributes cannot be reached with $. They need to be accessed with the function attr instead.

If we just print Fitted.variog, we only see the data frame.

head(Fitted.variog)

We can see the attributes listed by using str:

str(Fitted.variog)

The line we are looking for is: - attr(*, "modelVariog"). The attribute modelVariog has 50 rows (obs.) and 2 variables: $variog and $dist. These are the fitted values (i.e., values of the exponential variogram model for 50 distance values).

The notation attr(*, "modelVariog") is a cryptic way of telling us how to access the attribute: use the function attr and provide two arguments: the object name Fitted.variog (represented by the asterisk), and the name of the attribute, in quotes: attr(Fitted.variog, "modelVariog").

tibble::as_tibble(attr(Fitted.variog, "modelVariog"))

This is useful to know if you want to create your own figures, e.g. with ggplot2.

ggplot(data=Fitted.variog, aes(x=dist, y=variog)) + geom_point() + 
  ylim(c(0,1.3)) + xlab("Distance") + ylab("Semivariance") + 
  geom_line(data=attr(Fitted.variog, "modelVariog"), aes(x=dist, y=variog), color="blue") +
  geom_hline(yintercept=1,linetype="dashed")
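The distinction between columns and attributes of a data frame can be tried out on a toy object in base R. The object 'df.toy' and its values are hypothetical; only the attribute name mimics the one used above.

```r
# A small data frame with an extra attribute attached to it
df.toy <- data.frame(dist = c(1, 2, 3), variog = c(0.4, 0.7, 0.9))
attr(df.toy, "modelVariog") <- data.frame(dist   = c(0.5, 1.5),
                                          variog = c(0.3, 0.6))

names(df.toy)                  # lists the columns only
df.toy$modelVariog             # NULL: '$' looks for a column of that name
attr(df.toy, "modelVariog")    # returns the attached data frame
names(attributes(df.toy))      # all attributes, including 'modelVariog'
```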

e) Add random factor {-}

The package nlme allows us also to include random factors. Here we add System as a random factor and test whether this would improve the model fit.

Instead of function nlme::gls, we use the function nlme::lme.
In nlme, random effects are specified differently from lme4::lmer (Week 6): random = ~ 1 | System.
The correlation structure is specified exactly as with gls.

mod.lme.corExp <- nlme::lme( A ~ PatchSize + IBR, 
                             random = ~ 1 | System, data = Dianthus.df, 
                            correlation = nlme::corExp(form = ~ X + Y, nugget=T),
                            method="REML")
summary(mod.lme.corExp)

The nature of the results did not change: PatchSize is still significant but IBR is not.

As in Week 6, we can obtain marginal (fixed effects) and conditional R-squared values (fixed + random):

MuMIn::r.squaredGLMM(mod.lme.corExp)

Now we can include this model in the model comparison from above. Notes:

As discussed in the Week 6 video, we should fit the mixed model with maximum likelihood (method = "ML") to test fixed effects and to compare its AIC to the other models.
The Week 6 video followed the philosophy that random effects should only be fitted for factors with >5 levels, whereas here, we are using a factor with 4 levels.

MuMIn::model.sel(model.lm, mod.corExp, mod.corRatio, mod.lme.corExp)

Questions:

How many degrees of freedom (df) were used for the random effect?
How can you see from the table that System was fitted as a random effect, and what method was used (REML vs. ML)?
Was the model with the random effect ranked higher than the model without it?
Compare the estimates of the slope coefficients for IBR between the models. How did accounting for spatial autocorrelation affect the slope coefficient, compared to model.lm? How large was the difference due to using different variogram models? And how much of a difference was related to including the random effect?
How about the slope estimate for PatchSize?

5. Fit spatial simultaneous autoregressive error models (SAR) {-}

An alternative way to account for spatial autocorrelation in the residuals is spatial regression with a simultaneous autoregressive error model (SAR).

a) Fit and compare alternative SAR models {-}

The method errorsarlm fits a simultaneous autoregressive model ('sar') to the error ('error') term of a 'lm' model.
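In a SAR error model, the response is modeled as y = X * beta + u, where the error term depends on the errors at neighbouring sites: u = lambda * W * u + e, with spatial weights matrix W and independent errors e. The following base R sketch simulates such errors for a toy set of sites arranged in a ring; the number of sites, the weights, the value of 'lambda' and the regression coefficients are assumptions made for illustration.

```r
set.seed(1)
n <- 10
# Toy weights: sites arranged in a ring, each with two neighbours
W <- matrix(0, n, n)
for (i in 1:n) {
  W[i, i %% n + 1] <- 1          # next site along the ring
  W[i, (i - 2) %% n + 1] <- 1    # previous site along the ring
}
W <- W / rowSums(W)              # row-standardized weights

lambda <- 0.6                    # assumed strength of autocorrelation
e <- rnorm(n)                    # independent errors
u <- solve(diag(n) - lambda * W, e)   # solves u = lambda * W u + e

x.sim <- rnorm(n)                # a predictor
y.sim <- 2 + 0.5 * x.sim + u     # response with spatially correlated errors
```

Fitting the SAR error model amounts to estimating 'lambda' together with the regression coefficients, given the weights W.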

This approach is based on spatial neighbours and weights. We have already defined them in three versions of a listw object. Let's see which one fits the data best. First, we fit the three models:

mod.sar.IBR.gab <- spatialreg::errorsarlm(A ~ PatchSize + IBR, data = Dianthus.df, 
                                 listw = listw.gab)
mod.sar.IBR.d1 <- spatialreg::errorsarlm(A ~ PatchSize + IBR, data = Dianthus.df, 
                                 listw = listw.d1)
mod.sar.IBR.d2 <- spatialreg::errorsarlm(A ~ PatchSize + IBR, data = Dianthus.df, 
                                 listw = listw.d2)

Due to some issues when using model.sel with these objects, here we manually compile AICc and delta values and sort the models by delta:

#MuMIn::model.sel(mod.lm.IBR, mod.sar.IBR.gab, mod.sar.IBR.d1, mod.sar.IBR.d2) 

Models <- list(mod.lm.IBR=mod.lm.IBR, mod.sar.IBR.gab=mod.sar.IBR.gab, 
               mod.sar.IBR.d1=mod.sar.IBR.d1, mod.sar.IBR.d2=mod.sar.IBR.d2)
data.frame(AICc = sapply(Models, MuMIn::AICc)) %>% 
  mutate(delta = AICc - min(AICc)) %>%
  arrange(delta)

The best model ('mod.sar.IBR.d1') is the one with (row-standardized) inverse-distance weights ('listw.d1'). It is only slightly better than the model with the (row-standardized) binary weights ('listw.gab'), whereas the nonspatial model and the one with (row-standardized) inverse squared distance weights have much less support.

b) Interpret best-fitting SAR model {-}

Let's have a look at the best model. With the argument Nagelkerke = TRUE, we request a pseudo R-squared.

summary(mod.sar.IBR.d1, Nagelkerke = TRUE)

Again, PatchSize is significant but not IBR.
The section starting with 'Lambda' summarizes the fitted spatial autocorrelation term. It is not statistically significant (p-value = 0.1039 for the Likelihood Ratio test LR).

6. Spatial filtering with MEM using package 'spmoran' {-}

See tutorial for 'spmoran' package: https://arxiv.org/ftp/arxiv/papers/1703/1703.04467.pdf

Both GLS and SAR fitted a spatially correlated error structure of a relatively simple form to the data. Gene flow could be more complex and could, for example, create spatial autocorrelation structure that is not the same in all directions or in all parts of the study area. Moran Eigenvector Maps (MEM) allow a more flexible modeling of spatial structure in the data. In spatial filtering, we use MEM spatial eigenvectors to account for any spatial structure while fitting and testing the effect of our predictors.

a) Default method {-}

The new package spmoran makes this really easy. First, we create the MEM spatial eigenvectors. This implies defining neighbors and weights, but this is well hidden in the code below. The function meigen here takes the coordinates, calculates a minimum spanning tree (so that each site has at least one neighbour), and finds the maximum distance 'h' from the spanning tree. It then calculates neighbor weights as exp(-dij / h).

Note: if you have many sites (> 200), the function meigen_f may be used instead of meigen. It should even work for >1000 sites.

The function esf then performs the spatial filtering. Here it uses stepwise selection of MEM spatial eigenvectors using an R-squared criterion (fn = "r2").

# lm model: using truncated distance matrix (max of min spanning tree distance)
meig <- spmoran::meigen(coords=xy)
sfd.res <- spmoran::esf( y=Dianthus.df$A, x=Dianthus.df[,c("PatchSize", "IBR")],
                       meig=meig, fn = "r2" )
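To get a feeling for what happens inside 'meigen', here is a rough base R sketch of how MEM spatial eigenvectors can be derived: build a connectivity matrix from the distances, double-center it, and keep the eigenvectors with positive eigenvalues. This is a simplified illustration, not the package's code: the coordinates are simulated, and the distance parameter 'h' is set to an arbitrary value instead of being derived from a minimum spanning tree.

```r
set.seed(42)
n <- 20
xy.sim <- cbind(runif(n), runif(n))    # hypothetical site coordinates

D <- as.matrix(dist(xy.sim))           # pairwise distances d_ij
h <- 0.3                               # assumed value (not derived from a spanning tree)
C <- exp(-D / h)                       # connectivity weights exp(-d_ij / h)
diag(C) <- 0                           # no self-connections

M <- diag(n) - matrix(1 / n, n, n)     # centering matrix
eig <- eigen(M %*% C %*% M, symmetric = TRUE)

keep <- eig$values > 1e-8              # positive eigenvalues: positive autocorrelation
E <- eig$vectors[, keep, drop = FALSE] # retained spatial eigenvectors
ev <- eig$values[keep]
dim(E)
```

The retained eigenvectors are mutually uncorrelated spatial patterns, ordered from broad-scale (large eigenvalue) to fine-scale (small eigenvalue), that can be used as predictors in a regression.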

The objects created by functions 'meigen' and 'esf' contain a lot of information:

meig: a list returned by function 'meigen', with the following elements:
- sf: Matrix of retained spatial eigenvectors.
- ev: Eigenvalues of retained spatial eigenvectors.
- ev_full: All (n - 1) eigenvalues.
sfd.res: a list returned by function 'esf', with the following elements:
- b: Table with regression results for predictors X.
- r: Table with regression results for the selected MEM spatial eigenvectors (based on step-wise eigenvector selection).
- e: Summary statistics for the entire model.
- vif: Variance inflation factors.
- sf: Fitted spatially dependent component (i.e., fitted value based on significant MEM spatial eigenvectors)
- pred: Fitted values.
- resid: Residuals.

Let's look at the table 'b' with regression results for the predictors first:

sfd.res$b

Again, PatchSize is statistically significant but not IBR.

Next, we look at the table 'r' with regression results for MEM spatial eigenvectors:

sfd.res$r

Five MEM spatial eigenvectors were important enough to be included in the model. Here they are ranked by their (absolute value of) slope coefficient, and thus by the strength of their association with the response variable. Eigenvector 'sf6' was by far the most important.

Note: some eigenvectors are included despite having a p-value > 0.05. This may have two reasons. First, the eigenvectors were selected without taking into account predictors X. Second, a different test was used in the stepwise eigenvector selection. The type of test can be specified with an argument fn (see '?esf' helpfile and 'spmoran' tutorial).

Finally, let's look at the summary results for the fitted model:

sfd.res$e

Here, adjR2 is rather high (0.437), but this includes the selected MEM spatial eigenvectors!

b) Using a custom connectivity matrix {-}

We know already that listw.d1 fit the data well, so let's re-run the model with our own definition of spatial weights. With the function 'listw2mat', we convert from listw format to a full connectivity matrix.

cmat.d1    <- spdep::listw2mat( listw.d1) 
meigw  <- spmoran::meigen( cmat = cmat.d1 )
sfw.res <- spmoran::esf( y=Dianthus.df$A, x=Dianthus.df[,c("PatchSize", "IBR")],
                       meig=meigw, fn = "r2" )
sfw.res$b
tibble::as_tibble(sfw.res$r)
sfw.res$e

Note: the messages tell us that 'cmat' has been made symmetric before analysis, that 27 out of 59 MEM spatial eigenvectors (and their eigenvalues, hence 'pairs') were retained initially and subjected to stepwise selection, which then returned 15 statistically significant MEM eigenvectors that were included in the regression model with the predictor variables X (PatchSize and IBR).

Questions:

Does this model fit the data better? Look for a lower AIC. In addition, you can compare the adjusted R-squared.
What could cause a difference in model performance?
Does this affect the results for PatchSize and IBR? Compare both parameter estimates and p-values between the two models.

c) Plot spatial eigenvectors {-}

So far, we have treated the MEM spatial eigenvectors as a black box. What kind of patterns do they represent?

First, we plot all the selected (significant) eigenvectors. A convenient way to do so is converting to an sf object and then use the function plot.

Here we need to tweak the column names of the eigenvectors, which are called "X1", "X2" etc., to show which spatial eigenvectors (sf1, sf2, etc.) are being plotted. We will plot all 15 eigenvectors that were selected above, in order of importance.

In the first line, we create a data frame 'MEM' that combines the coordinates from 'xy' with the eigenvectors from 'meigw', ordered by importance.
We add the names for the eigenvectors. sfw.res$other$sf_id contains the numbers (ID's) of the selected eigenvectors.
Then we convert the data frame MEM to an sf object with st_as_sf.
The plot function for sf objects will plot each attribute. Here we specify that the axes should be plotted (axes=TRUE), but no ticks along the axes (xaxt, yaxt) should be shown - thus, only a box will be drawn around each plot. By default, the first ten variables will be plotted. To show all variables, we use the argument max.plot = ncol(MEM) - 1.

MEM <- data.frame(xy, meigw$sf[,sfw.res$other$sf_id])
names(MEM)[-c(1,2)] <- paste0("sf", sfw.res$other$sf_id)

MEM <- st_as_sf(MEM, coords=c("X", "Y"))
plot(MEM, axes=TRUE, yaxt = "n", xaxt = "n", max.plot = ncol(MEM) - 1)

The most important spatial eigenvector (sf6) is plotted at the top left, the second most important (sf23) second from left, etc.

The smallest numbers are patterns with the largest spatial scale (sf1), which here shows a gradient from East to West. The most important eigenvector (sf6) shows a finer-scale pattern with the highest values (yellow) in the center, lowest values East and West, and intermediate values North and South.

However, these patterns individually are not meaningful. More importantly, we can plot the total spatial component in the response as a weighted sum of these component patterns (MEM$wmean), where the weights correspond to the regression coefficients (Estimate) in table sfw.res$r. Here we create a panel with two plots, the modeled spatial component MEM$wmean on the left and the response allelic richness A on the right (the mean has been removed to make values comparable).

Note: you could calculate the weighted mean sfw.res$sf yourself as follows: data.matrix(st_drop_geometry(MEM[,1:15])) %*% sfw.res$r$Estimate

MEM$wmean <- sfw.res$sf
#MEM$pred <- sfw.res$pred
MEM$A <- scale(Dianthus.df$A, scale = FALSE)
plot(MEM[c("wmean", "A")])

Obviously, a big part of the variation in allelic richness is already captured by the weighted mean MEM$wmean. In essence the model then tries to explain the difference between these two sets of values by the predictors "PatchSize" and "IBR".

Let's quantify the correlation of this spatial component with allelic richness, and compare the correlation between the two models:

cor(Dianthus.df$A, data.frame(sfd=sfd.res$sf, sfw=sfw.res$sf))

With the default method (defining neighbors based on a distance cut-off), the spatial component modeled by the significant MEM spatial eigenvectors showed a correlation of 0.625 with the response variable. Using a Gabriel graph with inverse distance weights increased this correlation to 0.805.

This means that the spatial eigenvectors derived from the Gabriel graph were more effective at capturing the spatial variation in allelic richness than the default method. This spatial component is then controlled for when assessing the relationship between allelic richness and the predictors (PatchSize and IBR).

d) Random effect model {-}

The previous model selected 15 MEM spatial eigenvectors, and thus estimated 15 additional parameters. Just like the random effects for family and population in Week 6 lab, we can save a few parameters here by fitting the set of MEM eigenvectors as a random effect. This is done by the function 'resf'.

sfr.res <- spmoran::resf( y=Dianthus.df$A, x=Dianthus.df[,c("PatchSize", "IBR")], 
               meig = meigw, method = "reml" ) 
sfr.res$b
tibble::as_tibble(sfr.res$r)
sfr.res$e
sfr.res$s

As in Week 6 lab, the conditional R-squared is the variance explained by the fixed effects (PatchSize and IBR) and the random effects (significant MEM spatial eigenvectors) together. It is adjusted for the number of effects that were estimated.

Note: we can't compare AIC with the previous models, as the model was fitted with 'reml'.

We get an additional output 'sfr.res$s' with two variance parameters:

random_SE: standard error of the random effect (spatial component).
Moran.I/max(Moran.I): Moran's I of the spatial component, rescaled by the maximum possible value. From the help file: "Based on Griffith (2003), the scaled Moran'I value is interpretable as follows: 0.25-0.50:weak; 0.50-0.70:moderate; 0.70-0.90:strong; 0.90-1.00:marked."
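
To make the quoted interpretation scale easier to apply, here is a small sketch of a lookup function. The helper `classify_moran` is hypothetical (it is not part of 'spmoran'), and the treatment of boundary values (intervals closed on the left) is an assumption, as the help file does not specify it. Values below 0.25 are returned as NA because the quoted scale does not label them.

```r
# Hypothetical helper (not part of 'spmoran'):
# label a rescaled Moran's I value using the categories quoted above
classify_moran <- function(x) {
  as.character(cut(x,
                   breaks = c(0.25, 0.50, 0.70, 0.90, 1.00),
                   labels = c("weak", "moderate", "strong", "marked"),
                   right = FALSE,          # assumption: intervals are [lower, upper)
                   include.lowest = TRUE)) # so that a value of exactly 1 is "marked"
}

classify_moran(c(0.10, 0.30, 0.60, 0.80, 0.95))
```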

7. Fit spatially varying coefficients model with package 'spmoran' {-}

See: https://arxiv.org/ftp/arxiv/papers/1703/1703.04467.pdf

Now comes the coolest part!

So far, we have fitted the same model for all sites. Geographically weighted regression (GWR) would allow relaxing this. Spatial filtering with MEM can be used to accomplish the same goal, and the 'spmoran' tutorial calls this a 'Spatially Varying Coefficients' model (SVC). The main advantage is that we can visualize how the slope parameter estimates, and their p-values, vary across the study area! This is a great exploratory tool that can help us better understand what is going on.
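
The basic idea of letting a slope vary in space can be sketched with simulated data: the site-specific slope is modeled as a constant plus a multiple of a spatial eigenvector, which amounts to an interaction term in a regression. This is a deliberately simplified fixed-effects illustration in base R, not what 'resf_vc' does internally. The spatial weights, the single eigenvector `E1`, and all parameter values are assumptions.

```r
# Minimal sketch with simulated data: a slope that varies smoothly in space
set.seed(42)
n  <- 100
xy <- cbind(runif(n), runif(n))                 # hypothetical coordinates

# One Moran eigenvector from exponentially decaying weights (assumption)
W <- exp(-as.matrix(dist(xy)) / 0.2)
diag(W) <- 0
C  <- diag(n) - matrix(1 / n, n, n)
E1 <- eigen(C %*% W %*% C, symmetric = TRUE)$vectors[, 1] * sqrt(n)

# True site-specific slope: constant part plus spatial pattern
x      <- rnorm(n)
b_true <- 0.5 + 0.4 * E1
y      <- 1 + b_true * x + rnorm(n, sd = 0.2)

# The interaction x:E1 lets the slope of x depend on location
fit   <- lm(y ~ x + E1 + x:E1)
b_hat <- coef(fit)[["x"]] + coef(fit)[["x:E1"]] * E1

summary(b_hat)   # distribution of the estimated site-specific slopes
```

With many eigenvectors and several predictors, such a model has a large number of interaction terms, which is one reason to treat the eigenvector coefficients as random effects instead.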

Model with PatchSize and IBR {-}

We fit the model with 'resf_vc'.

rv_res <- spmoran::resf_vc( y=Dianthus.df$A, 
                            x = Dianthus.df[,c("PatchSize", "IBR")], 
                            xconst = NULL, meig = meigw, method = "reml", x_sel = FALSE)

Instead of one slope estimate for each predictor, we now get a different estimate for each combination of parameter and site (sounds like overfitting?). Here's a summary of the distribution of these estimates.

summary( rv_res$b_vc )

The slope estimate for PatchSize varied between 0.017 and 0.1, with a mean of 0.045. The slope estimate for the 'IBR' term varied between -0.66 and 0.23, with a mean close to 0! That is an astounding range of variation. Keep in mind that we really expect a positive relationship; there is no biological explanation for a negative relationship.

Here is a similar summary of the p-values:

summary( rv_res$p_vc )

For both variables, most sites do not show a significant effect (i.e., only a few sites show a p-value < 0.05).
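
Rather than reading this off the summary, the share of sites with p < 0.05 can be counted directly. The sketch below uses a small made-up table of p-values, `p_demo`, as a stand-in for a table with one column per coefficient and one row per site.

```r
# Made-up p-values for four sites (stand-in for a table of site-level p-values)
p_demo <- data.frame(PatchSize = c(0.01, 0.20, 0.50, 0.04),
                     IBR       = c(0.30, 0.60, 0.02, 0.80))

# Proportion of sites with p < 0.05, per coefficient
prop_sig <- colMeans(p_demo < 0.05)
prop_sig
# PatchSize       IBR
#      0.50      0.25
```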

We could print these results by site (type rv_res$b_vc or rv_res$p_vc). Even better, we can plot them in space. We start with combining the data ('Dianthus.df') and the results into one data frame 'Result'. By specifying b=rv_res$b_vc and p=rv_res$p_vc, R will create column names that start with 'b' or 'p', respectively.

Result <- data.frame(Dianthus.df, b=rv_res$b_vc, p=rv_res$p_vc)
names(Result)
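
The naming behavior of 'data.frame' can be checked on a toy example. The objects `b_demo`, `p_demo` and `demo` are made up for illustration.

```r
# Toy example: named multi-column arguments get a prefix in data.frame()
b_demo <- data.frame(PatchSize = c(0.02, 0.05), IBR = c(-0.10, 0.20))
p_demo <- data.frame(PatchSize = c(0.30, 0.01), IBR = c(0.04, 0.60))

demo <- data.frame(site = c("A", "B"), b = b_demo, p = p_demo)
names(demo)
# "site" "b.PatchSize" "b.IBR" "p.PatchSize" "p.IBR"
```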

Let's start with PatchSize. Here, we first plot PatchSize in space, with symbol size as a function of patch size. In a second plot, we color sites by statistical significance and the size of the symbols represents the parameter estimate of the regression slope coefficient for PatchSize. The layer 'coord_fixed' controls the aspect ratio between x- and y-axes.

require(ggplot2)
ggplot(as.data.frame(Result), aes(X, Y, size=PatchSize)) +
  geom_point(color="darkblue") + coord_fixed()
ggplot(as.data.frame(Result), aes(X, Y, col=p.PatchSize < 0.05, size=b.PatchSize)) +
  geom_point() + coord_fixed()

Let's do the same for 'IBR':

require(ggplot2)
ggplot(as.data.frame(Result), aes(X, Y, size=IBR)) +
  geom_point(color="darkgreen") + coord_fixed()
ggplot(as.data.frame(Result), aes(X, Y, col=p.IBR < 0.05, size=b.IBR)) +
  geom_point() + coord_fixed()

The very small dots in the first map are the ungrazed patches.
From the second map, it looks like the significant values were the ones with negative slope estimates, for which we don't have a biological interpretation.

Model with IBR only {-}

Keep in mind that 'IBR' and 'PatchSize' showed a strong correlation. The parameter estimates could therefore depend quite a bit on the other variables. To help with the interpretation, let's repeat the last analysis just with 'IBR', without 'PatchSize'.

rv_res <- spmoran::resf_vc( y=Dianthus.df$A, 
                            x = Dianthus.df[,c("IBR")], 
                            xconst = NULL, meig = meigw, method = "reml", x_sel = FALSE)
summary( rv_res$b_vc )

Now the range of slope estimates is smaller, most sites have a positive estimate, and the mean is approx. 0.21.

summary( rv_res$p_vc )

Also, a larger proportion of sites now has p-values < 0.05.
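
Why the estimates for one predictor can change so much when a correlated predictor is dropped can be illustrated with simulated data. In this base-R sketch, `x1` and `x2` are hypothetical correlated predictors and only `x1` has a direct effect on `y`. All values are assumptions and unrelated to the Dianthus data.

```r
# Minimal sketch: slope estimates with and without a correlated predictor
set.seed(7)
n  <- 200
x1 <- rnorm(n)                         # first predictor
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)    # second predictor, correlated with x1
y  <- 1 * x1 + rnorm(n, sd = 0.5)      # only x1 has a direct effect on y

slope_alone <- coef(lm(y ~ x2))[["x2"]]        # x2 as the only predictor
slope_joint <- coef(lm(y ~ x1 + x2))[["x2"]]   # x2 adjusted for x1

c(alone = slope_alone, joint = slope_joint, r = cor(x1, x2))
```

Fitted alone, `x2` picks up the effect of `x1`. In the joint model its slope is close to zero. The data alone cannot tell us which of two correlated predictors is the causal one.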

Let's plot the results in space to facilitate interpretation. Note that with a single predictor, the columns with the slope estimates and p-values are named 'b.V1' and 'p.V1'.

Result <- data.frame(Dianthus.df, b=rv_res$b_vc, p=rv_res$p_vc, resid=rv_res$resid)
ggplot(as.data.frame(Result), aes(X, Y, col=p.V1 < 0.05, size=b.V1)) +
  geom_point() + coord_fixed()

This is a very different map of results!

Most sites now show significant effects.
The sites with larger positive estimates show significant effects, whereas those with small or negative estimates show non-significant effects.
There are 3 - 4 clusters of sites where the IBR model is not effective at explaining variation in allelic richness: in the very East, in the South-East, and one area in the South-West.
Knowing the study area, these are distinct regions (e.g. valleys) that may suggest further biological explanations.

# Package 'sf' is not attached above, hence we call its functions with 'sf::'
Result.sf <- sf::st_as_sf(Result, coords=c("X", "Y"), crs=sf::st_crs(Dianthus))
Result.sf$Significant <- Result.sf$p.V1 < 0.05

tmap_mode("view")
tm_shape(Result.sf) + tm_bubbles(size="b.V1", col="Significant")

We can compare this to a model with fixed

8. Conclusions {-}

We moved from pair-wise distance matrices (link-based) to node-based analysis by integrating the explanatory distance matrices for IBD and IBR into patch-level connectivity indices Si (neighborhood analysis).
We found no support for the IBD model, and strong support for the IBR model when tested without additional predictors.
The site-level predictor 'PatchSize' (log('Ha')) was strongly correlated with our IBR model, and when 'PatchSize' was added to the model, 'IBR' was no longer statistically significant and its slope estimate changed considerably.
The MEM analogue to geographically weighted regression showed very different patterns for 'IBR' depending on whether or not 'PatchSize' was included in the model. Without 'PatchSize', 'IBR' showed significant positive correlation with allelic richness across the study area, except for three sub-areas.
In practical terms, this may suggest that the management strategy of maintaining plant functional connectivity through shepherding seems to be working for this species overall, though there are three parts of the study area where this may not be sufficient to maintain gene flow.
The evidence is not conclusive, however: the observed patterns could also be explained by population size, which in this species seems to be associated with patch size. This makes sense if smaller patches contain smaller populations with higher rates of genetic drift.

9. R Exercise Week 7 {-}

The Pulsatilla vulgaris dataset that we've been analyzing in the R exercises has two variables that were observed or calculated for each sampled mother plant (i.e., those plants from which seeds were collected; see DiLeo et al. 2017, Journal of Ecology):

flower.density: the number of flowers within 2 m of the mother plant. A radius of 2 m around mother plants was chosen as it gave the strongest correlation with selfing rates and pollination distances compared to lower (1 m) and higher (3 m) tested values.
mom.isolation: the mean distance of the mother plant to all other plants within the population (mean neighbour distance).

We would expect the following:

A negative relationship, where floral density within 2 m of isolated mother plants is low.
Likely (right-)skewed distributions for both variables.
Positive spatial autocorrelation of both variables within each patch.
Values within each patch likely more similar than between patches.

Consider how the points listed above may violate different assumptions of a linear regression model fitted with least squares (lm). How can we test and account for these potential violations and fit a valid model?

Task: Test the regression of flower density on the isolation of the sampled mother plants of Pulsatilla vulgaris. Account for the sampling of multiple mothers from each of seven patches, and for residual spatial autocorrelation (if statistically significant).

Hints:

a) Load packages: You may want to load the packages dplyr, ggplot2 and 'nlme'. Alternatively, you can use :: to call functions from packages.

b) Import data, add spatial coordinates. Use the code below to import the data, extract moms, add spatial coordinates, and remove replicate flowers sampled from the same mother.

library(dplyr)

# Dataset with variables 'flower.density' and 'mom.isolation' for each mom:
Moms <- read.csv(system.file("extdata",
                            "pulsatilla_momVariables.csv", 
                            package = "LandGenCourse"))

# Dataset with spatial coordinates of individuals:
Pulsatilla <- read.csv(system.file("extdata",
                            "pulsatilla_genotypes.csv", 
                            package = "LandGenCourse"))
Adults <- Pulsatilla %>% filter(OffID == 0)

# Combine data
Moms <- left_join(Moms, Adults[,1:5])

# Remove replicate flowers sampled from the same mother
Moms <- Moms %>% filter(OffID == 0)

c) Explore data. How many mother plants are there in total, and per population? Are the distributions of flower.density and of mom.isolation skewed?
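
As a starting point, here is a sketch of such an exploration on simulated data. The data frame `moms_sim`, its column names, and the helper `skew` are made up for illustration. Adapt the names to the actual `Moms` data.

```r
# Simulated stand-in: 7 populations with 10 mothers each, right-skewed variable
set.seed(3)
moms_sim <- data.frame(
  Population    = rep(paste0("P", 1:7), each = 10),
  mom.isolation = rlnorm(70, meanlog = 1, sdlog = 1)
)

nrow(moms_sim)                 # total number of mothers
table(moms_sim$Population)     # number of mothers per population

# Simple moment-based skewness: values near 0 indicate a symmetric distribution
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(raw    = skew(moms_sim$mom.isolation),
  logged = skew(log(moms_sim$mom.isolation)))
```

A histogram (`hist`) of the raw and the log-transformed variable shows the same contrast visually.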

d) Create scatterplots: Use ggplot to create a scatterplot of flower.density (y) against mom.isolation (x). Modify the plot with coord_trans to apply a log-transformation to each axis. Will this make the relationship more linear? - Note: using geom_smooth together with coord_trans can create problems. You may adapt the following code to plot two ggplot-type plots side-by-side: gridExtra::grid.arrange(myPlot1, myPlot2, nrow = 1)

e) Scatterplot with line: Instead of using coord_trans, create a scatterplot with log-transformed variable (log(y) vs. log(x)). Add a regression line.
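
A sketch of such a plot with simulated data follows. The data frame `sim` and its columns `x` and `y` are placeholders, to be replaced by `Moms` and its variables.

```r
library(ggplot2)

# Simulated stand-in data with a negative relationship on the log-log scale
set.seed(5)
sim   <- data.frame(x = rlnorm(60, meanlog = 1, sdlog = 0.8))
sim$y <- exp(2 - 0.7 * log(sim$x) + rnorm(60, sd = 0.4))

# Transform the variables inside aes(), then add a linear regression line
p_loglog <- ggplot(sim, aes(x = log(x), y = log(y))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
p_loglog
```

Because the variables are transformed before plotting, the regression line is fitted on the log-log scale, which avoids the problems of combining geom_smooth with coord_trans.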

f) Fit non-spatial models: Adapt code from section 4 to fit two models with the response log(flower.density) and the fixed factor log(mom.isolation), using functions from package nlme:

- Basic model: you can fit a simple model by adapting this code:

nlme::gls(Response ~ FixedEffect, data=Data, method="REML")

- Random effect: add a random effect for Population, use nlme::lme instead of gls:

nlme::lme(Response ~ FixedEffect, random = ~ 1| RandomEffect, data=Data, method="REML")

- For now, omit the correlation term (no spatial correlation structure).
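
To show the two templates in action, here is a sketch with simulated data. The data frame `sim`, its columns, and all parameter values are made up. With the real data, the response and predictor would be the log-transformed variables from `Moms`.

```r
library(nlme)

# Simulated stand-in: 7 populations, 10 observations each
set.seed(11)
sim <- data.frame(Population = factor(rep(paste0("P", 1:7), each = 10)),
                  x = rnorm(70))
pop_effect <- rnorm(7, sd = 0.5)       # population-level random intercepts
sim$y <- 1 - 0.6 * sim$x + pop_effect[as.integer(sim$Population)] +
  rnorm(70, sd = 0.3)

# Basic model without random effect
mod.gls <- gls(y ~ x, data = sim, method = "REML")

# Mixed model with a random intercept for Population
mod.lme <- lme(y ~ x, random = ~ 1 | Population, data = sim, method = "REML")

# Both models have the same fixed effects and were fitted with REML
AIC(mod.gls, mod.lme)
```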

g) Plot residual variograms: Plot variograms for the two models. Inspect the x-axes of the plots. What is the effect of including the random effect Population on the fitting of the variogram? Recall the sampling design. Was there spatial autocorrelation within populations? Hints:

- Check the variable names in `Moms` to adapt the names of the x and y coordinates as needed in the term `form= ~ xcoord + ycoord`.
- Print each variogram object to check the number of pairs per lag. Ideally, this should be around 100.
- These variogram plots are trellis plots. Fortunately, you can again use `gridExtra::grid.arrange` to plot them side by side. Write each plot into an object first.

h) Add correlation structure: Add a term of the type correlation = nlme::corExp(form = ~ x + y, nugget=T) to the mixed model fitted with REML. Adapt code from section 4.b to evaluate different variogram functions (exponential, spherical, Gaussian, ratio) and use AIC (with REML) to choose the best-fitting variogram model.
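
One way to organize this comparison is to loop over a list of correlation structures. The sketch below uses simulated data with spatially correlated errors within populations. The data frame `sim`, the coordinate names `X` and `Y`, and all parameter values are assumptions. Because models with spatial correlation structures do not always converge, each fit is wrapped in `tryCatch` and failed fits are skipped.

```r
library(nlme)

# Simulated stand-in: 7 populations, 10 plants each, exponential correlation
set.seed(21)
sim <- do.call(rbind, lapply(1:7, function(k) {
  xy  <- cbind(X = runif(10, 0, 100), Y = runif(10, 0, 100))
  Sig <- exp(-as.matrix(dist(xy)) / 30)           # correlation decays with distance
  e   <- as.numeric(t(chol(Sig)) %*% rnorm(10))   # spatially correlated errors
  data.frame(Population = paste0("P", k), xy, e = e)
}))
sim$x <- rnorm(nrow(sim))
sim$y <- 1 - 0.6 * sim$x + 0.5 * sim$e + rnorm(nrow(sim), sd = 0.2)

# Candidate variogram functions, all with a nugget effect
structs <- list(
  exponential = corExp(form = ~ X + Y, nugget = TRUE),
  spherical   = corSpher(form = ~ X + Y, nugget = TRUE),
  gaussian    = corGaus(form = ~ X + Y, nugget = TRUE),
  ratio       = corRatio(form = ~ X + Y, nugget = TRUE)
)

# Fit the mixed model once per structure; return NULL if a fit fails
fits <- lapply(structs, function(cs) {
  tryCatch(lme(y ~ x, random = ~ 1 | Population, data = sim,
               correlation = cs, method = "REML"),
           error = function(e) NULL)
})

# Compare AIC among the models that could be fitted
converged <- !vapply(fits, is.null, logical(1))
aic <- vapply(fits[converged], AIC, numeric(1))
sort(aic)
```

In 'lme', the spatial correlation is modeled within the levels of the grouping factor, here within each population.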

i) Check residual plots: For the best model (fitted with REML), check the residuals. Plot a variogram of the residuals, and the fitted variogram.

k) Test fixed effect: Refit the best model with maximum likelihood to test the fixed effect with car::Anova. Give the model a new name to keep them apart and avoid overwriting.

l) Determine the marginal R-squared. For the best model (fitted with REML), use MuMIn::r.squaredGLMM to determine the marginal and conditional R-squared.

Questions: Justify your answers to the following questions:

Did you find a statistically significant, negative relationship between local floral density and mom isolation? If so, how strong was it?
Was it necessary to account for skewness in both variables?
Was it necessary to account for spatial autocorrelation?
Was it necessary to account for population?

LandGenCourse::detachAllPackages()

hhwagner1/LandGenCourse documentation built on Feb. 17, 2024, 4:42 p.m.
