1. Overview of Worked Example {-}

a) Goals {-}

This worked example shows how to:

b) Data set {-}

Lamy et al. (2012) sampled the freshwater snail Drepanotrema depressissimum in a fragmented landscape of tropical ponds on the island of Guadeloupe in the French West Indies. They used a spatially and temporally stratified sampling design with a total of 25 sites, where 12 sites formed four well-separated clusters of three neighbouring sites each, to study spatial variability, and 12 sites spread across the island were sampled in multiple years to study temporal variability. For each site and year, 22 - 34 individuals were genotyped at ten microsatellite loci. The species is diploid, hermaphroditic, and outcrossed.

A key characteristic of this system is the presence of a dry and a rainy season. In the dry season, many ponds can dry out, possibly causing extinction of the local snail populations. During the rainy season, ponds refill and can even overflow, thereby becoming connected through the hydrological network. During this rainy season, dispersal between ponds may occur.

Reference {-}

Lamy, T., Pointier, J. P., Jarne, P. and David, P. (2012), Testing metapopulation dynamics using genetic, demographic and ecological data. Molecular Ecology, 21: 1394–1410. doi:10.1111/j.1365-294X.2012.05478.x

c) Required R libraries {-}

library(LandGenCourse)
library(methods)
library(dplyr)
#require(tibble)
#require(poppr)
#require(ade4)
#require(pwr)
#require(effsize)
require(sf)
#require(car) 
library(ggplot2)
library(tmap)

d) Import data {-}

Let's import the genetic data (spatial and temporal data sets combined, 42 combinations of site and year).

The 'genind' object 'dd.genind' contains individual-level data in the following slots:

data(dd.genind, package = "LandGenCourse")
dd.genind

We also import site-level data from Tables 2 - 5 in Lamy et al. (2012). This is a (spatial) sf object. Use '?dd.site' to check the helpfile with the data set description of the variables (atribute data).

data(dd.site, package = "LandGenCourse")
tibble::as_tibble(dd.site)
?dd.site

Questions: with the help file for 'dd.site', check the meaning of the following explanatory variables:

Your hypothesis: which explanatory variables would you expect to affect:

In the following, we'll perform three types of analyses:

2. Spatial distribution of genetic structure {-}

How similar are populations from nearby habitat patches compared to populations across the island? To answer this question, we perform a hiearchical AMOVA (analysis of molecular variance) with individuals from 12 populations that form 4 clusters with 3 populations each.

a) Subsetting the 'genind' object {-}

First, we need to extract the samples that belong to the hierarchical data set. There are four clusters: "North", "East", "Center" and "South". We are looking for the observations where the variable "Cluster" has one of these four values, all other observations will have a missing value for "Cluster". We can use '!is.na' to identify the rows with non-missing values. Because a genind object is at its core a data frame @tab with rows (individuals) and columns(alleles), we can subset rows or columns. Here we subset rows.

dd.genind.Cluster <- dd.genind[!is.na(dd.genind@strata$Cluster),]

b) Hierarchical AMOVA {-}

There are several implementations of AMOVA in R, e.g. in pacakges 'ade4', 'pegas' and 'vegan'. The 'ade4' implementation is closest to the original implementation in Arlequin. Package 'poppr' has a wrapper function 'poppr.amova' that makes it easy to perform AMOVA with the 'ade4' or with the 'pegas' implementation (see '?poppr.amova' for a discussion of their pros and cons). Here we'll use 'ade4'.

First we run the AMOVA and estimate the percent of molecular variance at each hierarchical level.

amova.result <- poppr::poppr.amova(dd.genind.Cluster, hier = ~ Cluster/SITE, 
                                   within=FALSE, method = "ade4")
amova.result

Then we test whether each variance component is statistically significant (i.e., significantly larger than zero).

Note: if this takes very long, you can set nrepet = 199 for this exercise (not for your research).

amova.test <- ade4::randtest(amova.result, nrepet = 999)
amova.test

Questions:

3. What determines genetic differentiation among sites? {-}

What factors explain site-specific Fst? Let's consider the key micro-evolutionary processes:

First, we create a new sf object with the subset of data for the spatial analysis (25 ponds, one year each).

dd.spatial <- dd.site[dd.site$Spatial==TRUE,]

a) Correlation matrix {-}

Let's start with a correlation matrix.

dd.df <- st_drop_geometry(dd.spatial)
cor(dd.df[ , c("FST.GESTE", "NLT", "C", "D")], use="pairwise.complete")

Questions:

b) Scatterplots {-}

Let's plot the response variable FST.GESTE against each of the two predictors NLT and C. Here, we use functions from the package ggplot2 (already loaded) to define two ggplot objects NLT.plot and C.plot, then we plot them side-by-side with the function cowplot::plot_grid.

NLT.plot <- ggplot(dd.df, aes(x=NLT, y=FST.GESTE)) + 
  geom_point() + 
  geom_smooth(formula = 'y ~ x', method = lm, se = TRUE) +
  geom_text(aes(x=NLT, y=FST.GESTE, label=SITE), size=2, nudge_x=0, nudge_y=0.01, check_overlap=TRUE)

C.plot <- ggplot(dd.df, aes(x=C, y=FST.GESTE)) + 
  geom_point() +
  geom_smooth(formula = 'y ~ x', method = lm, se = TRUE) +
  geom_text(aes(x=C, y=FST.GESTE, label=SITE), size=2.5, nudge_x=0, nudge_y=0.01, check_overlap=TRUE)

cowplot::plot_grid(NLT.plot, C.plot)

c) Regression model {-}

The two predictors 'NLT' and 'C' are not strongly correlated. We'll fit a regression model with both predictors. Here we use function 'scale' to standardize each variable, so that we can interpret the regression slope coefficients as partial correlation coefficients (beta coefficients).

mod.diff <- lm(scale(FST.GESTE) ~ scale(NLT) + scale(C), 
               data=dd.spatial)
summary(mod.diff)

Is the model valid? Let's check the residual plots.

Here's a link to a great resource about the interpretation of these plots generated by R: http://strata.uga.edu/6370/rtips/regressionPlots.html

par(mfrow=c(2,2))
plot(mod.diff, labels.id = dd.spatial$SITE)
par(mfrow=c(1,1))

If we had more than two predictors, it would be a good idea to calculate variance inflation factors. The package 'car' has a function 'vif' that takes as argument a fitted model. Here, both predictors have VIF = 1.007, which indicates no collinearity.

car::vif(mod.diff)

d) Which populations don't fit the general pattern? {-}

Let's plot the residuals in space. The function tm_bubbles from the package 'sf' evaluates the projection information of the sf object 'dd.spatial'. First, we need to create some new variables:

dd.spatial$Residuals <- mod.diff$residuals
dd.spatial$Absolute <- abs(mod.diff$residuals)

a <- which(dd.spatial$Absolute > 1.5)

Now we can create a bubble plot with the size of the bubble proportional to the absolute value of the residual, and the color according to the sign (positive or negative). For the latter, we use the argument breaks. Here, any value between -Inf and 0 will be plotted in red, and any value between 0 and Inf will be plotted in blue.

In addition, we label the two largest outliers (with absolute values >1.5) and position the labels with the argument just. Here, just=c(0.7,2.5) means that the labels are placed at 0.7 along the horizontal axes and at 2.5 along the vertical axis, compared to the point location. This requires some playing around with values.

tmap_mode("plot")

Map1 <- tm_shape(dd.spatial) + tm_bubbles(size="Absolute",col="Residuals", 
                                  breaks=c(-Inf, 0, Inf), palette=c("red", "blue")) +
  tm_shape(dd.spatial[a,]) + tm_text(text="SITE", size=0.8, just=c(0.7,2.5))

Map1

Export this map as a pdf file. (Un-comment the lines below to run the code below, i.e., remove the hashtag symsbols '#').

#if(!dir.exists(here::here("output"))) dir.create(here::here("output"))
#tmap_save(Map1, file=here::here("output/ResidualMap.pdf"), width = 7, height = 5.5, units = "in",  dpi = 300)

By changing the mode to view, we can create convert the map into an interactive plot with a background map from the internet (see Week 2).

Un-comment the lines below to run the code.

#tmap_mode("view")
#Map1

What might explain the large residuals for the two sites 'PTC' and 'DESB'?

e) Regression model without outliers {-}

We can use the same index a to exclude the potential outliers from the regression model:

mod.diff.minus2 <- lm(scale(FST.GESTE) ~ scale(NLT) + scale(C), 
               data=dd.spatial[-a,])
summary(mod.diff.minus2)
par(mfrow=c(2,2))
plot(mod.diff.minus2, labels.id = dd.spatial$SITE[-a])
par(mfrow=c(1,1))

4. What determines genetic diversity? {-}

a) Correlation matrix {-}

cor(dd.df[, c("RA", "He", "Size", "NLT", "C", "D")],
    use="pairwise.complete")

Questions:

b) Regression models {-}

For allelic richness:

mod.RA <- lm(scale(RA) ~ scale(NLT) + scale(C), data = dd.spatial)
summary(mod.RA)
par(mfrow=c(2,2))
plot(mod.RA, labels.id = dd.spatial$SITE)
par(mfrow=c(1,1))

For gene diversity (expected heterozygosity):

mod.He <- lm(scale(He) ~ scale(NLT) + scale(C), data = dd.spatial)
summary(mod.He)
par(mfrow=c(2,2))
plot(mod.He, labels.id = dd.spatial$SITE)
par(mfrow=c(1,1))

5. Are genetic differentiation and diversity related? {-}

Would you expect a relationship between genetic diversity and genetic differentiation of individual patches?

Lets examine the correlation between gene diversity (He) and site-specific Fst:

cor(dd.site$He, dd.site$FST.GESTE, use = "pairwise.complete")

There are a number of possible reasons for such a correlation. Can you put forward some hypotheses to explain this relationship? See Lamy et al. (2012) for their interpretation.

6. Effect of recent extinction events {-}

Several patches fell dry between observation years, which is assumed to signify extinction of the local population. Does genetic evidence support this interpretation, i.e., is there genetic evidence of bottlenecks or founder effects in D. depressissimum?

a) Effect of patch extinction event (temporal data set) {-}

dd.temporal <- dd.site[dd.site$MultiYear==TRUE,]
dd.temporal.df <- sf::st_drop_geometry(dd.temporal)
cor(dd.temporal.df[, c("Fst.temp", "APE", "NLT", "C")], use="pairwise.complete")

We can compare a number of competing models using the Akaike Information Criterion (AIC). Models with lower AIC are better (see Week 12).

mod.Fst.temp <- lm(scale(Fst.temp) ~ scale(APE), data=dd.temporal.df)
summary(mod.Fst.temp)

mod.Fst.temp.C <- lm(scale(Fst.temp) ~ scale(APE) + scale(C), 
                     data=dd.temporal.df)
mod.Fst.temp.NLT <- lm(scale(Fst.temp) ~ scale(APE) + scale(NLT), 
                       data=dd.temporal.df)
mod.Fst.temp.both <- lm(scale(Fst.temp) ~ scale(APE) + scale(NLT) + scale(C),  
            data=dd.temporal.df)
AIC(mod.Fst.temp, mod.Fst.temp.C, mod.Fst.temp.NLT, mod.Fst.temp.both)

The best model includes neither 'C' nor 'NLT'. Note that 'APE' is a binary variable, so in essence we're performing a t-test here.

res.Fst.temp <- t.test(Fst.temp ~ APE, data=dd.temporal, alternative = "less")
res.Fst.temp

b) Power analysis {-}

The effect is not statistically significant. Does that mean that we found no effect of apparent population extinctions on temporal Fst? Let's check effect size. For means, Cohen's effect size is measured by d (which is measured in units of standard deviations):

We can let R calculate effect size for us:

effsize::cohen.d(Fst.temp ~ factor(APE), data=dd.temporal.df)

So, we actually found a 'medium' effect (more than 0.5 standard deviations difference between group means). Maybe sample size was too small to have sufficient power?

Let's check sample size:

table(dd.temporal$APE[!is.na(dd.temporal.df$Fst.temp)])

Ah, that explains a lot. There were only 5 sites with apparent extinction, and 7 without.

Given that sample size, what was the statistical power of our test to detect at least a large effect (d = - 0.8), i.e., be able to reject the null hypothesis if such an effect is present in the population from which we sampled?

pwr::pwr.t2n.test(n1=7, n2=5, d=-0.8, alternative = "less")

So the power to detect at least a large effect, if it exists in the population, was only 0.355, way below the 0.8 (or even 0.95) that we would want to see. For a medium effect, the power is even smaller.

c) Sample size calculation {-}

How large a sample would we have needed in each group to achieve a power of 0.8 to detect a large effect? And for a medium effect?

pwr::pwr.t.test(power = 0.8, d = -0.8, alternative = "less")
pwr::pwr.t.test(power = 0.8, d = -0.5, alternative = "less")

More than 20 sites in each group would have been needed to detect a large effect, or more than 50 per group to detect a medium effect, with a power of 80%.

Hence, these particular results are inconclusive. There was a trend showing a large effect size but power was very low. This aspect of the study should ideally be repeated with a larger sample size before reaching any conclusions.

Note however that using additional evidence (e.g., population assignment tests), Lamy et al. (2012) concluded that extinctions were in fact less common in this system than previously assumed – in many cases of apparent extinction, individuals may still be present but just not detected.

7. R Exercise Week 4 {-}

Task: Build on your previous exercises and plot the sites on a map downloaded from the internet. Explore the relationships between Hexp, census population size and percent forest cover within 500 m of the site (forest may act as a barrier for grassland plants).

Hints:

a) Load packages: You may want to load the packages dplyr and tmap. Alternatively, you can use :: to call functions from packages. b) Import your datasets from Weeks 2 & 3 R Exercises. Here's an example for your code, adapt it as needed to import the R objects "Pulsatilla.longlat.rds" (sf object, Week 2) and "H.pop.rds" (Week 3) that you saved previously: Pulsatilla.longlat <- readRDS(here::here("output/Pulsatilla.longlat.rds")) c) Plot sites on map from internet: adapt the code from section 3.d to plot the sampling locations on a background map from the internet. Next, modify code from section 3.d to add labels for all sites. d) Combine data: Use the function dplyr::left_join to add the variables from the dataset H.pop to Pulsatilla.longlat. Notes:

- This is important, as the order of populations may not be the same in the two datasets. 
- Remember to check the structure of the datasets (variable names and types) first so that you know which are the ID variables that you can use to match sites. 
- If the two ID variables are not of the same type (e.g., one if a `factor`, the other is `character`), it is best to change the format of one (e.g., with `as.character`) before doing the left-join.

e) Scatterplot with regression line: Create a scatterplot of Hexp (y axis) plotted against nIndiv (x axis). Add a regression line and, if possible, label points. You may modify code from section 3.b or use base R functions.
f) Regression analysis: Adapt code from section 3.c to perform a regression of Hexp (response variable) on the predictor nIndiv. Create residual plots and inspect them. What is the main issue here?

Questions: There is one influential point in the regression analysis:

LandGenCourse::detachAllPackages()


hhwagner1/LandGenCourse documentation built on Feb. 17, 2024, 4:42 p.m.