This worked example shows how to:
Lamy et al. (2012) sampled the freshwater snail Drepanotrema depressissimum in a fragmented landscape of tropical ponds on the island of Guadeloupe in the French West Indies. They used a spatially and temporally stratified sampling design with a total of 25 sites, where 12 sites formed four well-separated clusters of three neighbouring sites each, to study spatial variability, and 12 sites spread across the island were sampled in multiple years to study temporal variability. For each site and year, 22 - 34 individuals were genotyped at ten microsatellite loci. The species is diploid, hermaphroditic, and outcrossed.
A key characteristic of this system is the presence of a dry and a rainy season. In the dry season, many ponds can dry out, possibly causing extinction of the local snail populations. During the rainy season, ponds refill and can even overflow, thereby becoming connected through the hydrological network. During this rainy season, dispersal between ponds may occur.
dd.genind: The dataset 'dd.genind' with genetic data for 1270 snails from 42 populations is included in package 'LandGenCourse'. To load it, type: data(dd.genind).
dd.site: Population-level data from Tables 2 - 5 of Lamy et al. (2012) are available in dataset dd.site
(with 25 variables) in package LandGenCourse
. To load it, type: data(dd.site)
.
Lamy, T., Pointier, J. P., Jarne, P. and David, P. (2012), Testing metapopulation dynamics using genetic, demographic and ecological data. Molecular Ecology, 21: 1394–1410. doi:10.1111/j.1365-294X.2012.05478.x
library(LandGenCourse) library(methods) library(dplyr) #require(tibble) #require(poppr) #require(ade4) #require(pwr) #require(effsize) require(sf) #require(car) library(ggplot2) library(tmap)
Let's import the genetic data (spatial and temporal data sets combined, 42 combinations of site and year).
The 'genind' object 'dd.genind' contains individual-level data in the following slots:
data(dd.genind, package = "LandGenCourse") dd.genind
We also import site-level data from Tables 2 - 5 in Lamy et al. (2012). This is a (spatial) sf
object. Use '?dd.site' to check the helpfile with the data set description of the variables (atribute data).
data(dd.site, package = "LandGenCourse") tibble::as_tibble(dd.site) ?dd.site
Questions: with the help file for 'dd.site', check the meaning of the following explanatory variables:
Your hypothesis: which explanatory variables would you expect to affect:
In the following, we'll perform three types of analyses:
How similar are populations from nearby habitat patches compared to populations across the island? To answer this question, we perform a hiearchical AMOVA (analysis of molecular variance) with individuals from 12 populations that form 4 clusters with 3 populations each.
First, we need to extract the samples that belong to the hierarchical data set. There are four clusters: "North", "East", "Center" and "South". We are looking for the observations where the variable "Cluster" has one of these four values, all other observations will have a missing value for "Cluster". We can use '!is.na' to identify the rows with non-missing values. Because a genind object is at its core a data frame @tab
with rows (individuals) and columns(alleles), we can subset rows or columns. Here we subset rows.
dd.genind.Cluster <- dd.genind[!is.na(dd.genind@strata$Cluster),]
There are several implementations of AMOVA in R, e.g. in pacakges 'ade4', 'pegas' and 'vegan'. The 'ade4' implementation is closest to the original implementation in Arlequin. Package 'poppr' has a wrapper function 'poppr.amova' that makes it easy to perform AMOVA with the 'ade4' or with the 'pegas' implementation (see '?poppr.amova' for a discussion of their pros and cons). Here we'll use 'ade4'.
@strata
slot of the 'genind' object. First we run the AMOVA and estimate the percent of molecular variance at each hierarchical level.
amova.result <- poppr::poppr.amova(dd.genind.Cluster, hier = ~ Cluster/SITE, within=FALSE, method = "ade4") amova.result
Then we test whether each variance component is statistically significant (i.e., significantly larger than zero).
Note: if this takes very long, you can set nrepet = 199
for this exercise (not for your research).
amova.test <- ade4::randtest(amova.result, nrepet = 999) amova.test
Questions:
What factors explain site-specific Fst? Let's consider the key micro-evolutionary processes:
First, we create a new sf
object with the subset of data for the spatial analysis (25 ponds, one year each).
dd.spatial <- dd.site[dd.site$Spatial==TRUE,]
Let's start with a correlation matrix.
sf
object with st_drop_geometry
. This will return a data frame dd.df
with the attribute data. cor
.use
of the function cor
defines how to handle missing values. use="pairwise.complete"
means that for each pair of variables, those individuals with no missing values for these variable will be used. See help file ?cor
for alternatives.dd.df <- st_drop_geometry(dd.spatial) cor(dd.df[ , c("FST.GESTE", "NLT", "C", "D")], use="pairwise.complete")
Questions:
Let's plot the response variable FST.GESTE against each of the two predictors NLT and C. Here, we use functions from the package ggplot2
(already loaded) to define two ggplot objects NLT.plot
and C.plot
, then we plot them side-by-side with the function cowplot::plot_grid
.
dd.spatial@data
, the x-axis as variable NLT, the y-axis as variable FST.GESTE, and the labels as variable SITE. geom_smooth
), make it linear (method = lm
) and add a shaded area for plus/minus 1 SE of the mean for a given value of x (se = TRUE
).geom_label
), define their size (size
), and move them up a little along the y-axis (nudge_y
)NLT.plot <- ggplot(dd.df, aes(x=NLT, y=FST.GESTE)) + geom_point() + geom_smooth(formula = 'y ~ x', method = lm, se = TRUE) + geom_text(aes(x=NLT, y=FST.GESTE, label=SITE), size=2, nudge_x=0, nudge_y=0.01, check_overlap=TRUE) C.plot <- ggplot(dd.df, aes(x=C, y=FST.GESTE)) + geom_point() + geom_smooth(formula = 'y ~ x', method = lm, se = TRUE) + geom_text(aes(x=C, y=FST.GESTE, label=SITE), size=2.5, nudge_x=0, nudge_y=0.01, check_overlap=TRUE) cowplot::plot_grid(NLT.plot, C.plot)
The two predictors 'NLT' and 'C' are not strongly correlated. We'll fit a regression model with both predictors. Here we use function 'scale' to standardize each variable, so that we can interpret the regression slope coefficients as partial correlation coefficients (beta coefficients).
mod.diff <- lm(scale(FST.GESTE) ~ scale(NLT) + scale(C), data=dd.spatial) summary(mod.diff)
Is the model valid? Let's check the residual plots.
Here's a link to a great resource about the interpretation of these plots generated by R: http://strata.uga.edu/6370/rtips/regressionPlots.html
par(mfrow=c(2,2)) plot(mod.diff, labels.id = dd.spatial$SITE) par(mfrow=c(1,1))
If we had more than two predictors, it would be a good idea to calculate variance inflation factors. The package 'car' has a function 'vif' that takes as argument a fitted model. Here, both predictors have VIF = 1.007, which indicates no collinearity.
car::vif(mod.diff)
Let's plot the residuals in space. The function tm_bubbles
from the package 'sf' evaluates the projection information of the sf
object 'dd.spatial'. First, we need to create some new variables:
a
that identifies potential outliers (here: absolute value > 1.5)dd.spatial$Residuals <- mod.diff$residuals dd.spatial$Absolute <- abs(mod.diff$residuals) a <- which(dd.spatial$Absolute > 1.5)
Now we can create a bubble plot with the size of the bubble proportional to the absolute value of the residual, and the color according to the sign (positive or negative). For the latter, we use the argument breaks
. Here, any value between -Inf and 0 will be plotted in red, and any value between 0 and Inf will be plotted in blue.
In addition, we label the two largest outliers (with absolute values >1.5) and position the labels with the argument just
. Here, just=c(0.7,2.5)
means that the labels are placed at 0.7 along the horizontal axes and at 2.5 along the vertical axis, compared to the point location. This requires some playing around with values.
tmap_mode("plot") Map1 <- tm_shape(dd.spatial) + tm_bubbles(size="Absolute",col="Residuals", breaks=c(-Inf, 0, Inf), palette=c("red", "blue")) + tm_shape(dd.spatial[a,]) + tm_text(text="SITE", size=0.8, just=c(0.7,2.5)) Map1
Export this map as a pdf file. (Un-comment the lines below to run the code below, i.e., remove the hashtag symsbols '#').
#if(!dir.exists(here::here("output"))) dir.create(here::here("output")) #tmap_save(Map1, file=here::here("output/ResidualMap.pdf"), width = 7, height = 5.5, units = "in", dpi = 300)
By changing the mode to view, we can create convert the map into an interactive plot with a background map from the internet (see Week 2).
Un-comment the lines below to run the code.
#tmap_mode("view") #Map1
What might explain the large residuals for the two sites 'PTC' and 'DESB'?
We can use the same index a
to exclude the potential outliers from the regression model:
mod.diff.minus2 <- lm(scale(FST.GESTE) ~ scale(NLT) + scale(C), data=dd.spatial[-a,]) summary(mod.diff.minus2)
par(mfrow=c(2,2)) plot(mod.diff.minus2, labels.id = dd.spatial$SITE[-a]) par(mfrow=c(1,1))
cor(dd.df[, c("RA", "He", "Size", "NLT", "C", "D")], use="pairwise.complete")
Questions:
For allelic richness:
mod.RA <- lm(scale(RA) ~ scale(NLT) + scale(C), data = dd.spatial) summary(mod.RA)
par(mfrow=c(2,2)) plot(mod.RA, labels.id = dd.spatial$SITE) par(mfrow=c(1,1))
For gene diversity (expected heterozygosity):
mod.He <- lm(scale(He) ~ scale(NLT) + scale(C), data = dd.spatial) summary(mod.He)
par(mfrow=c(2,2)) plot(mod.He, labels.id = dd.spatial$SITE) par(mfrow=c(1,1))
Would you expect a relationship between genetic diversity and genetic differentiation of individual patches?
Lets examine the correlation between gene diversity (He) and site-specific Fst:
cor(dd.site$He, dd.site$FST.GESTE, use = "pairwise.complete")
There are a number of possible reasons for such a correlation. Can you put forward some hypotheses to explain this relationship? See Lamy et al. (2012) for their interpretation.
Several patches fell dry between observation years, which is assumed to signify extinction of the local population. Does genetic evidence support this interpretation, i.e., is there genetic evidence of bottlenecks or founder effects in D. depressissimum?
dd.temporal <- dd.site[dd.site$MultiYear==TRUE,] dd.temporal.df <- sf::st_drop_geometry(dd.temporal) cor(dd.temporal.df[, c("Fst.temp", "APE", "NLT", "C")], use="pairwise.complete")
We can compare a number of competing models using the Akaike Information Criterion (AIC). Models with lower AIC are better (see Week 12).
mod.Fst.temp <- lm(scale(Fst.temp) ~ scale(APE), data=dd.temporal.df) summary(mod.Fst.temp) mod.Fst.temp.C <- lm(scale(Fst.temp) ~ scale(APE) + scale(C), data=dd.temporal.df) mod.Fst.temp.NLT <- lm(scale(Fst.temp) ~ scale(APE) + scale(NLT), data=dd.temporal.df) mod.Fst.temp.both <- lm(scale(Fst.temp) ~ scale(APE) + scale(NLT) + scale(C), data=dd.temporal.df) AIC(mod.Fst.temp, mod.Fst.temp.C, mod.Fst.temp.NLT, mod.Fst.temp.both)
The best model includes neither 'C' nor 'NLT'. Note that 'APE' is a binary variable, so in essence we're performing a t-test here.
res.Fst.temp <- t.test(Fst.temp ~ APE, data=dd.temporal, alternative = "less") res.Fst.temp
The effect is not statistically significant. Does that mean that we found no effect of apparent population extinctions on temporal Fst? Let's check effect size. For means, Cohen's effect size is measured by d (which is measured in units of standard deviations):
We can let R calculate effect size for us:
effsize::cohen.d(Fst.temp ~ factor(APE), data=dd.temporal.df)
So, we actually found a 'medium' effect (more than 0.5 standard deviations difference between group means). Maybe sample size was too small to have sufficient power?
Let's check sample size:
table(dd.temporal$APE[!is.na(dd.temporal.df$Fst.temp)])
Ah, that explains a lot. There were only 5 sites with apparent extinction, and 7 without.
Given that sample size, what was the statistical power of our test to detect at least a large effect (d = - 0.8), i.e., be able to reject the null hypothesis if such an effect is present in the population from which we sampled?
pwr::pwr.t2n.test(n1=7, n2=5, d=-0.8, alternative = "less")
So the power to detect at least a large effect, if it exists in the population, was only 0.355, way below the 0.8 (or even 0.95) that we would want to see. For a medium effect, the power is even smaller.
How large a sample would we have needed in each group to achieve a power of 0.8 to detect a large effect? And for a medium effect?
pwr::pwr.t.test(power = 0.8, d = -0.8, alternative = "less") pwr::pwr.t.test(power = 0.8, d = -0.5, alternative = "less")
More than 20 sites in each group would have been needed to detect a large effect, or more than 50 per group to detect a medium effect, with a power of 80%.
Hence, these particular results are inconclusive. There was a trend showing a large effect size but power was very low. This aspect of the study should ideally be repeated with a larger sample size before reaching any conclusions.
Note however that using additional evidence (e.g., population assignment tests), Lamy et al. (2012) concluded that extinctions were in fact less common in this system than previously assumed – in many cases of apparent extinction, individuals may still be present but just not detected.
Task: Build on your previous exercises and plot the sites on a map downloaded from the internet. Explore the relationships between Hexp, census population size and percent forest cover within 500 m of the site (forest may act as a barrier for grassland plants).
Hints:
a) Load packages: You may want to load the packages dplyr
and tmap
. Alternatively, you can use ::
to call functions from packages.
b) Import your datasets from Weeks 2 & 3 R Exercises. Here's an example for your code, adapt it as needed to import the R objects "Pulsatilla.longlat.rds" (sf
object, Week 2) and "H.pop.rds" (Week 3) that you saved previously:
Pulsatilla.longlat <- readRDS(here::here("output/Pulsatilla.longlat.rds"))
c) Plot sites on map from internet: adapt the code from section 3.d to plot the sampling locations on a background map from the internet. Next, modify code from section 3.d to add labels for all sites.
d) Combine data: Use the function dplyr::left_join
to add the variables from the dataset H.pop
to Pulsatilla.longlat
. Notes:
- This is important, as the order of populations may not be the same in the two datasets. - Remember to check the structure of the datasets (variable names and types) first so that you know which are the ID variables that you can use to match sites. - If the two ID variables are not of the same type (e.g., one if a `factor`, the other is `character`), it is best to change the format of one (e.g., with `as.character`) before doing the left-join.
e) Scatterplot with regression line: Create a scatterplot of Hexp
(y axis) plotted against nIndiv
(x axis). Add a regression line and, if possible, label points. You may modify code from section 3.b or use base R functions.
f) Regression analysis: Adapt code from section 3.c to perform a regression of Hexp
(response variable) on the predictor nIndiv
. Create residual plots and inspect them. What is the main issue here?
Questions: There is one influential point in the regression analysis:
LandGenCourse::detachAllPackages()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.