require(learnr) knitr::opts_chunk$set(echo = TRUE, eval = TRUE, include = TRUE, out.width="99%") knitr::knit_theme$set("rootwater") learnr::tutorial_options(exercise.timelimit = 10, exercise.eval = FALSE, exercise.completion = TRUE)
### HIDDEN ### require(vegan) require(labdsv) require(ade4) require(ecodist) require(fso) require(vegclust) require(ape) require(picante) require(mgcv) load('veg.rda') # assumed to be in your working dir xy <- veg$xy # spatial spe <- veg$spe # species env <- veg$env # environment tra <- veg$tra # traits phy <- veg$phy # phylogeny # rm(veg) # cleanup spe <- data.frame(log10(spe + 1)) env <- data.frame(decostand(scale(env, center=F), 'range')) tra <- data.frame(decostand(tra, 'range')) d <- vegdist(spe, method='bray', binary=T) D <- stepacross(d, 'shortest', toolong=1, trace=F) E <- dist(xy) # spatial distance matrix pc <- pcnm(E, w=rowSums(spe)/sum(spe)) # PCNMs *weighted* by abundances cl <- vegclustdist(D, mobileMemb=7, method='FCM', m=1.2, nstart=5) grp <- defuzzify(cl, 'max')[[2]] m1 <- metaMDS(D, k=2, try=200, trymax=500, trace=0) # NMS m2 <- cmdscale(D, k=2, add=T) # PCoA m3 <- prcomp(spe) # PCA grp_k <- cut(env$k, quantile(env$k, probs=seq(0,1,by=0.25)), include.lowest=T, labels=c('lo','med','hi','veryhi')) `get_palette` <- function() { pal <- c('#414487E6','#404688E6','#3F4889E6','#3E4989E6','#3E4C8AE6', '#3D4E8AE6','#3C508BE6','#3B528BE6','#3A548CE6','#39558CE6', '#38588CE6','#375A8CE6','#365C8DE6','#355E8DE6','#35608DE6', '#34618DE6','#33638DE6','#32658EE6','#31678EE6','#30698EE6', '#306A8EE6','#2F6C8EE6','#2E6E8EE6','#2D708EE6','#2C718EE6', '#2C738EE6','#2B748EE6','#2A768EE6','#2A788EE6','#297A8EE6', '#287C8EE6','#287D8EE6','#277F8EE6','#26818EE6','#26828EE6', '#25848EE6','#24868EE6','#24878EE6','#23898EE6','#228B8DE6', '#228D8DE6','#218F8DE6','#21908CE6','#20928CE6','#20938CE6', '#1F958BE6','#1F978BE6','#1F998AE6','#1F9A8AE6','#1E9C89E6', '#1F9E89E6','#1FA088E6','#1FA187E6','#20A386E6','#20A486E6', '#21A685E6','#22A884E6','#24AA83E6','#25AC82E6','#26AD81E6', '#28AE80E6','#2AB07FE6','#2DB27DE6','#2FB47CE6','#32B67AE6', '#34B679E6','#37B878E6','#3ABA76E6','#3DBC74E6','#40BD72E6', '#43BF71E6','#47C06FE6','#4AC16DE6','#4EC36BE6','#52C569E6', '#55C668E6','#59C864E6','#5DC863E6','#60CA60E6','#65CB5EE6', '#68CD5BE6','#6DCD59E6','#71CF57E6','#75D054E6','#7AD151E6', '#7FD34EE6','#83D44CE6','#87D549E6','#8CD646E6','#90D743E6', '#95D840E6','#9AD93CE6','#9FDA3AE6','#A3DA37E6','#A8DB34E6', '#ADDC30E6','#B2DD2DE6','#B7DE2AE6','#BBDF27E6') return(pal) } `colvec` <- function(x) { pal <- get_palette() return(pal[cut(as.numeric(x), breaks=length(pal), include.lowest=TRUE)]) } `makecwm` <- function (spe, tra) { spe <- as.matrix(spe) tra <- as.matrix(tra) `standardize` <- function(x) { (x - min(x, na.rm=TRUE)) / diff(range(x, na.rm=TRUE)) } tra <- apply(tra, MARGIN = 2, FUN = standardize) awt <- spe %*% tra awt / rowSums(spe, na.rm=TRUE) } cwm <- data.frame(makecwm(spe, tra)) # make the CWM traits matrix ### HIDDEN ###
This course demonstrates the practice of ecological community analysis in R. Our goal is to link biological reality with corresponding conceptual and mathematical models. We borrow and apply concepts from ecology, statistics and computer science, but do not get too bogged down in theoretical details. This short course aims to be an axe handle{target="_blank"} --- carve and shape it to your own practice.
We will:
We assume you have a basic familiarity with R. If not, please peruse here{target="_blank"}. At the least, you should be able to assign an object, distinguish between different objects (vector, matrix, data.frame, etc.), and index rows/columns of a matrix or data.frame.
We assume you have familiarity with basic statistics (population vs sample; mean, SD, SE, CI; simple linear models), and have at least dabbled in linear algebra (matrices; linear combinations), and perhaps become suspicious that calculus may indeed exist. If not, you could brush up on matrices{target="_blank"}.
Packages contain supplemental functions you will need. They are supplemental beyond "base" R. Ordinarily you would open R and install required packages as follows (but these dependencies should have automatically installed when you first installed this tutorial).
install.packages('vegan') # community ecology install.packages('labdsv') # community ecology install.packages('ade4') # community ecology install.packages('ecodist') # community ecology install.packages('fso') # ordination install.packages('vegclust') # clustering install.packages('ape') # phylogenetics install.packages('picante') # phylogenetics install.packages('mgcv') # nonlinear regression
To use functions provided by packages, load them into your R environment. Again, this has already been done for you in the tutorial.
require(vegan) require(labdsv) require(ade4) require(ecodist) require(fso) require(vegclust) require(ape) require(picante) require(mgcv)
Base R already contains lots of useful functions. This list has a few basic functions:
ls() dim(spe) NROW(spe) NCOL(spe) str(spe) head(spe) names(spe) x <- spe[,11] x <- spe[,'bolboschoenus_maritimus'] hist(x) plot(x) rm(x)
From the list above, copy-and-paste some commands into the code box below, and help describe a dataset called spe
. What is inside that object? What size is it?
### run code here
We've already said that you can use functions from base R or from other packages. But you can also "roll your own"! Functions take the generic form f <- function(x) { x }
. For example:
# function to pick an item from the vector by position `item_picker` <- function(x, pos = NULL) { x[pos] } z <- 1:19 # assign a sequence of numbers to object `z` item_picker(z, pos = 5) # pick the fifth element of `z`
We will use one core dataset throughout this tutorial. The veg
dataset gives information about spatial coordinates, species, environment, traits, and phylogeny for plants on the Mafragh coastal plain in North Africa. Specifically, veg
is a list containing five items:
xy
97 observations of 2 spatial coordinatesspe
97 observations of 56 plant speciesenv
97 observations of 11 soil environmental variablestra
12 traits for 56 plant speciesphy
phylogeny for 56 plant speciesA common task in R is to read from CSV files, e.g., x <- read.csv('path/to/file.csv')
, but here we will load data that's already in R format.
# load('./veg.rda') # assumed to be in your working dir xy <- veg$xy # spatial spe <- veg$spe # species env <- veg$env # environment tra <- veg$tra # traits phy <- veg$phy # phylogeny rm(veg) # cleanup ls() # objects now in this local environment
The Mafragh plain is in El Tarf Province of far northeast Algeria (36.84704$^{\circ}$, 7.945$^{\circ}$). According to documentation in R package ade4
:
This marshy coastal plain is a geomorphological feature, east of Port of Annaba, limited to the north by the Mediterranean Sea and a dune cordon, to the south by clay-sandstone Numidian massifs, to the west by a wadi and to East by an irrigated perimeter. It covers 15,000 ha of which 10,000 form the area of study.
Potential human pressures include grazing, irrigation and salination, plowing, and housing development. Its Köppen climate type is "Csa: Hot-summer Mediterranean climate".
The data originate from the field studies of Prof. Gérard de Bélair of Annaba in northeast Algeria. Born in France, de Bélair became a missionary priest in 1969 shortly after Algeria's independence, then an agronomist during the Agrarian Revolution, then a botanist, then professor, then conservationist (according to his lay and clergy accounts). As of 2021, he continues to actively publish on the vegetation of northern Algeria.
We should recognize the colonialist legacies that brought us here. France controlled Algeria from 1830 until its independence in 1962. Prior to that, Algeria was occupied by Ottoman Turks, Spaniards, Arab Muslims, Berber/Amazigh, Romans, Phoenicians, and prehistoric occupants dating to at least 1.8 million years ago. We may speculate that the region's vegetation and cultural histories have been closely intertwined.
The data were included in early versions of the ade4
software (Dray and Dufour 2007) which itself has its roots at Université Lyon 1, France. The present data are lightly modified from Appendix S4 of Pavoine et al. (2011), who also added a phylogeny.
de Bélair, G. 1981. Biogéographie et aménagement de la Plaine de la Mafragh (Annaba - Algérie) [Biogeography and development of the Mafragh Plain (Annaba, Algeria)]. Dissertation, Université Paul Valéry, Montpellier. http://www.sudoc.fr/00871262X
de Bélair, G. 1990. Structure, fonctionnement et perspectives de gestion de quatre écocomplexes lacustres et marécageux (El Kala, Est-Algérien) [Structure, function and management perspectives of four lake and marsh eco-complexes (El Kala, East Algeria)]. Dissertation, Université des sciences et techniques de Montpellier 2. http://www.sudoc.fr/007729197
de Bélair, G. and M. Bencheikh-Lehocine. 1987. Composition et déterminisme de la végétation d'une plaine côtière marécageuse: La Mafragh (Annaba, Algérie) [Composition and determinism of the vegetation of a marshy coastal plain: La Mafragh (Annaba, Algeria)]. Bulletin d'Ecologie, 18(4): 393–407.
Dray, S. and A. Dufour. 2007. The ade4
package: implementing the duality diagram for ecologists. Journal of Statistical Software 22(4): 1-20. doi:10.18637/jss.v022.i04
Pavoine, S., Vela, E., Gachet, S., de Bélair, G. and Bonsall, M. B. 2011. Linking patterns in phylogeny, traits, abiotic variables and space: a novel approach to linking environmental filtering and plant community assembly. Journal of Ecology 99: 165–175. doi:10.1111/j.1365-2745.2010.01743.x
A first step in exploring ecological data is to examine its structure. Copy-and-paste some of this code into the box below, and run one at a time:
str(spe) # community (species) abundance matrix str(env) # environmental matrix str(tra) # traits matrix str(phy, 1) # phylogeny str(xy) # spatial coordinates
### run code here
### create example matrix with some pesky values s <- spe s[5,2] <- NA s[6,4] <- (-999) ### index the row and column of any NA values which(is.na(s), arr.ind=TRUE) ### index the row and column of any negative values which(s < 0, arr.ind=TRUE) ### replace value by logical test s[is.na(s)] <- 777 head(s)
### load data spe <- veg$spe # species env <- veg$env # environment tra <- veg$tra # traits ### basic transformations spet <- data.frame(log10(spe + 1)) envt <- data.frame(vegan::decostand(scale(env, center=F), 'range')) trat <- data.frame(vegan::decostand(tra, 'range')) ### compare a few par(mfrow=c(1,2)) plot(tra$bfp, trat$bfp) plot(env$k2o, envt$k2o)
Sometimes we simply need to convert abundance values (counts or percentages) to binary presence/absence values (0 or 1).
s <- (spe > 0) * 1 # from numeric to 0/1 s[1:5, 1:7] # peek at a few range(s) # is this what you expect?
A quick way to search for multivariate outliers is to use the distance/dissimilarity matrix. A multivariate outlier has unusual values for a combination of multiple variables (McCune and Grace 2002). We can identify them as the sample unit(s) which exceed a defined number of standard deviations away from the grand mean (i.e., a large z-score).
### define outlier function `outliers` <- function (x, mult=2, method='bray') { d <- as.matrix(vegan::vegdist(x, method=method, binary=F, diag=T, upper=T)) diag(d) <- as.numeric(1) # avoid zero-multiplication m <- apply(d, 2, mean) # site means z <- scale(m) # z-scores data.frame(mean_dist = m, z = z, is_outlier = abs(z) >= mult) } ### try it o <- outliers(spe, mult=2) head(o, 7) which(o$is_outlier) ### Thinker: how would you find the z-score of these outliers?
Before proceeding with most multivariate analyses, our species abundance matrix must have 1) no missing values, 2) every site must have at least one species observation, and 3) every species must occur in at least one site.
!anyNA(spe) # expect TRUE, no missing values all(rowSums(spe, na.rm=T) != 0) # expect TRUE, no empty sites all(colSums(spe, na.rm=T) != 0) # expect TRUE, no empty species
Let's map the spatial coordinates.
# spatial plot(xy, pch=19, col='grey', xlab='Eastness', ylab='Northness')
Plot the species abundance matrix as a heatmap.
# species vegan::tabasco(spe, col=get_palette())
Plot the soils (environmental) matrix as a heatmap.
# environment vegan::tabasco(env, col=get_palette())
Plot the traits matrix as a heatmap.
# traits vegan::tabasco(tra, col=get_palette())
Plot the phylogenetic tree.
# phylogeny plot(phy, cex=0.6, no.margin=TRUE)
McCune, B., and J.B. Grace. 2002. Analysis of Ecological Communities. MjM Software Design, Gleneden Beach, Oregon, USA.
Let's examine the "diversity" of our sites and study area. Diversity can encompass any sort of variation, e.g., taxonomic, trait, or phylogenetic. Here we assume that spe
is a species abundance matrix where rows = sites, columns = taxa, and values must not be negative or missing.
gamma <- sum(colSums(spe) > 0) gamma
alpha <- rowSums(spe > 0) alpha # within-site avgalpha <- mean(rowSums(spe > 0)) avgalpha # average within-site
gamma <- sum(colSums(spe) > 0) avgalpha <- mean(rowSums(spe > 0)) beta <- gamma / avgalpha - 1 beta
McCune and Root (2015) proposed two model-free measures of how strongly multivariate species data exhibit a "dust bunny" distribution, that is, departure from multivariate normality because of many zero abundances due to few species occurring in any given sample unit. Quantifying this is important because multivariate normality is a core assumption of many analytical tools (PCA, CCA, linear regression models).
### 1 -- proportion of zeros in the matrix ### (independent of abundance) eps <- .Machine$double.eps # machine tolerance propzero <- sum(spe < eps) / prod(dim(spe)) cat('Proportion of zeros in matrix:', propzero, '\n') ### 2 -- "dust bunny index" of McCune and Root (2015) ### (integrates abundances) dbi <- 1 - mean(as.matrix(vegan::decostand(spe, method='max'))) cat('Dust bunny index:', dbi, '\n')
High species turnover (beta-diversity) can cause many site-pairs to share no species in common. What proportion of all pairs of sites have totally different sets of species?
### how many site-pairs share no species in common? z <- vegan::no.shared(spe) propnoshare <- sum(z) / length(z) cat('Proportion of no-share sites:', propnoshare, '\n')
McCune, B., and H.T. Root. 2015. Origin of the dust bunny distribution in ecological community data. Plant Ecology 216(5): 645-656.
Dissimilarities among sites help us position sites along gradients, organize them into clusters, or distinguish differences among groups of sites.
Create a pairwise dissimilarity matrix D
based on shared and unshared species among sites. Euclidean distances are seldom useful for species abundance data, here we use the more preferable Bray-Curtis (Sørensen) dissimilarity. Legendre and Legendre (2012) give a full survey of existing distance/dissimilarity measures.
d <- vegdist(spe, method='bray', binary=T) str(d) tabasco(as.matrix(d), col=get_palette()) ### Thinker: why is the diagonal 'blank'?
Sometimes a pair of sites are "maximally dissimilar": they share no species at all, so pairwise dissimilarity is impossible to quantify using a proportional measure (e.g., Bray-Curtis). Stepacross adjustment solves this by linking "no-share" sites with intermediate site(s) that share at least some species in common. Smith (2017) describes other possible solutions.
D <- stepacross(d, 'shortest', toolong = 1) plot(d, D, xlab = 'Original', ylab = 'Stepacross') ### Thinker: why the vertical stripe near 1.0?
Create a pairwise distance matrix E
based on Euclidean environmental distances among sites.
E <- vegdist(env, method='bray', binary=T) str(E) tabasco(as.matrix(E), col=get_palette())
Legendre, P., and L. Legendre. 2012. Numerical Ecology. Volume 24, Third Edition. Elsevier.
Smith, R.J. 2017. Solutions for loss of information in high-beta-diversity community data. Methods in Ecology and Evolution 8:68–74.
Order sites according to community compositions. Ordination distances indicate compositional dissimilarities. Nearby points are more compositionally similar than distant points.
# three kinds of ordination m1 <- metaMDS(D, k=2, try=50, trymax=51, trace=0) # NMS m2 <- cmdscale(D, k=2, add=T) # PCoA m3 <- prcomp(spe) # PCA
For ecological community data, NMS (Kruskal 1964) is almost always the best choice (Minchin 1987).
# color vector for plotting u <- get_palette() u <- u[1:nrow(spe)] # compare three kinds of ordination par(mfrow=c(1,3), bty='l', las=1) plot(m1$points, type='n', xlab='NMDS1', ylab='NMDS2') text(m1$points, rownames(m1$points),cex=.8, col=u) plot(m2$points, type='n', xlab='PCoA1', ylab='PCoA2') text(m2$points, rownames(m2$points),cex=.8, col=u) plot(m3$x, type='n', xlab='PCA1', ylab='PCA2') text(m3$x, rownames(m3$x),cex=.8, col=u) par(mfrow=c(1,1))
Plotting again, now overlay environmental variables.
### overlay gradients on the NMS using point colors par(mfrow=c(1,3), bty='l', las=1) plot(scores(m1), pch=16, col=colvec(env$k)) plot(scores(m1), pch=16, col=colvec(env$k2o)) plot(scores(m1), pch=16, col=colvec(env$sand)) par(mfrow=c(1,1)) ### overlay gradients on the NMS by fitting a GAM surface f1 <- ordisurf(m1 ~ env$k, plot = FALSE) plot(scores(m1), pch = 16, col = colvec(env$k)) plot(f1, add = TRUE, col = 1, lwd = 2)
Ordination methods have intrinsic measures of fit (e.g., stress for NMS). However, for all methods, a "good" ordination should have a decent correlation between the Euclidean inter-point ordination distances versus the original species dissimilarities used to construct the ordination.
cor(dist(m1$points), D, method='kendall') # NMS almost always best cor(dist(m2$points), D, method='kendall') cor(dist(m3$x), D, method='kendall') ### Thinker: why is this unfair to PCA?
We can also use Procrustes analysis (Peres-Neto and Jackson 2001) to examine the fit of a species ordination versus an environmental ordination.
pc <- prcomp(env[,sapply(env, is.numeric)], scale=T) # environmental PCA pc <- scores(pc, display='sites', choices=1:2) # environmental PCA scores eD <- vegdist(pc, method='euc') # Euclidean interpoint distances protest(m1$points, pc) # fit = correlation in Procrustes rotation protest(D, eD) # fit = correlation in Procrustes rotation
For NMS, the user must select the number of dimensions a priori. Yet, what is the "best" number of dimensions? We can examine how stress changes as we increment the number of dimensions, and select the dimensionality that balances sufficiently "low" stress with the fewest dimensions.
# define screeplot function, running NMS for varying 'k' dimensions `scree_nms` <- function(D, k=5, ...) { stress <- rep(NA, k) for (i in 1:k) { cat('calculating', i, 'of', k, 'dimensions...\n') stress[i] <- metaMDS(D, k=i, trace=0, ...)$stress } plot(1:k, stress, main='', xlab='Dimension', ylab='Stress', ylim=c(0, max(stress)*1.05), pch=16, las=1, bty='l') lines(1:k, stress) abline(0.20, 0, col='red', lty = 2) data.matrix(stress) } scree <- scree_nms(D, k=5, trymax=10) scree ### Thinker: how many dimensions would you select?
Kruskal, J.B. 1964. Multidimensional scaling by optimizing goodness-of-fit to a nonmetric hypothesis. Psychometrika 29: 1–28.
Minchin, P.R. 1987. An evaluation of relative robustness of techniques for ecological ordinations. Vegetatio 69: 89–107.
Peres-Neto, P. R., and D. A. Jackson. 2001. How well do multivariate data sets match? The advantages of a Procrustean superimposition approach over the Mantel test. Oecologia 129:169–178.
Constrained ordination addresses specific a priori hypotheses about variation in community compositions attributable to measured environmental gradients. It portrays only a fragment of all community variation, and is not used for exploratory analysis.
Two important tools are distance-based RDA (dbRDA, McArdle and Anderson 2001), and fuzzy set ordination (FSO, Roberts 2008), both of which let the user select their own dissimilarity measure. Other methods (redundancy analysis or constrained correspondence analysis) force an implicit dissimilarity measure and make linearity/distributional assumptions.
### RDA: redundancy analysis (assumes linear species responses) m1 <- rda(spe ~ k + sand + conductivity, data=env) ### CCA: constrained correspondence analysis (assumes unimodal species responses) m2 <- cca(spe ~ k + sand + conductivity, data=env) ### dbRDA: distance-based RDA m3 <- dbrda(D ~ k + sand + conductivity, data = env, comm = spe, add = 'lingoes') ### FSO: fuzzy set ordination m4 <- fso::fso( ~ k + sand + conductivity, dis = D, data = env, permute = F) ### summaries print(m1) print(m2) print(m3) summary(m4) ### compare four kinds of ordination u <- colvec(env$k) # color by soil potassium par(mfrow=c(2,2), bty='l', las=1) plot(m1, disp='wa') points(m1, pch=16, col=u) plot(m2, disp='wa') points(m2, pch=16, col=u) plot(m3, disp='wa') points(m3, pch=16, col=u) plot(scores(m4$mu[,1:2]), pch=16, col=u, asp=1, xlab='MFSO1, potassium', ylab='MFSO2, sand')
McArdle, B.H. & Anderson, M.J. 2001. Fitting multivariate models to community data: a comment on distance-based redundancy analysis. Ecology 82: 290-297.
Roberts, D.W. 2008. Statistical analysis of multidimensional fuzzy set ordinations. Ecology 89: 1246-1260.
Ward's clustering is a hierarchical cluster analysis method based on community dissimilarities (Murtagh and Legendre 2014).
### perform the clustering k <- 7 # number of groups is specified in advance cl <- hclust(D, method='ward.D2') # clustering solution grp <- cutree(cl, k) # group memberships ### plot the dendrogram, and plot the groups onto an NMS ordination u <- colvec(grp) par(mfrow=c(1,2), bty='l', las=1, oma=c(0,0,0,0), mar=c(3,3,0.5,0.5)) plot(cl, cex=0.5, main='') rect.hclust(cl, k, border=unique(u)) plot(scores(m1), pch=16, cex=0.7, col=u) ordicluster(m1, cl, prune=k, lwd=0.5, col=u)
Non-hierarchical methods also exist. And, it may be useful to define groups based on fuzzy rather than crisp partitions, using fuzzy clustering (de Caceres et al. 2010).
ns <- 1 # number of random starts (increase for real analyses!) k <- 7 # number of groups is specified in advance # fuzzy c-means cl <- vegclustdist(D, mobileMemb=k, method='FCM', m=1.2, nstart=ns) # fuzzy c-means with a noise cluster cl_a <- vegclustdist(D, mobileMemb=k, method='NC', m=1.2, dnoise=0.8, nstart=ns) # fuzzy c-medoids cl_b <- vegclustdist(D, mobileMemb=k, method='FCMdd', m=1.2, nstart=ns) # crisp k-means cl_c <- vegclustdist(D, mobileMemb=k, method='KM', nstart=ns) # how well do the other methods agree with fuzzy c-means? concordance(cl, cl_a) concordance(cl, cl_b) concordance(cl, cl_c)
Let's examine group memberships grp
from the fuzzy clustering. While memberships are fuzzy, we can ask for a crisp membership based on a threshold or else the maximum fuzzy membership.
head(round(cl$memb, 2)) # fuzzy membership (grp <- defuzzify(cl, 'cut', alpha=0.8)[[2]]) # crisp, at threshold (incl NAs) (grp <- defuzzify(cl, 'max')[[2]]) # crisp membership, at max membership table(grp, useNA='always') # tally points per group
Also examine the group characteristics:
### examine group characteristics cl$size # group size (sum of fuzzy memberships) cl$withinss # group sums-of-squares round(ctr <- clustcentroid(D, grp),3) # group centroids (medoids <- clustmedoid(spe, grp)) # *indices* of group medoids clustvar(D, grp) # within-group variance clustvar(D) # pooled among-group variance as.dist(D_ctr <- as.matrix(interclustdist(D, grp))) # dissim among centroids as.dist(D_med <- as.matrix(D)[medoids,medoids]) # dissim among medoids
How are species represented per group? Examine species frequency (constancy) per group:
con <- clustconst(spe, memb=as.memb(grp)) tabasco(t(con), col=get_palette()) # constancy per group summary(con, mode='all') # constancy per group summary(con, mode='cluster', 'M1') # examine one particular group
How to choose a valid number of groups? Systematically vary k, then evaluate. PCN = normalized partition coefficient (higher is better). PEN = normalized partition entropy (lower is better).
### systematically vary k from 3 to 9, then evaluate cl_k <- random.vegclustdist(D, cmin=3, cmax=9, nstart=3, method='FCM', m=1.2) sapply(seq_along(cl_k), function(i) vegclustIndex(cl_k[[i]])) # evaluate!
You may wish to add new sites to an existing classification. For example, imagine we fit a model using "western" sites, and now must assign values for the "eastern" sites.
### assign *new* sites to an *existing* classification --- i <- (xy$x < 220) # split the dataset, based on location west <- spe[i,] # old calibration data = west west <- west[,colSums(west) > 0] # exclude zero-sum species east <- spe[!i,] # new evaluation data = east east <- east[,colSums(east) > 0] # exclude zero-sum species k <- 7 # number of groups specified in advance (grp_west <- cutree(hclust(vegdist(west), 'ward.D2'), k)) # *existing* model cl_west <- as.vegclust(vegdist(west), grp_west) # convert to vegclust object m <- conformveg(west, east) # merge datasets DD <- as.matrix(vegdist(rbind(m$x,m$y))) # ALL dissimilarities DD <- DD[(NROW(west)+1):NCOL(DD), 1:NROW(west)] # rows = new, cols = old cl_east <- vegclass(cl_west, DD) # assign new points given existing model grp_west <- defuzzify(cl_west)[[2]] # memberships of *old* points grp_east <- defuzzify(cl_east)[[2]] # memberships of *new* points grp <- c(grp_west,grp_east) par(mfrow=c(1,2), oma=c(0,0,0,0), mar=c(2,2,1,0)) plot(xy, col=i+1, pch=16) # map east/west locations plot(xy, col=colvec(grp), pch=16) # map group memberships text(xy, grp) # identify group membership legend('bottomleft', leg=1:k, fill=unique(colvec(grp)), title='Group', cex=0.7)
# optimal partitioning require(optpart)
de Caceres, M., X. Font, F. Oliva. 2010. The management of numerical vegetation classifications with fuzzy clustering methods. Journal of Vegetation Science 21: 1138–1151.
Murtagh, F., and P. Legendre. 2014. Ward's hierarchical agglomerative clustering method: which algorithms implement Ward's criterion? Journal of Classification 31: 274–295.
Let's assume PERMANOVA is the best choice for comparing groups in multivariate space. Other methods (ANOSIM, MRPP) exist, but PERMANOVA usually has better power and false-detection rates, even when groups may have different dispersions (Anderson and Walsh 2013) among other operational advantages.
### define and visualize four groups brk <- quantile(env$k, probs=seq(0,1,by=0.25)) # define breaks grp_k <- cut(env$k, brk, include.lowest=T, labels=c('lo','med','hi','veryhi')) # group memberships table(grp_k, useNA='always') # group tally tapply(env$k, INDEX = grp_k, FUN = mean) # group means plot(m1$points, pch=NA) # visualize on the NMS text(m1$points, labels=grp_k, col=as.numeric(grp_k)) # group memberships ordispider(m1, groups=grp_k, col=1:4) # group centroids
PERMANOVA tests for differences in multivariate centroid, in the space of the chosen dissimilarity measure. Recall that matrix D
was our matrix of pairwise Bray-Curtis dissimilarities.
### permanova: test for differences in multivariate *centroid* a1 <- adonis(D ~ grp_k, permu=999) a1
How do we interpret this result? Based on Bray-Curtis dissimilarities, four groups defined by soil potassium significantly differed in community compositions (PERMANOVA pseudo-F = 10.2, permutational p = 0.001, R^2^ = 0.25). The NMS ordination helps us see that this is largely driven by the "veryhi" group. Which you can test formally by repeating PERMANOVA for pairwise multiple comparisons.
A related test, PERMDISP, tests for differences in multivariate dispersion, in the space of the chosen dissimilarity measure. This is interpreted as a test of among-group homogeneity of community compositions, i.e., beta-diversity. Conveniently, we can ask for pairwise multiple comparisons.
### permdisp: test for differences in multivariate *dispersion* b1 <- betadisper(D, grp_k) b1 permutest(b1, pairwise=TRUE, permu=999) boxplot(b1)
How do we interpret this? Based on Principal Coordinates reduction of original Bray-Curtis dissimilarities, the four groups defined by soil potassium significantly differed in homogeneity of community compositions (PERMDISP F = 7.98, permutational p = 0.001). This was due to the "veryhi" potassium group having significantly lower dispersion (lower beta-diversity) than the lower-potassium groups.
PERMANOVA works as 'regular' ANOVA when using Euclidean distances. We'd expect a different p-value since we're using permutations, but identical F-values and sums-of-squares.
De <- dist(env$k, 'euc') # Euclidean distances adonis(De ~ grp_k, perm=999) # examine F and SS print(anova(lm(k ~ grp, env))) # expect identical F and SS
For blocked designs (e.g., for repeated measures or genuine blocked designs), we must specify how permutations occur. Because units are only exchangeable within blocks, the permutations should occur within blocks.
### blocked design: permutations must occur within strata blk <- factor(LETTERS[sample(rep(1:12,len=nrow(env)))]) # arbitrary 'blocks' plot(m1) # visualize on NMS ordispider(m1, blk, label=TRUE) customperm <- how(nperm=999) # set number of permutations setBlocks(customperm) <- blk # permute only *within* blocks adonis(D ~ grp_k, permutations=customperm) # correct test
Anderson, M. J., and D. C. I. Walsh. 2013. PERMANOVA, ANOSIM, and the Mantel test in the face of heterogeneous dispersions: What null hypothesis are you testing? Ecological Monographs 83:557–574. https://doi.org/10.1890/12-2010.1
The Indicator Value (IndVal) is the product of a species' fidelity and specificity to a group (Dufrêne and Legendre 1997).
Here we use labdsv::indval
to examine indicators of single groups, but also see R package indicspecies
to test whether species might indicate multiple groups.
First, let's use the same groups we used for PERMANOVA: which species are indicative of the four groups defined by soil potassium? We can compare these "real" indicator species against a random expectation.
### real groups grp <- cut(env$k, quantile(env$k, probs=seq(0,1,by=0.25)), include.lowest=T, labels=c('lo','med','hi','veryhi')) # group memberships iv <- labdsv::indval(spe, grp) # indicator species analysis for *real* groups summary(iv) # IndVal observed ### random groups rnd <- sample(grp, length(grp), replace=T) # define random groups by bootstrapping ivr <- labdsv::indval(spe, rnd) # indicator species analysis for *random* groups summary(ivr) # IndVal expected at random
How many indicator species would you expect at random?
### null expectation setting alpha = 5% ceiling( ncol(spe) * 0.05 )
If you defined groups by a clustering method using species information, do not use IndVal to find indicator species of those groups (that would be circular logic).
Dufrêne, M. and Legendre, P. 1997. Species assemblages and indicator species: the need for a flexible asymmetrical approach. Ecological Monographs 67(3): 345--366.
de Cáceres, M., P. Legendre, and M. Moretti. 2010. Improving indicator species analysis by combining groups of sites. Oikos 119:1674–1684.
Recall the traits matrix, which assigns each species (rows) a value for each of twelve traits (columns):
names(tra) head(tra) hist(tra$lfp, breaks=11)
We can ask several questions about trait diversity, convergence/divergence, and relationships to environment.
The mean pairwise distance (MPD) shows how far (on average) the traits of one community are relative to all others. Most commonly these are Euclidean or Gower distances. A randomization test gives a test statistic called SES, standardized effect size. Significant positive SES indicates traits are less similar than expected at random (i.e., species have divergent traits), while negative SES indicates traits are more similar than expected at random (i.e., species tend to converge on similar traits). We can examine trait divergence/convergence along a gradient, or test for divergence/convergence within groups.
### Euclidean trait distance (traits already scaled 0-1) Dt <- dist(tra, method='euc') ### calculate trait SES of mean pairwise distances in sites ses <- picante::ses.mpd(spe, Dt, null.model='richness', abund=F, runs=999) ### plot SES across the gradient plot(ses$mpd.obs.z ~ env$k, ylab='Trait SES(MPD)', xlab='Soil potassium', col=ifelse(ses$mpd.obs.p < 0.05, 'red','black')) abline(lm(ses$mpd.obs.z ~ env$k)) # regression line abline(h=0, lty=2) # random-traits line text(0.9, 1, 'Divergent') text(0.9, -4, 'Convergent')
To evaluate the strength of relationship between traits and environment, we can perform ordination of sites based on traits (rather than species). We first need to make a community-weighted means (CWM) matrix, a traits-by-site matrix whose values are trait means across the species in each site.
`makecwm` <- function (spe, tra) { spe <- as.matrix(spe) tra <- as.matrix(tra) `standardize` <- function(x) { (x - min(x, na.rm=TRUE)) / diff(range(x, na.rm=TRUE)) } tra <- apply(tra, MARGIN = 2, FUN = standardize) awt <- spe %*% tra # abundance-weighted trait totals awt / rowSums(spe, na.rm=TRUE) # community-weighted traits matrix } cwm <- data.frame(makecwm(spe, tra)) # make the CWM traits matrix tabasco(cwm, col=get_palette()) # visualize
Ordination of the CWM matrix gives a configuration of sites in traits space, where distance between points indicates dissimilarity of trait compositions (trait syndromes). Nonlinear regression of environmental variables in this traits space can give us the trait-environment correlations that we're after.
### NMS ordination of traits m <- metaMDS(cwm, 'altGower', k=2, trace=0) plot(m, cex=cwm$lfp) # plot sites in *traits space* ### fit one environmental variable in traits space o <- ordisurf(m, env$k, col=3, add=T) ### fit *all* environmental variables in traits space fit <- t(sapply(env, function(i) { g <- summary(vegan::ordisurf(m ~ i, plot=F)) # fit GAM c(`pval` = as.numeric(sprintf('%.3f',round(g$s.pv,3))), `r2` = as.numeric(sprintf('%.3f',round(g$r.sq,3)))) })) cat('GAM p-values and goodness-of-fit, sorted\n', '----------------------------------------\n') fit[rev(order(fit[,2])),]
Here, the environmental variables with the strongest nonlinear relationship to trait compositions were potassium, sodium, and elevation.
RLQ and fourth-corner analysis can also detect linear traits-environment relationships as mediated by the species abundance matrix. There is no need to calculate the CWM matrix. Because it boils down to a series of PCA operations, the RLQ method does not readily admit nonlinear relationships. It is included here for completeness.
### the RLQ method o_spe <- dudi.coa(spe, scannf = FALSE, nf = 2) o_env <- dudi.hillsmith(env, scannf = FALSE, nf = 2, row.w = o_spe$lw) o_tra <- dudi.hillsmith(tra, scannf = FALSE, nf = 2, row.w = o_spe$cw) r <- rlq(o_env, o_spe, o_tra, scannf = FALSE, nf = 2) plot(r) summary(r) randtest(r) fourthcorner.rlq(r, type='Q.axes') fourthcorner.rlq(r, type='R.axes')
Often we suspect that close phylogenetic relatives may share similar trait values than more distant relatives (i.e., trait conservatism). For further analyses it may be useful to "remove" the effects of relatedness from traits, based on the phylogenetic variance-covariance matrix.
### function to do phylogenetic correction `phylo_corr` <- function(phy, tra, ...){ if (class(phy) != 'phylo') stop('phy must be of class `phylo`') if (!is.data.frame(tra)) tra <- as.data.frame(tra) rn <- dimnames(tra)[[1]] # species names cn <- dimnames(tra)[[2]] # traits names n <- dim(tra)[[2]] # n of traits if (!identical(phy$tip.label, rn)) stop('species name mismatch') G <- ape::vcv.phylo(phy) # phylo variance-covariance matrix corfac <- chol(solve(G)) # 'correction factor' per Butler et al. (2000) U <- as.data.frame( # initialize U for 'corrected' trait values matrix(NA, nrow=dim(tra)[[1]], ncol=dim(tra)[[2]], dimnames=list(rn,cn))) tra <- droplevels(tra) # drop unused factor levels tra[,sapply(tra,function(x)nlevels(x)==1)] <- 1 # correct 1-level factors for(j in 1:n){ # corrections per individual trait M <- model.matrix(as.formula(paste0("~0+",cn[j])), data=tra) corrtra <- data.frame(corfac %*% M) U[,j] <- corrtra[rn, ,drop=FALSE] } return(U) } ### do the correction, then compare p <- phylo_corr(phy, tra) par(mfrow=c(2,2)) for(i in 7:10) { plot(tra[,i], p[,i], pch=16, cex=0.7, col='#00000050', main=dimnames(p)[[2]][i], xlab='Original', ylab='Corrected') }
Butler, M.A., T.W. Schoener, and J.B. Losos. 2000. The relationship between sexual size dimorphism and habitat use in Greater Antillean Anolis lizards. Evolution 54:259-272.
Eklöf, A., and D.B. Stouffer. 2016. The phylogenetic component of food web structure and intervality. Theoretical Ecology 9:107–115.
### basic plotting par(mfrow=c(1,3), mar=c(0,0,0,0), oma=c(0,0,0,0)) # plotting parameters phy # basic structure plot(phy, cex=0.75, no.margin=T) # basic plotting ### color tip labels by trait values plot(phy, cex=0.75, tip.color=colvec(tra$lfp), no.margin=T) # axisPhylo(cex=0.6) ### symbolize trait values at tips plot(phy, cex=0.75, label.offset = 48, no.margin=T) tiplabels(pch = 21, bg = c(tra$annual), adj = 0) tiplabels(pch = 21, bg = c(tra$biennial), adj = 15) tiplabels(pch = 21, bg = c(tra$perennial), adj = 30)
According to some people, phylogenetic diversity lets you infer community assembly processes (competitive exclusion, environmental filtering, neutral dynamics). Some other serious people doubt this.
### phylogenetic diversity metrics Dp <- cophenetic(phy) # phylogenetic distances ### Faith's PD phylogenetic diversity = total branch length connecting all species in a site fpd <- pd(spe, phy) # Faith's PD ### mean pairwise distance mpd <- ses.mpd(spe, Dp, null.model='independentswap') ### mean nearest taxon distance mnd <- ses.mntd(spe, Dp, null.model='independentswap') ### bring all together phy_div <- cbind(fpd, mpd = mpd$mpd.obs.z, mntd = mnd$mntd.obs.z) head(phy_div) pairs(phy_div)
Phylogenetic Bray-Curtis dissimilarities = fraction of branch-length shared between communities. Different from the cophenetic branch lengths (between-species), this shows sample unit dissimilarities (between-sites).
### correlation between phylogenetic and taxonomic beta-diversity Dp <- phylosor(spe, phy) # phylogenetic distances protest(Dp, D) # procrustes correlation ### side-by-side NMS ordinations par(mfrow=c(1,2)) plot(scores(metaMDS(D)), col=colvec(1:97), pch=19, main='Taxonomic NMS') plot(scores(metaMDS(Dp)), col=colvec(1:97), pch=19, main='Phylogenetic NMS')
We focus on a measure called Blomberg's K to detect phylogenetic signal in traits (Revell et al. 2008). A value of K > 1.0 indicates that traits of related species are more similar than expected from Brownian motion, and K < 1.0 indicates traits of related species are more dissimilar than expected at random.
### detect phylogenetic signal for traits K <- sapply(tra, FUN=function(j){ names(j) <- rownames(tra) round(picante::phylosignal(j, ape::multi2di(phy)),6)}) as.matrix(K) ### Thinker: which traits have strong evidence for conservatism?
Revell, L.J., L.J. Harmon, and D.C. Collar. 2008. Phylogenetic signal, evolutionary process, and rate. Systematic Biology 57:591–601.
The Mantel test (Mantel 1967) compares a community dissimilarity matrix to a spatial distance matrix (it can also be used to compare any two dissimilarity matrices, for example, to test goodness-of-fit of Euclidean ordination distances to original community dissimilarities, or two ordination configurations). Here we ask, is compositional dissimilarity related to spatial distances?
E <- dist(xy) # spatial distance matrix vegan::mantel(D, E, method='spearman') # spearman *rank* correlation # in `ecodist` package, we also get useful bootstrap CIs: ecodist::mantel(D ~ E, mrank=T) # spearman *rank* correlation
A Mantel correlogram depicts this correlation of compositional dissimilarity at specified lag distances.
brk <- seq(0, round(max(E),-1), by=10) plot(ecodist::mgram(D, E, breaks=brk, nperm=99, mrank=T, nboot=500)) abline(h=0, col='red') plot(vegan::mantel.correlog(D, E, break.pts=brk, cutoff=F, r.type='spearman', nperm=99, mult='holm', progressive=T))
MRM: multiple regression on distance matrices (Lichstein 2007).
# species dissimilarities are related to space (extremely weakly) and # potassium (moderately) ecodist::MRM(D ~ dist(xy) + dist(env$k), nperm=999) # abundance of cosmopolitan bulrush is NOT related to space, # but is related to potassium (moderately) ecodist::MRM(dist(spe$bolboschoenus_maritimus) ~ dist(xy) + dist(env$k), nperm=999)
We can quantify spatial distances among sites with a Euclidean distance matrix. Performing principal coordinates analysis on this "neighbor matrix" gives us PCNMs: principle coordinates of neighbor matrices. PCNM scores are interpreted as orthogonal explanatory variables at varying spatial scales: think of them as progressively broad-, medium- and local-scale predictors.
E <- dist(xy) # euclidean distances between sites pc <- pcnm(E) # principal coordinates of neighbor matrices par(mfrow=c(2,3), oma=c(0,0,0,0), mar=c(2,2,2,0)) # map PCNMs across study area v <- 2^(0:5) # select PCNMs at increasingly fine scale for(i in seq_along(v)) { ordisurf(xy, scores(pc, choices=v[i]), bubble=3, main=paste0('PCNM ',v[i])) }
We can use weighted principle coordinates of neighbor matrices as explanatory variables in CCA to construct a multiscale ordination (MSO, Wagner 2004) variogram. Residual variance here would indicate spatial autocorrelation; we hope for this to exhibit no trend with distance, since PCNMs account for spatial autocorrelation.
rs <- rowSums(spe) / sum(spe) # sites weighted by abundances pc <- pcnm(E, w=rs) # *weighted* PCNMs ord <- cca(spe ~ scores(pc)) # CCA: species constrained by space msord <- mso(ord, xy) # multiscale ordination msoplot(msord, expl=T) # MSO variogram: spatial trend in residuals?
Using the weighted PCNMs above, we can perform variance partitioning of species dissimilarities in dbRDA. This particular variation partitioning shows the amount of variation in species dissimilarities (D) explained by 1) space, 2) environment, and 3) spatially-structured environment.
(vp <- varpart(D, env, scores(pc))) plot(vp, bg = c(2,4)) text(-0.2, 0.3, 'Environment', cex=0.7) text(0.5, 0.3, 'Spatially-\nstructured\nenviro', cex=0.7) text(1.2, 0.3, 'Space', cex=0.7)
We can include more explanatory matrices -- here, species dissimilarities explained by 1) environment, 2) broad spatial, and 3) local spatial.
pc_broad <- scores(pc)[,1:10] pc_local <- scores(pc)[,11:59] (vp <- varpart(D, env, pc_broad, pc_local)) plot(vp, cex=0.7, bg=2:5) text(-0.2, 0.3, 'Environment', cex=0.7) text(0.5, 0.55, 'Broad-\nstructured\nenviro', cex=0.7) text(1.2, 0.3, 'Broad\nspatial', cex=0.7) text(-0.5, -0.75, 'Local-\nstructured\nenviro', cex=0.7) text(0.5, -1.15, 'Local\nspatial', cex=0.7) ### Thinker: why do "broad" and "local" share in common zero variance explained?
We can include more explanatory matrices -- here, species dissimilarities explained by 1) soil fractions, 2) soil chemistry, and 3) spatial processes.
(vp <- varpart(D, # dissimilarities ~ clay + sand + silt, # soil fractions ~ mg + k + conductivity + na_l, # soil chemistry scores(pc), # spatial predictors data=env)) plot(vp, cex=0.7, bg=2:5) text(-0.2, 0.3, 'Soil fractions', cex=0.7) text(1.2, 0.3, 'Soil chemistry', cex=0.7) text(0.5, -1.15, 'Spatial', cex=0.7) text(-0.5, -0.75, 'Spatially-\nstructured\nsoil fractions', cex=0.7) text(1.5, -0.75, 'Spatially-\nstructured\nsoil chemistry', cex=0.7) ### Thinker: how do you interpret this figure? ### Moderate spatial structuring, a bit owing to soil chemistry, ### but mostly to other unmeasured spatial processes (dispersal?). ### Thinker: what does this NOT explain? ### What is the biological cause of residual variation? ### We did ordination in 1) species space, 2) env space, ### 3) env constrained spp space?
Borcard D. and Legendre P. 2002. All-scale spatial analysis of ecological data by means of principal coordinates of neighbour matrices. Ecological Modelling 153: 51-68.
Dray, S., R. Pélissier, P. Couteron, M.-J. Fortin, P. Legendre, P. R. Peres-Neto, E. Bellier, R. Bivand, F. G. Blanchet, M. De Cáceres, A.-B. Dufour, E. Heegaard, T. Jombart, F. Munoz, J. Oksanen, J. Thioulouse, and H. H. Wagner. 2012. Community ecology in the age of multivariate multiscale spatial analysis. Ecological Monographs 82: 257-275.
Lichstein, J. 2007. Multiple regression on distance matrices: a multivariate spatial analysis tool. Plant Ecology 188:117-131.
Mantel, N. 1967. The detection of disease clustering and a generalized regression approach. Cancer Research 27:209-220.
Wagner, H.H. 2004. Direct multi-scale ordination with canonical correspondence analysis. Ecology 85: 342-351.
This workflow uses common best-practices in ecological community analysis. It is somewhat opinionated, and is by no means universal or obligatory!
Data I/O: read data into R (typically read.csv
if CSV files, or load
if in R binary format).
Basic exploration: use basic commands to examine data structure (str
), peek at data (head
), see the frequency distribution (hist
), or make biplots (plot
or pairs
).
Advanced exploration: vegan::tabasco
or labdsv::abuocc
to visualize species abundance matrices.
Transformations: use vegan::tabasco
for relativizations, and arithmetic for transformations. Try log~10~(x+1) for abundances, row-standardize abundances to put sites on equal footing, and range-standardize for environment or traits (so all on same scale) -- McCune and Grace 2002
Outliers: sample units with mean dissimilarities > defined number of SDs from grand mean -- McCune and Grace 2002
Community dissimilarities: use vegan::vegdist
to get Bray-Curtis dissimilarities -- Legendre and Legendre 2012
Environmental distances: use scale
and dist
to get (scaled) Euclidean distances
Dissimilarity adjustments: use vegan::stepacross
for step-across shortest distance in the presence of high beta-diversity -- Smith 2017
Ordination of sites, based on communities: use vegan::metaMDS
for NMS -- Minchin 1987
Ordination of sites, based on environment: use stats::prcomp
for PCA -- linear combinations
Ordination of sites, based on communities constrained by environment: use vegan::dbrda
for dbRDA, or fso::mfso
for FSO -- McArdle and Anderson 2001 or Roberts 2008
Comparing ordination configurations: use vegan::protest
for Procrustes analysis -- Peres-Neto and Jackson 2001
Clustering: use vegclust::vegclust
for fuzzy clustering, or optpart::optpart
for optimal partitioning -- de Cáceres 2010 or Roberts 2015
Group differences: use vegan::adonis2
for PERMANOVA -- Anderson and Walsh 2013
Group indicator species: use labdsv::indval
for the usual single-group IndVal, or indicspecies::multipatt
for multi-group IndVal -- Dufrene and Legendre 1997 or
de Cáceres et al. 2010
R for Data Science{target="_blank"}
Data Carpentry{target="_blank"}
Dave Roberts' labdsv pages{target="_blank"}
Jari Oksanen's vegan introduction - PDF{target="_blank"}
David Zelený's community analysis{target="_blank"}
Tad Fukami's ecological statistics{target="_blank"}
Steve Kembel traits and phylogenies tutorial{target="_blank"}
CRAN Environmetrics Task View{target="_blank"}
vegan{target="_blank"}
labdsv{target="_blank"}
optpart{target="_blank"}
ade4{target="_blank"}
ecodist{target="_blank"}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.