Helpful resources

  1. "Analysis of Ecological Communities" by McCune & Grace - Great place to start to understand the statistics! Detailed, but not overwhelming. Provides specific recommendations helpful for community ecology. Does NOT provide any R coding help. (Cheapest hardcopies usually from the authors' website: https://www.wildblueberrymedia.net/store/analysis-of-ecological-communities)
  2. "Numerical Ecology" by Legendre & Legendre - Loads of detail about the stats. Comprehensive descriptions of almost every ordination and clustering method available. Recommends some R packages but doesn't provide code. Available free on DOI VPN here: https://www.sciencedirect.com/bookseries/developments-in-environmental-modelling/vol/24/suppl/C
  3. Vegan R package vignette - Lots of handy R coding demos for ordination. Light on stats descriptions. See https://cran.r-project.org/web/packages/vegan/vignettes/intro-vegan.pdf
  4. Some additional handy online tutorials: https://www.rpubs.com/an-bui/vegan-cheat-sheet and http://www.umass.edu/landeco/teaching/multivariate/schedule/summary.handouts.pdf
  5. Your pals in the Moab UseR group!

Community analysis goals and workflow

In this demo, I'm trying to identify plant communities within a set of plots that occur on similar soil types (ask Travis about his Ecological Site Groups if you want the nitty-gritty). I want to know how distinct the plant communities are from one another, and whether there are any major environmental gradients that separate them.

The general workflow is:

  1. Format your data, pick a distance metric, and use those to calculate a distance matrix
  2. Group the plots with a cluster analysis
  3. Determine the appropriate number of groups (many method options - I'll do an indicator species analysis)
  4. Investigate relationships among groups (e.g. do they overlap a lot?) with ordination
  5. Look at how environmental variables correspond to the plant community structure with more ordination-related tools

I've structured this around plant community analysis because that's my jam, but there are plenty of other reasons to use clustering and ordination - feel free to bring up your own scenarios as we go through this!

Data prep and distance matrix

knitr::opts_chunk$set(echo = TRUE)
target_ESG <- "Semiarid_Warm_SandyUplands_LoamyUplands"
user <- "Anna"
#user <- "VPN"


library(STMdevelopment)
library(dplyr) # for data wrangling
library(kableExtra) # for formatting tables in RMarkdown
library(vegan) # for clustering and ordination!
indicators <- c("AH_C3NativePerenGrassCover", 
                "AH_C3IntroducedPerenGrassCover",
                "AH_C4NativePerenGrassCover", 
                "AH_C4IntroducedPerenGrassCover",
                #"AH_IntroducedPerenGrassCover", # Non-native perennial grasses
                "AH_NativePerenForbCover", # Native perennial forbs
                "AH_IntroducedPerenForbCover", # Non-native perennial forbs
                "AH_NativeAnnGrassCover", # Native annual grasses
                "AH_IntroducedAnnGrassCover", # Non-native annual grasses
                "AH_NativeAnnForbCover", # Native annual forbs
                "AH_IntroducedAnnForbCover", # Non-native annual forbs
                "BareSoilCover", # Bare soil
                "CP_percent_100to200", # Canopy gaps > 100 cm TODO do we want annual or perennial gaps?
                "CP_percent_200plus",
                "FH_LichenCover", # Lichen + moss combined cover
                "FH_MossCover")

shrub_by_spp <- T # All shrubs and sub-shrubs by species
subshrub_by_spp <- T
tree_by_spp <- T # All trees by species
opuntia_combined <- T

data_sources <- c("BadgerWash",
                             "CRCMonitoring",
                             "IM_NCPN",
                             "LMF",
                             #"NRI",
                             "Parashant",
                             "AIM", 
                             "VanScoyocThesis"
                           )

plot_data <- plot_data_pull(user=user,
                             target_ESG = target_ESG,
                           data_sources = data_sources,
                           indicators = indicators,
                           shrub_by_spp = shrub_by_spp,
                           subshrub_by_spp = subshrub_by_spp,
                           tree_by_spp = tree_by_spp,
                           opuntia_combined = opuntia_combined
                           )

Decide what variables to include. Often in community ecology, people include every species as its own variable. This data set covers a large area with a LOT of herbaceous species, so we lumped them into functional groups pertinent to our objectives.

indicator_descriptions <- make_indicator_descriptions(indicators = indicators,
                                                      shrub_by_spp = shrub_by_spp,
                                                      subshrub_by_spp = subshrub_by_spp,
                                                      tree_by_spp = tree_by_spp,
                                                      opuntia_combined = opuntia_combined)

kbl(x = indicator_descriptions[,-1], format = "html", row.names = F, align = "c",
    caption = "Indicators used in ordination and clustering analysis to develop plant communities") %>%
   kable_styling(bootstrap_options = "bordered") %>%
  row_spec(kable_input=., row = 0:nrow(indicator_descriptions), background = "lightsteelblue") %>%
  collapse_rows(columns = 1, valign = "top")

Other data prep things to consider:

# remove incomplete rows - Bray Curtis distance can't work with NAs
plot_data_clean <- na.omit(plot_data)

# Some plots were sampled multiple times - I'll include only the first year that a plot has complete data
plot_data_first <- plot_data_clean %>%
  dplyr::group_by(PlotCode) %>%
  dplyr::arrange(.data=., Year, .by_group = TRUE) %>%
  dplyr::filter(dplyr::row_number()==1) %>%
  dplyr::ungroup()

# make the clustering and ordination data frame - can't include ANY columns except the actual variables!
ord.df <- dplyr::select(plot_data_first, -SourceKey, -PlotID, -SiteName, -PlotName,
                        -Year, -Longitude_NAD83, -Latitude_NAD83,
                        -Month, -Day) %>%
  tibble::column_to_rownames("PlotCode") # keep an ID code for the plot as the row name to prevent data scrambling problems

# remove rare species
sp_pa <- mutate(rowwise(ord.df), across(everything(), function(x){if(x>0){1}else{0}})) # make a presence/absence data frame
sp_keep <- names(colSums(sp_pa)[which(colSums(sp_pa)>10)])
ord.df <- select(ord.df, all_of(sp_keep))


head(ord.df)

Distance measures (aka dissimilarity measures)

dist.df <- vegan::vegdist(ord.df, method = "bray") # creates Bray-Curtis distance matrix

Clustering

Hierarchical clustering

There are dozens of ways to do hierarchical clustering! The two things you have to choose are a distance measure (see above) and a linkage method compatible with your distance measure. A common combo that works well for a lot of ecological community data is to use Bray-Curtis distance with a flexible beta linkage (β = –0.25).

# cluster and indicator species analysis
library(cluster) # for cluster analysis
library(dendextend) # for pretty cluster plotting

#clust <- hclust(dist.df, method = "average")
clust1 <- agnes(x= dist.df, diss=T, method="flexible", par.method = c(0.625,0.625,-0.25,0)) # flexible beta with beta=-0.25
#plot(clust1)
hclust1 <- as.hclust(clust1) # covert to class hclust for easier plotting
plot(hclust1, labels = FALSE)

How many groups should I have?

The hardest part about clustering is determining the right number of groups! Some strategies: * Silhouette plots (compare dissimilarity between a plot and its group vs. between a plot and everything not in its group) * Maximize the correlation between the distance matrix and a binary grouping matrix * Indicator species analysis (i.e. species fidelity)

library(labdsv) # indicator species analysis
library(foreach)

# use IndVal from Dufrene and Legendre (1997)
# groupings from hierarchical clustering (grp)

# create vectors of groupings with different numbers of clusters
grps <- data.frame(site=1:nrow(ord.df))
for (i in 2:12) {
  grps <- cbind(grps, cutree(hclust1, k=i))
}
colnames(grps) <- c("site", paste(2:12, "groups", sep="_"))

# calculate IndVal for each species in each number of clusters
listnames <- paste("indvals", 2:12, sep="")

ivlabs_list <- foreach(i=2:12) %do% indval(x=ord.df, clustering =grps[,i], numitr = 1000)

names(ivlabs_list) <- listnames

# Create a data frame with three columns: number of clusters, average p, number of sig ind spp
ivlabs_df <- data.frame(nclusters=2:12, p_mean=NA, nsignif=NA, sumsignif=NA)
for(i in seq(length(ivlabs_list))){
  ivlabs_df[i,"p_mean"] <- mean(ivlabs_list[[i]]$pval, na.rm = T) 
  ivlabs_df[i, "nsignif"] <- length(which(ivlabs_list[[i]]$pval<=0.05))
  ivlabs_df[i,"sumsignif"] <- sum(ivlabs_list[[i]]$indval, na.rm = T)
}

par(mfrow=c(1, 2))
# plot average p values by number of clusters
plot(x=ivlabs_df$nclusters, y=ivlabs_df$p_mean, xlab="Number of clusters", ylab="Mean p-value")
# plot number of significant indicator species by number of clusters
plot(x=ivlabs_df$nclusters, y=ivlabs_df$nsignif, xlab="Number of clusters", ylab="Number of significant species")
par(mfrow=c(1, 1))

k_groups <- 9 # manually change based on plot results

The idea here is to minimize the mean p-value and maximize the number of significant species. Based on the results in Figure \@ref(fig:clustering), the optimal number of vegetation groupings is r k_groups (mean p-value = r round(ivlabs_df$p_mean[ivlabs_df$nclusters==k_groups], 3); number of indicator species = r ivlabs_df$nsignif[ivlabs_df$nclusters==k_groups]).

It's a good idea to check your indicator species groupings with a good botanist to make sure the groupings make sense bioligically! E.g. here, pinyon and juniper got grouped together, which definitely makes botanical sense. However, the sagebrush groupings are a bit weird - I need to go back and lump all sagebrush subspecies together.

# create groups
grp <- cutree(hclust1, k=k_groups)

# create data frame that identifies indicator species by group:
ivlab <- indval(x=ord.df, clustering = grp, numitr = 1000)

gr <- ivlab$maxcls 
iv <- ivlab$indcls 
pv <- ivlab$pval 

indvalsummary <- data.frame(group=gr, indval=iv, pvalue=pv) #, freq=fr)
indvalsummary <- indvalsummary[order(indvalsummary$group, -indvalsummary$indval),]
indvalsummary$Code <- rownames(indvalsummary)

# add in readable indicator names
indvalsummary <- left_join(x=indvalsummary,
                           y=select(indicator_descriptions, Indicator_code, Indicator),
                           by=c("Code"="Indicator_code")) %>%
  left_join(x=.,
            y=select(read.csv(data_file_paths(user)$species_list), SpeciesCode, ScientificName),
            by=c("Code"="SpeciesCode")) %>%
  mutate(Indicator = ifelse(test = is.na(Indicator), yes = ScientificName, no = Indicator))

if(opuntia_combined){
  indvalsummary$Indicator[which(indvalsummary$Code=="AH_OpuntiaCover")] <- indicator_descriptions$Indicator[which(indicator_descriptions$Indicator_code=="opuntia_combined")]
}

# add in relative abundance
relab <- as.data.frame(ivlab$relabu)
colnames(relab) <- paste0("RelativeAbund_", 1:ncol(relab))
relab$Code <- rownames(relab)
indvalsummary <- left_join(x=indvalsummary, y=relab, by="Code")

# add in relative frequency
relfr <- as.data.frame(ivlab$relfrq)
colnames(relfr) <- paste0("RelativeFreq_", 1:ncol(relfr)) 
relfr$Code <- rownames(relfr)
indvalsummary <- left_join(x=indvalsummary, y=relfr, by="Code")

# create a table for display where relative abundance and relative frequency are only displayed for the group for which the species is an indicator
indvalsummary_disp <- dplyr::select(indvalsummary, group, indval, pvalue, Indicator)
indvalsummary_disp$RelAbund <- NA
for(g in 1:k_groups){
 indvalsummary_disp[indvalsummary_disp$group==g, "RelAbund"] <- 
   indvalsummary[indvalsummary$group==g, paste0("RelativeAbund_", g)]
}

indvalsummary_disp$RelFreq <- NA
for(g in 1:k_groups){
 indvalsummary_disp[indvalsummary_disp$group==g, "RelFreq"] <- 
   indvalsummary[indvalsummary$group==g, paste0("RelativeFreq_", g)]
}

# order data frame by group, then significance; for output file
#indvalsummary_disp <- indvalsummary_disp[order(indvalsummary_disp$group, indvalsummary_disp$indval), ]
indvalsummary_disp <- indvalsummary_disp %>%
  group_by(group) %>%
  arrange(desc(indval), .by_group=T)

indvalsummary_disp_filt <- filter(indvalsummary_disp, pvalue<=0.05) %>%
  mutate(indval=round(indval, 2),
         pvalue=round(pvalue, 3),
         RelAbund=round(RelAbund, 2),
         RelFreq=round(RelFreq, 2)) %>%
  select(group, Indicator, indval, pvalue, RelAbund, RelFreq) %>%
  setNames(c("Group", "Indicator", "Indicator value", "p-value", "Relative abundance", "Relative frequency"))

kbl(x = indvalsummary_disp_filt, format = "html", row.names = F, align = "c",
    caption = "Indicator species for plant community groups. The indicator value of a species is a combined measure of fidelity (always present in the group) and specificity (exclusive to the group), where higher values indicate a stronger association with the given vegetation group. Relative abundance indicates specificity, where 1 means the species only occurs in plots of the given group. Relative frequency indicates fidelity, where 1 means the species always occurs in plots of the given group. Only species with a p-value less than 0.05 are displayed.") %>%
  kable_styling(bootstrap_options = "bordered") %>%
  row_spec(kable_input=., row = 0:nrow(indvalsummary_disp_filt), background = "lightsteelblue") %>%
  collapse_rows(columns = 1, valign = "top")

Going back to our clustering dendrogram, here's what it looks like when we cut the tree to have r k_groups groups.

# Cluster dendrogram plot
hclust1_dend <- as.dendrogram(hclust1)

# veg group color palette
pal_veg <- RColorBrewer::brewer.pal(n=k_groups, name = "Set1")

plot(hclust1_dend, leaflab= "none")
rect.dendrogram(hclust1_dend, # put boxes around each group 
                k=k_groups, 
                cluster = grp, 
                border = pal_veg[unique(cutree(hclust1_dend, k=k_groups)[order.dendrogram(hclust1_dend)])], # this bit puts the groupings in the right order - makes it easier to use the same color for a given group in all your plots
                text = as.character(unique(cutree(hclust1_dend, k=k_groups)[order.dendrogram(hclust1_dend)])), 
                text_cex = 2, 
                lower_rect = -0.25,
                lwd =2)

#plot(ape::as.phylo(hclust1), tip.color = pal_veg[grp], direction = "downwards")

Ordination

Why do an ordination?

Ordination methods

Here, we're using ordination to visualize patterns in the plant communities. The data are zero-inflated (more rare species than common species - this is typical) so NMDS is a good choice.

How many axes?

# The part takes a while!
NMDS_scree_plot<-function(x) { #x is the name of the data frame
  plot(rep(1,2),
       replicate(2,vegan::metaMDS(x,k=1)$stress),
       xlim=c(1,5),
       ylim=c(0,1),
       xlab="# of Dimensions",
       ylab="Stress",
       main="NMDS stress plot")
  for (i in 1:4) {
    points(rep(i+1,2),
           replicate(2,vegan::metaMDS(x,distance="bray",k=i+1)$stress))
  }}


NMDS_scree_plot(ord.df)
ord_dims <- 3 # change manually based on the plot results
#run NMS ordination 
set.seed(1) # always set a seed if you want to be able to rerun your code and get the exact same ordination results!
ord <- metaMDS(ord.df,
               k=ord_dims, # number of dimensions
               trymax = 30)  # can increase this number if you're having convergence problems

Based on Figure \@ref(fig:ordination-scree), ordination stress is acceptable with r ord_dims dimensions, and stress reduction is minimal with additional dimensions. The stress for the ordination with r ord_dims dimensions is r round(ord$stress, 3)..

# 2D plotting
plot_data_first$PlantCommunity <- as.factor(grp)

# manually update based on indicator species
group_labels <- c("1 - Sagebrush shrubland",
                    "2 - Wyoming big sagebrush shrubland",
                    "3 - Sparse shrubland",
                    "4 - C3 perennial grassland with
mixed shrubs",
                    "5 - Pinyon-juniper woodland",
                    "6 - Mixed oak shrubland",
                    "7 - Introduced annual herbaceous 
community",
                    "8 - Ephedra shrubland with biocrust",
                    "6 - C4 perennial grassland"
)

par(mfrow=c(2,2))
# Axes 1x2
plot(ord, choices = c(1,2), type = "n", # plot the axes
     xlim = c(-1.25, 1.25),
     ylim = c(-1, 1))
points(ord, choices = c(1,2), display = "sites", # plot points - can choose "sites" or "species"
       col=pal_veg[plot_data_first$PlantCommunity],
       pch = 21, cex = .6, bg=pal_veg[plot_data_first$PlantCommunity])
ordiellipse(ord, plot_data_first$PlantCommunity, col=pal_veg, lwd = 2, label = T) # plot your groups
# can use oriellipse, orihull, or ordispider to plot groups, depending on your needs

# Axes 3x2
plot(ord, choices = c(3,2), type = "n", 
     xlim = c(-1.25, 1.25),
     ylim = c(-1, 1))
points(ord, choices = c(3,2), display = "sites", 
       col=pal_veg[plot_data_first$PlantCommunity],
       pch = 21, cex = .6, bg=pal_veg[plot_data_first$PlantCommunity])
ordiellipse(ord, plot_data_first$PlantCommunity, col=pal_veg, lwd = 2, label = T) 

# Axes 1x3
plot(ord, choices = c(1,3), type = "n",
     xlim = c(-1.25, 1.25),
     ylim = c(-1, 1))
points(ord, choices = c(1,3), display = "sites",
       col=pal_veg[plot_data_first$PlantCommunity],
       pch = 21, cex = .6, bg=pal_veg[plot_data_first$PlantCommunity])
ordiellipse(ord, plot_data_first$PlantCommunity, col=pal_veg, lwd = 2, label = T) 

# Legend
plot(ord, type = "n", axes=FALSE,
     display = "sites",
     col=pal_veg[plot_data_first$PlantCommunity],
     xlab = "",
     ylab = "")
legend(x="center",
       legend = group_labels,
       fill = pal_veg,
       title = "Plant communities")

par(mfrow=c(1,1))
# Axes 1x2
plot(ord, choices = c(1,2), type = "n", # plot the axes
     xlim = c(-1.25, 1.25),
     ylim = c(-1, 1))
text(ord, choices = c(1,2), display = "species")
ordiellipse(ord, plot_data_first$PlantCommunity, col=pal_veg, lwd = 2, label = T) # plot your groups

You can plot in 3 dimensions if you're feeling fancy!

# make a 3d plot
library(vegan3d)

invisible(rgl::open3d())

ordirgl(ord, display = "sites", type = "n", alpha = 1,
        arr.col = "blue"
        ) # alpha is transparency of color - 0 is fully transparent, 1 is not transparent
orglellipse(ord, groups = plot_data_first$PlantCommunity, kind = "sd", col = pal_veg, alpha = 0.4)

rgl::rglwidget() # this embeds the 3D plot in the HTML doc (and it's zoomable! Ooooooooh, aaaaaaaah!)

Fitting environmental variables

To examine relationships between the plant community structure and environmental variables, you can fit an environmental matrix to your ordination.

The standard arrow diagrams assume linear relationships. Surface plots can help you decide if the relationship is approximately linear.

I didn't have time to put together the environmental data for this demo, but here's some code to try it out on your own!

ord.full <- read.csv("path_to_some_environmental_data")

ord.fit <- envfit(ord ~ Elevation_ft,
                  data = ord.full,
                  na.rm=T,
                  choices=c(1:3)) # "choices" indicates number of dimensions

par(mfrow=c(1,2))
# Axes 1x2
plot(ord, choices = c(1,2), type = "n", # plot the axes. "choices" indicates which pair of axes to plot.
     xlim = c(-1.25, 1.25),
     ylim = c(-1, 1))
points(ord, choices = c(1,2), display = "sites", # plot points - can choose "sites" or "species"
       col=pal_veg[plot_data_first$PlantCommunity],
       pch = 21, cex = .6, bg=pal_veg[plot_data_first$PlantCommunity])
ordiellipse(ord, plot_data_first$PlantCommunity, col=pal_veg, lwd = 2, label = T) # plot your groups
plot(ord.fit, add=T)

# Legend
plot(ord, type = "n", axes=FALSE,
     display = "sites",
     col=pal_veg[plot_data_first$PlantCommunity],
     xlab = "",
     ylab = "")
legend(x="center",
       legend = group_labels,
       fill = pal_veg,
       title = "Plant communities")

par(mfrow=c(1,1))
plot(ord, choices=c(1,3))
ordisurf(ord~Elevation_ft, ord.full, labcex=1, choices=c(1,3)) # non-linear
plot(ord.fit_Elev, choices=c(1,3))
# TO MAKE THE 3D PLOT WORK, DON'T USE THE KNIT BUTTON! COPY AND PASTE THE CODE BELOW INTO THE CONSOLE!!
#rmarkdown::render(input = "C:/Users/aknight/Desktop/Telework_Backups/V_drive/PROJECTS/ESG/STM/Scripts/STMdevelopment/R/ordination_clustering_demo.Rmd", output_format = "html_document", output_file = "C:/Users/aknight/Desktop/Telework_Backups/V_drive/PROJECTS/ESG/STM/ordination_clustering_demo.html")


annack84/STMdevelopment documentation built on April 12, 2024, 6:46 p.m.