In this demo, I'm trying to identify plant communities within a set of plots that occur on similar soil types (ask Travis about his Ecological Site Groups if you want the nitty-gritty). I want to know how distinct the plant communities are from one another, and whether there are any major environmental gradients that separate them.
The general workflow is:
I've structured this around plant community analysis because that's my jam, but there are plenty of other reasons to use clustering and ordination - feel free to bring up your own scenarios as we go through this!
knitr::opts_chunk$set(echo = TRUE) target_ESG <- "Semiarid_Warm_SandyUplands_LoamyUplands" user <- "Anna" #user <- "VPN" library(STMdevelopment)
library(dplyr) # for data wrangling library(kableExtra) # for formatting tables in RMarkdown library(vegan) # for clustering and ordination!
indicators <- c("AH_C3NativePerenGrassCover", "AH_C3IntroducedPerenGrassCover", "AH_C4NativePerenGrassCover", "AH_C4IntroducedPerenGrassCover", #"AH_IntroducedPerenGrassCover", # Non-native perennial grasses "AH_NativePerenForbCover", # Native perennial forbs "AH_IntroducedPerenForbCover", # Non-native perennial forbs "AH_NativeAnnGrassCover", # Native annual grasses "AH_IntroducedAnnGrassCover", # Non-native annual grasses "AH_NativeAnnForbCover", # Native annual forbs "AH_IntroducedAnnForbCover", # Non-native annual forbs "BareSoilCover", # Bare soil "CP_percent_100to200", # Canopy gaps > 100 cm TODO do we want annual or perennial gaps? "CP_percent_200plus", "FH_LichenCover", # Lichen + moss combined cover "FH_MossCover") shrub_by_spp <- T # All shrubs and sub-shrubs by species subshrub_by_spp <- T tree_by_spp <- T # All trees by species opuntia_combined <- T data_sources <- c("BadgerWash", "CRCMonitoring", "IM_NCPN", "LMF", #"NRI", "Parashant", "AIM", "VanScoyocThesis" ) plot_data <- plot_data_pull(user=user, target_ESG = target_ESG, data_sources = data_sources, indicators = indicators, shrub_by_spp = shrub_by_spp, subshrub_by_spp = subshrub_by_spp, tree_by_spp = tree_by_spp, opuntia_combined = opuntia_combined )
Decide what variables to include. Often in community ecology, people include every species as its own variable. This data set covers a large area with a LOT of herbaceous species, so we lumped them into functional groups pertinent to our objectives.
indicator_descriptions <- make_indicator_descriptions(indicators = indicators, shrub_by_spp = shrub_by_spp, subshrub_by_spp = subshrub_by_spp, tree_by_spp = tree_by_spp, opuntia_combined = opuntia_combined) kbl(x = indicator_descriptions[,-1], format = "html", row.names = F, align = "c", caption = "Indicators used in ordination and clustering analysis to develop plant communities") %>% kable_styling(bootstrap_options = "bordered") %>% row_spec(kable_input=., row = 0:nrow(indicator_descriptions), background = "lightsteelblue") %>% collapse_rows(columns = 1, valign = "top")
# remove incomplete rows - Bray Curtis distance can't work with NAs plot_data_clean <- na.omit(plot_data) # Some plots were sampled multiple times - I'll include only the first year that a plot has complete data plot_data_first <- plot_data_clean %>% dplyr::group_by(PlotCode) %>% dplyr::arrange(.data=., Year, .by_group = TRUE) %>% dplyr::filter(dplyr::row_number()==1) %>% dplyr::ungroup() # make the clustering and ordination data frame - can't include ANY columns except the actual variables! ord.df <- dplyr::select(plot_data_first, -SourceKey, -PlotID, -SiteName, -PlotName, -Year, -Longitude_NAD83, -Latitude_NAD83, -Month, -Day) %>% tibble::column_to_rownames("PlotCode") # keep an ID code for the plot as the row name to prevent data scrambling problems # remove rare species sp_pa <- mutate(rowwise(ord.df), across(everything(), function(x){if(x>0){1}else{0}})) # make a presence/absence data frame sp_keep <- names(colSums(sp_pa)[which(colSums(sp_pa)>10)]) ord.df <- select(ord.df, all_of(sp_keep)) head(ord.df)
dist.df <- vegan::vegdist(ord.df, method = "bray") # creates Bray-Curtis distance matrix
There are dozens of ways to do hierarchical clustering! The two things you have to choose are a distance measure (see above) and a linkage method compatible with your distance measure. A common combo that works well for a lot of ecological community data is to use Bray-Curtis distance with a flexible beta linkage (β = –0.25).
# cluster and indicator species analysis library(cluster) # for cluster analysis library(dendextend) # for pretty cluster plotting #clust <- hclust(dist.df, method = "average") clust1 <- agnes(x= dist.df, diss=T, method="flexible", par.method = c(0.625,0.625,-0.25,0)) # flexible beta with beta=-0.25 #plot(clust1) hclust1 <- as.hclust(clust1) # covert to class hclust for easier plotting plot(hclust1, labels = FALSE)
The hardest part about clustering is determining the right number of groups! Some strategies: * Silhouette plots (compare dissimilarity between a plot and its group vs. between a plot and everything not in its group) * Maximize the correlation between the distance matrix and a binary grouping matrix * Indicator species analysis (i.e. species fidelity)
library(labdsv) # indicator species analysis library(foreach) # use IndVal from Dufrene and Legendre (1997) # groupings from hierarchical clustering (grp) # create vectors of groupings with different numbers of clusters grps <- data.frame(site=1:nrow(ord.df)) for (i in 2:12) { grps <- cbind(grps, cutree(hclust1, k=i)) } colnames(grps) <- c("site", paste(2:12, "groups", sep="_")) # calculate IndVal for each species in each number of clusters listnames <- paste("indvals", 2:12, sep="") ivlabs_list <- foreach(i=2:12) %do% indval(x=ord.df, clustering =grps[,i], numitr = 1000) names(ivlabs_list) <- listnames # Create a data frame with three columns: number of clusters, average p, number of sig ind spp ivlabs_df <- data.frame(nclusters=2:12, p_mean=NA, nsignif=NA, sumsignif=NA) for(i in seq(length(ivlabs_list))){ ivlabs_df[i,"p_mean"] <- mean(ivlabs_list[[i]]$pval, na.rm = T) ivlabs_df[i, "nsignif"] <- length(which(ivlabs_list[[i]]$pval<=0.05)) ivlabs_df[i,"sumsignif"] <- sum(ivlabs_list[[i]]$indval, na.rm = T) } par(mfrow=c(1, 2)) # plot average p values by number of clusters plot(x=ivlabs_df$nclusters, y=ivlabs_df$p_mean, xlab="Number of clusters", ylab="Mean p-value") # plot number of significant indicator species by number of clusters plot(x=ivlabs_df$nclusters, y=ivlabs_df$nsignif, xlab="Number of clusters", ylab="Number of significant species") par(mfrow=c(1, 1)) k_groups <- 9 # manually change based on plot results
The idea here is to minimize the mean p-value and maximize the number of significant species. Based on the results in Figure \@ref(fig:clustering), the optimal number of vegetation groupings is r k_groups
(mean p-value = r round(ivlabs_df$p_mean[ivlabs_df$nclusters==k_groups], 3)
; number of indicator species = r ivlabs_df$nsignif[ivlabs_df$nclusters==k_groups]
).
It's a good idea to check your indicator species groupings with a good botanist to make sure the groupings make sense bioligically! E.g. here, pinyon and juniper got grouped together, which definitely makes botanical sense. However, the sagebrush groupings are a bit weird - I need to go back and lump all sagebrush subspecies together.
# create groups grp <- cutree(hclust1, k=k_groups) # create data frame that identifies indicator species by group: ivlab <- indval(x=ord.df, clustering = grp, numitr = 1000) gr <- ivlab$maxcls iv <- ivlab$indcls pv <- ivlab$pval indvalsummary <- data.frame(group=gr, indval=iv, pvalue=pv) #, freq=fr) indvalsummary <- indvalsummary[order(indvalsummary$group, -indvalsummary$indval),] indvalsummary$Code <- rownames(indvalsummary) # add in readable indicator names indvalsummary <- left_join(x=indvalsummary, y=select(indicator_descriptions, Indicator_code, Indicator), by=c("Code"="Indicator_code")) %>% left_join(x=., y=select(read.csv(data_file_paths(user)$species_list), SpeciesCode, ScientificName), by=c("Code"="SpeciesCode")) %>% mutate(Indicator = ifelse(test = is.na(Indicator), yes = ScientificName, no = Indicator)) if(opuntia_combined){ indvalsummary$Indicator[which(indvalsummary$Code=="AH_OpuntiaCover")] <- indicator_descriptions$Indicator[which(indicator_descriptions$Indicator_code=="opuntia_combined")] } # add in relative abundance relab <- as.data.frame(ivlab$relabu) colnames(relab) <- paste0("RelativeAbund_", 1:ncol(relab)) relab$Code <- rownames(relab) indvalsummary <- left_join(x=indvalsummary, y=relab, by="Code") # add in relative frequency relfr <- as.data.frame(ivlab$relfrq) colnames(relfr) <- paste0("RelativeFreq_", 1:ncol(relfr)) relfr$Code <- rownames(relfr) indvalsummary <- left_join(x=indvalsummary, y=relfr, by="Code") # create a table for display where relative abundance and relative frequency are only displayed for the group for which the species is an indicator indvalsummary_disp <- dplyr::select(indvalsummary, group, indval, pvalue, Indicator) indvalsummary_disp$RelAbund <- NA for(g in 1:k_groups){ indvalsummary_disp[indvalsummary_disp$group==g, "RelAbund"] <- indvalsummary[indvalsummary$group==g, paste0("RelativeAbund_", g)] } indvalsummary_disp$RelFreq <- NA for(g in 1:k_groups){ indvalsummary_disp[indvalsummary_disp$group==g, "RelFreq"] <- indvalsummary[indvalsummary$group==g, paste0("RelativeFreq_", g)] } # order data frame by group, then significance; for output file #indvalsummary_disp <- indvalsummary_disp[order(indvalsummary_disp$group, indvalsummary_disp$indval), ] indvalsummary_disp <- indvalsummary_disp %>% group_by(group) %>% arrange(desc(indval), .by_group=T) indvalsummary_disp_filt <- filter(indvalsummary_disp, pvalue<=0.05) %>% mutate(indval=round(indval, 2), pvalue=round(pvalue, 3), RelAbund=round(RelAbund, 2), RelFreq=round(RelFreq, 2)) %>% select(group, Indicator, indval, pvalue, RelAbund, RelFreq) %>% setNames(c("Group", "Indicator", "Indicator value", "p-value", "Relative abundance", "Relative frequency")) kbl(x = indvalsummary_disp_filt, format = "html", row.names = F, align = "c", caption = "Indicator species for plant community groups. The indicator value of a species is a combined measure of fidelity (always present in the group) and specificity (exclusive to the group), where higher values indicate a stronger association with the given vegetation group. Relative abundance indicates specificity, where 1 means the species only occurs in plots of the given group. Relative frequency indicates fidelity, where 1 means the species always occurs in plots of the given group. Only species with a p-value less than 0.05 are displayed.") %>% kable_styling(bootstrap_options = "bordered") %>% row_spec(kable_input=., row = 0:nrow(indvalsummary_disp_filt), background = "lightsteelblue") %>% collapse_rows(columns = 1, valign = "top")
Going back to our clustering dendrogram, here's what it looks like when we cut the tree to have r k_groups
groups.
# Cluster dendrogram plot hclust1_dend <- as.dendrogram(hclust1) # veg group color palette pal_veg <- RColorBrewer::brewer.pal(n=k_groups, name = "Set1") plot(hclust1_dend, leaflab= "none") rect.dendrogram(hclust1_dend, # put boxes around each group k=k_groups, cluster = grp, border = pal_veg[unique(cutree(hclust1_dend, k=k_groups)[order.dendrogram(hclust1_dend)])], # this bit puts the groupings in the right order - makes it easier to use the same color for a given group in all your plots text = as.character(unique(cutree(hclust1_dend, k=k_groups)[order.dendrogram(hclust1_dend)])), text_cex = 2, lower_rect = -0.25, lwd =2) #plot(ape::as.phylo(hclust1), tip.color = pal_veg[grp], direction = "downwards")
Here, we're using ordination to visualize patterns in the plant communities. The data are zero-inflated (more rare species than common species - this is typical) so NMDS is a good choice.
# The part takes a while! NMDS_scree_plot<-function(x) { #x is the name of the data frame plot(rep(1,2), replicate(2,vegan::metaMDS(x,k=1)$stress), xlim=c(1,5), ylim=c(0,1), xlab="# of Dimensions", ylab="Stress", main="NMDS stress plot") for (i in 1:4) { points(rep(i+1,2), replicate(2,vegan::metaMDS(x,distance="bray",k=i+1)$stress)) }} NMDS_scree_plot(ord.df) ord_dims <- 3 # change manually based on the plot results
#run NMS ordination set.seed(1) # always set a seed if you want to be able to rerun your code and get the exact same ordination results! ord <- metaMDS(ord.df, k=ord_dims, # number of dimensions trymax = 30) # can increase this number if you're having convergence problems
Based on Figure \@ref(fig:ordination-scree), ordination stress is acceptable with r ord_dims
dimensions, and stress reduction is minimal with additional dimensions. The stress for the ordination with r ord_dims
dimensions is r round(ord$stress, 3)
..
# 2D plotting plot_data_first$PlantCommunity <- as.factor(grp) # manually update based on indicator species group_labels <- c("1 - Sagebrush shrubland", "2 - Wyoming big sagebrush shrubland", "3 - Sparse shrubland", "4 - C3 perennial grassland with mixed shrubs", "5 - Pinyon-juniper woodland", "6 - Mixed oak shrubland", "7 - Introduced annual herbaceous community", "8 - Ephedra shrubland with biocrust", "6 - C4 perennial grassland" ) par(mfrow=c(2,2)) # Axes 1x2 plot(ord, choices = c(1,2), type = "n", # plot the axes xlim = c(-1.25, 1.25), ylim = c(-1, 1)) points(ord, choices = c(1,2), display = "sites", # plot points - can choose "sites" or "species" col=pal_veg[plot_data_first$PlantCommunity], pch = 21, cex = .6, bg=pal_veg[plot_data_first$PlantCommunity]) ordiellipse(ord, plot_data_first$PlantCommunity, col=pal_veg, lwd = 2, label = T) # plot your groups # can use oriellipse, orihull, or ordispider to plot groups, depending on your needs # Axes 3x2 plot(ord, choices = c(3,2), type = "n", xlim = c(-1.25, 1.25), ylim = c(-1, 1)) points(ord, choices = c(3,2), display = "sites", col=pal_veg[plot_data_first$PlantCommunity], pch = 21, cex = .6, bg=pal_veg[plot_data_first$PlantCommunity]) ordiellipse(ord, plot_data_first$PlantCommunity, col=pal_veg, lwd = 2, label = T) # Axes 1x3 plot(ord, choices = c(1,3), type = "n", xlim = c(-1.25, 1.25), ylim = c(-1, 1)) points(ord, choices = c(1,3), display = "sites", col=pal_veg[plot_data_first$PlantCommunity], pch = 21, cex = .6, bg=pal_veg[plot_data_first$PlantCommunity]) ordiellipse(ord, plot_data_first$PlantCommunity, col=pal_veg, lwd = 2, label = T) # Legend plot(ord, type = "n", axes=FALSE, display = "sites", col=pal_veg[plot_data_first$PlantCommunity], xlab = "", ylab = "") legend(x="center", legend = group_labels, fill = pal_veg, title = "Plant communities") par(mfrow=c(1,1))
# Axes 1x2 plot(ord, choices = c(1,2), type = "n", # plot the axes xlim = c(-1.25, 1.25), ylim = c(-1, 1)) text(ord, choices = c(1,2), display = "species") ordiellipse(ord, plot_data_first$PlantCommunity, col=pal_veg, lwd = 2, label = T) # plot your groups
You can plot in 3 dimensions if you're feeling fancy!
# make a 3d plot library(vegan3d) invisible(rgl::open3d()) ordirgl(ord, display = "sites", type = "n", alpha = 1, arr.col = "blue" ) # alpha is transparency of color - 0 is fully transparent, 1 is not transparent orglellipse(ord, groups = plot_data_first$PlantCommunity, kind = "sd", col = pal_veg, alpha = 0.4) rgl::rglwidget() # this embeds the 3D plot in the HTML doc (and it's zoomable! Ooooooooh, aaaaaaaah!)
To examine relationships between the plant community structure and environmental variables, you can fit an environmental matrix to your ordination.
The standard arrow diagrams assume linear relationships. Surface plots can help you decide if the relationship is approximately linear.
I didn't have time to put together the environmental data for this demo, but here's some code to try it out on your own!
ord.full <- read.csv("path_to_some_environmental_data") ord.fit <- envfit(ord ~ Elevation_ft, data = ord.full, na.rm=T, choices=c(1:3)) # "choices" indicates number of dimensions par(mfrow=c(1,2)) # Axes 1x2 plot(ord, choices = c(1,2), type = "n", # plot the axes. "choices" indicates which pair of axes to plot. xlim = c(-1.25, 1.25), ylim = c(-1, 1)) points(ord, choices = c(1,2), display = "sites", # plot points - can choose "sites" or "species" col=pal_veg[plot_data_first$PlantCommunity], pch = 21, cex = .6, bg=pal_veg[plot_data_first$PlantCommunity]) ordiellipse(ord, plot_data_first$PlantCommunity, col=pal_veg, lwd = 2, label = T) # plot your groups plot(ord.fit, add=T) # Legend plot(ord, type = "n", axes=FALSE, display = "sites", col=pal_veg[plot_data_first$PlantCommunity], xlab = "", ylab = "") legend(x="center", legend = group_labels, fill = pal_veg, title = "Plant communities") par(mfrow=c(1,1))
plot(ord, choices=c(1,3)) ordisurf(ord~Elevation_ft, ord.full, labcex=1, choices=c(1,3)) # non-linear plot(ord.fit_Elev, choices=c(1,3))
# TO MAKE THE 3D PLOT WORK, DON'T USE THE KNIT BUTTON! COPY AND PASTE THE CODE BELOW INTO THE CONSOLE!! #rmarkdown::render(input = "C:/Users/aknight/Desktop/Telework_Backups/V_drive/PROJECTS/ESG/STM/Scripts/STMdevelopment/R/ordination_clustering_demo.Rmd", output_format = "html_document", output_file = "C:/Users/aknight/Desktop/Telework_Backups/V_drive/PROJECTS/ESG/STM/ordination_clustering_demo.html")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.