Preamble

This vignette conducts exploratory and descriptive analyses of abiotic data from the WRS dataset. This includes ordination, clustering, informal comparisons between classes of data, and comparison to the site topography and hydrogeology.

Preparing data

data("WRS_abiotic_clean")
data("WRS_abiotic_model")

# Create aggregate data frame and Gower distances
abiotic_data <- data.frame(unname(WRS_abiotic_model))
WRS_abiotic_model$abiotic_aggreg <- abiotic_data
WRS_abiotic_gower <- lapply(WRS_abiotic_model, daisy, metric = "gower")
names(WRS_abiotic_gower) <- paste(names(WRS_abiotic_gower), "gower", sep = "_")

# Create separate data frames for ease of reuse
list2env(WRS_abiotic_model, envir = .GlobalEnv)
list2env(WRS_abiotic_gower, envir = .GlobalEnv)
rm(abiotic_data)

Test for sampling effect

Due to the nature of groundwater sampling, different samples had different volumes of water filtered. This might correlate with the sampling power, which should be visible as an increase in richness with filtration volume. Let's see.

sampling <- WRS_abiotic_clean$sampling
plot(sampling[, c("DNA_conc", "total_vol", "sediment_aliquot", "total_rich")])
# If anything, there might be some relationship between richness and filtration volume
sampling_lm <- lm(total_rich ~ total_vol, data = sampling)
summary(sampling_lm)
plot(sampling_lm)

It's a similar relationship to that we saw in the regional study - not linear, but rather bimodal. There's a threshold effect where above a certain sampling power, higher richness values are more common.

Overall, thought, the effect is minor, and not something to worry about.

NMDS

abiotic_NMDS <- lapply(WRS_abiotic_gower, metaMDS, autotransform = FALSE, noshare = FALSE, wascores = FALSE)
names(abiotic_NMDS) <- names(WRS_abiotic_model)
par(mfrow = c(2, 3))
lapply(seq_along(abiotic_NMDS), FUN = function(i) {
  plot(abiotic_NMDS[[i]], main = names(abiotic_NMDS)[[i]])
  text(abiotic_NMDS[[i]], cex = 1.5)
})
# Check the Shepherd plots
lapply(abiotic_NMDS, stressplot) # looks fine

Aggregate data

Sample 39 (bore WRS39) is the clear outlier - this is a common factor to most datasets since this is the landfill-influenced bore. Sample 40 (bore WRS38) is next in this transect, so it makes sense that it would be next in the ordination. Then samples 14, 22, 37 and 38 are all in the group of bores in the paddock behind the research station, which presumably are receiving uphill recharge from the two landfill-affected bores.

The river is another outlier, and the bores in this end of the arch (29, 42, 43, 44, 45 etc.) are all close to the river. So that makes sense.

Class-by-class

Overall, samples 39 and 48 are outliers in just about every data class, although in "elements" and "nutrients" this is less pronounced for the river sample (48).

There appears to be some natural clustering in the "elements" NMDS. The samples in the right-hand grouping are all down on the river, in or accessing the parafluvial zone. The "hydrogeo" ordination is too strongly influenced by the categorical variable (hydrogeological typology) to interpret this in light of the site topology.

Others are too complex to interpret just from the NMDS.

PCoA

abiotic_PCoA <- lapply(WRS_abiotic_gower, wcmdscale, eig = TRUE, x.ret = TRUE)
names(abiotic_PCoA) <- names(abiotic_NMDS)
lapply(seq_along(abiotic_PCoA), FUN = function(i) {
  plot(abiotic_PCoA[[i]], main = names(abiotic_PCoA)[[i]], cex = 1.5)
})

lapply(abiotic_PCoA, stressplot)

As far as I can tell, this largely recapitulates the structuring from the NMDS, although distances are slightly less pronounced.

Clustering

Using method "average" (UPGMA I think)

abiotic_clust <- lapply(WRS_abiotic_gower, hclust, method = "average")
names(abiotic_clust) <- names(abiotic_NMDS)
lapply(seq_along(abiotic_clust), function(i) {
  plot(abiotic_clust[[i]], main = names(abiotic_clust)[i])
})
# Cophenetic correlation
abiotic_cophenetic <- lapply(abiotic_clust, cophenetic)

gower_clust_cor <- lapply(seq_along(abiotic_cophenetic), function(i) {
  cor(abiotic_cophenetic[[i]], WRS_abiotic_gower[[i]])
})
gower_clust_cor

That's rather interesting. Sample 39 is an outgroup (understandably) except in two instances: for "hydrogeo", where sample 48 gets its own factor level ("RIVER") and so is the outlier, and in "nutrients". In "nutrients", sample 39 clusters with 48 - both strongly surface-influenced, perhaps? - and, interestingly, with 22. That's sample WRS27_S, which is in the middle of the paddock borefield behind the station, right next to the extraction well. So either this is where the landfill leachate plume goes (bypassing all the other bores along that transect - unlikely), or this bore has been strongly influenced by the massive surface exposure of the adjacent extraction well. Very possible.

Interpretable clusters

# By silhouette width
sil_tree <- function(dend_input, dist_input) {
  clusters <- 1:length(dend_input$labels)
  names(clusters) <- clusters
  for(i in 2:(length(clusters) - 1)) {
    sil_cut <- silhouette(cutree(dend_input, k = i), dist_input)
    clusters[i] <- summary(sil_cut)[["avg.width"]]
  }
  return(clusters)
}

sil_clusts <- data.frame(
  sapply(seq_along(abiotic_clust), function(i) {
    sil_tree(dend_input = abiotic_clust[[i]], dist_input = WRS_abiotic_gower[[i]])
    }))
sil_clusts <- sil_clusts[c(-1, -48), ]
names(sil_clusts) <- names(abiotic_clust)

lapply(names(abiotic_clust), function(i) {
  plot(2:(nrow(sil_clusts) + 1), sil_clusts[[i]], type = "h",
       main = i,
       xlab = "k (number of groups)", ylab = "Average silhouette width")
})

There's not really an obvious solution here as to how many clusters should be chosen. But I think maybe 3 would work. That splits off the outlier (usually sample 39) and divides the remainders into two groups - river-influenced and not. It's a very coarse categorisation but that might be appropriate.

Now it's time to plot those clusters on the NMDS and PCoA so we can have a look at how well they match up.

# Create cluster labels
clust_cut <- sapply(abiotic_clust, cutree, k = 3)
# Superimpose cluster labels onto ordination plots
# NMDS
lapply(seq_along(abiotic_NMDS), FUN = function(i) {
  plot(abiotic_NMDS[[i]], main = names(abiotic_NMDS)[[i]])
  text(abiotic_NMDS[[i]], cex = 1.5, labels = clust_cut[, i])
})

# PCoA
lapply(seq_along(abiotic_PCoA), FUN = function(i) {
  plot(abiotic_PCoA[[i]], type = "n", main = names(abiotic_PCoA)[[i]])
  text(abiotic_PCoA[[i]]$points[, 1:2], labels = clust_cut[, i], cex = 1.5)
})

Interestingly, the separation is quite strong except by nutrients, where more of a gradient structure is visible.

That was a fairly trivial example; let's re-run with a higher number of clusters.

# Create cluster labels
clust_cut <- sapply(abiotic_clust, cutree, k = 4)
# Superimpose cluster labels onto ordination plots
# NMDS
lapply(seq_along(abiotic_NMDS), FUN = function(i) {
  plot(abiotic_NMDS[[i]], main = names(abiotic_NMDS)[[i]])
  text(abiotic_NMDS[[i]], cex = 1.5, labels = clust_cut[, i])
})

# PCoA
lapply(seq_along(abiotic_PCoA), FUN = function(i) {
  plot(abiotic_PCoA[[i]], type = "n", main = names(abiotic_PCoA)[[i]])
  text(abiotic_PCoA[[i]]$points[, 1:2], labels = clust_cut[, i], cex = 1.5)
})

In the aggregate dataset, the river and the landfill-influenced sample 39 are split into their own groups (makes sense). This is not always the case in the individual data classes. One more time:

# Create cluster labels
clust_cut <- sapply(abiotic_clust, cutree, k = 5)
# Superimpose cluster labels onto ordination plots
# NMDS
lapply(seq_along(abiotic_NMDS), FUN = function(i) {
  plot(abiotic_NMDS[[i]], main = names(abiotic_NMDS)[[i]])
  text(abiotic_NMDS[[i]], cex = 1.5, labels = clust_cut[, i])
})

# PCoA
lapply(seq_along(abiotic_PCoA), FUN = function(i) {
  plot(abiotic_PCoA[[i]], type = "n", main = names(abiotic_PCoA)[[i]])
  text(abiotic_PCoA[[i]]$points[, 1:2], labels = clust_cut[, i], cex = 1.5)
})

I'd argue that some of the clustering is becoming a bit nonsensical now, particularly group 3 in the aggregate dataset. In terms of parsimonious, ecologically-interpretable divisions which works with all subsets of the data, I'd start with 4 clusters.



mixtrak/wellington.aquifer.ecol documentation built on Nov. 30, 2017, 4:25 a.m.