In andrew-hipp/edivColl: Evolutionary Diversity of Collections

Evolutionary distinctiveness

IMLS update, 2021-06-23, Andrew Hipp

As we have been waiting for our final sequencing results, which just came in at the end of May [Hooray!], I have been thinking about how to organize and use the phylogenetic data. I'll present today edivColl -- for Evolutionary Diversity in Collections -- which is the start of an R package that will provide tools for us and for others doing similar work.

The package lives at https://github.com/andrew-hipp/edivColl, but it can be migrated to the shared repository if that's better, or we can fork it into there if we like... whatever is needed. At this point, I'm working in a separate repository b/c that's what I'm used to.

Also, please note that most of the code I am showing here is embedded in the examples for the functions in this package. My plan is that this vignette will eventually serve as a worked example and a gateway into the package and questions that can be addressed with it, and each example will provide enough information to get our IMLS data into our hands and others' for analysis.

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

temp <- require('edivColl')
if(!temp) {
  devtools::install_github('andrew-hipp/edivColl', dep = FALSE)
  # note that if you don't have dependencies already installed, set dep = TRUE or
  #   leave out the dep argument. it's in there now b/c updating dependencies is enforced
  #   and sometimes irritating during development.
  library(edivColl)
}

1. First, let's get our data together

1.1 phylogenies

Our phylogenies as they stand now are included as package data. My thought is that we will leave them as such unless there is some concern about doing so; I'd rather have the data all easily accessible to reviewers and users in the package instead of making them scrounge around in other repositories (which I and almost everyone else does; it's not the end of the world, I just don't think it's ideal if you can avoid it).

The phylogenies we are developing have lots of accession information appended, and multiple inviduals per species. So for example:

library(ape) # this is a standard R phylogenetics package
data(malus_tr) # this is one of Lindsey's Malus trees
head(malus$tip.label, 10) # these are the tip labels as we build them

plot(malus, cex = 0.5)

For understanding phylogeny of the group, the accessions information and inclusion of multiple exemplars per species is essential. But for phylogenetic diversity and distinctiveness, this is most distracting. So let's strip off that information. The cleanPhylo function assumes that you have thrown the taxon name somewhere into the tip label (if you haven't done this, there are helper functions on my lab repository, but I am keeping them separate for the time being b/c this is very lab-specific) and delimited. It strips out trailing and leading whitespace by default and a few other niceties, though others are still needed. See ?cleanPhylo for details.

malus2 <- cleanPhylo(malus, delim = '_|_')
head(malus2$tip.label, 20)

Isn't that nice?

Also, for phylogenetic diversity, we generally want the branch lengths to be in time units; this tree has its branches in units of substitutions / nucleotide. Let's make a rather casual assumption about the rate of evolution evolving along a tree and rescale the tree appropriately.

malus2 <- chronos(malus2)
plot(malus2, cex = 0.6)

1.2 collections data

Okay, that's great. Now our questions are obviously tied not to just the phylogeny, but to the phylogeny and what is at the tips, both what we want to obtain and what we have already. So let's bundle up the phylogeny with data about our garden. There's a function make.consInt that creates a consistent object type, bundling together data about what is in the garden and what is desired with a phylogeny, and includes both the full data and tree and then just a tree subsetted to your taxa, as well as a data table that matches that tree. We'll talk more in a bit about why bother keeping the full tree. What it doesn't do yet is timestamp the data import and the file locations, which would be worth doing! I am relegating this information for the time being to the (partial) documentation of the datasets.

require(magrittr) # parce que ceci n'est pas une pipe
data(accessions.mor) # read MOR accessions data
data(desiderata.mor) # read MOR desiderata list
temp <- c(accessions.mor$Taxon, desiderata.mor$taxon_name) %>%
   unique %>% sort
dat.mor <- data.frame(inGarden = temp %in% accessions.mor$Taxon,
                      wanted = temp %in% desiderata.mor$taxon_name,
                      row.names = temp)
rm(temp)
combo.malus <- make.consInt(malus2, dat.mor)
data(quercus_tr) # while we're at it, I'm throwing in Quercus
quercus2 <- cleanPhylo(quercus)
combo.quercus <- make.consInt(quercus2, dat.mor)
str(combo.quercus)

2. Now we can start thinking about phylogenetic weight of species

2.1 What weights might we use?

There are a few ways we could quantify the phylogenetic weight of a species for conservation purposes.

Global evolutionary distinctiveness (eDisGlobal). This means you would take the whole phylogeny of all spp (extant spp only, presumably, b/c I think from a conservation standpoint, extinct spp exist only via extant spp, unless perhaps if we think of history as having moral standing aside from its endpoints in the current day, which I think I do, but I haven't got this worked out yet in my head) and calculate evolutionary distinctiveness in some way for all tips (we'll get to metrics in a moment). eDivGlobal is context specific w/ respect to what spp are extant at the moment when you calculate it; it ignores what species are in a collection. This is the approach taken in Larkin et al. 2015.
Collections evolutionary distinctiveness (eDisColl). This would entail considering the phylogeny of just a collections-relevant phylogeny, subsetted to those taxa that are in the collections of interest (be they global or local) as well as taxa desired (this could be local or global if the phylogeny represents collections holdings globally; or local only in any case).
Phylogenetic diversity of resulting collection (pDivColl). Here we would flip the question from being "what is the distinctiveness of the species?" to "how does the phylogenetic diversity of what we are collecting change when we add a species or collection of species?" This interests me, because I suspect there are combinatorial effects that won't be realized if we just use rank order (or absolute value) of evolutionary distinctiveness as our phylogenetic criterion for prioritizing collections.

2.2 a little look at eDisGlobal and eDisColl

There are a few functions in the package that calculate these. w.phy calculates evolutionary distinctiveness using a generalized least squares estimator, presented and illustrated in Hipp et al. 2018. I like this b/c it has its roots in an explicit model of trait evolution, and we tend to think of phylogenies as being useful in ecology b/c they track trait histories (though in conservation, we might think of the genomic ghosts of extinct ancestors as having moral standing, irrespective of their trait 'values,' just as we often think of species as having moral standing irrespective of their traits or utility to humans). But it's not widely used. I think only I use it. The more widely used flavor of evolutionary distinctiveness (e.g. as used in the EDGE method) uses summed branch lengths divided by number of descendents.

So eDisCalc calculates these on a consInt object, and returns both the global and collections results.

library(ggplot2)
library(ggrepel)
qcalc <- eDisCalc(combo.quercus)
qcalc$intersect$sp <- row.names(qcalc$intersect)
p <- ggplot(qcalc$intersect,
            aes(y = all.ed.equalSplits,
                x = sub.ed.equalSplits,
                label = sp))
p <- p + geom_point()
p <- p + geom_label_repel()
print(p)