This demo aims to give a quick overview of the dispRity package (v.r version_release) for palaeobiology analyses of disparity, including disparity through time analyses.
This demo showcases a typical disparity-through-time analysis: we are going to test whether disparity changed through time in a subset of eutherian mammals from the last 100 million years, using a subset of the data from @beckancient2014.
See the example data description for more details.
Briefly, this dataset contains an ordinated matrix of the Gower distances between 50 mammals (BeckLee_mat50), another matrix of the same 50 mammals and the estimated discrete character data of their ancestors (thus 50 + 49 rows, BeckLee_mat99), a dataframe containing the ages of each taxon in the dataset (BeckLee_ages) and finally a phylogenetic tree with the relationships among the 50 mammals (BeckLee_tree).
The ordinated matrix will represent our full morphospace, i.e. all the mammalian morphologies that ever existed through time (for this dataset).
## Loading demo and the package data
library(dispRity)

## Setting the random seed for repeatability
set.seed(123)

## Loading the ordinated matrix/morphospace:
data(BeckLee_mat50)
data(BeckLee_mat99)
head(BeckLee_mat50[,1:5])

dim(BeckLee_mat50)
## The morphospace contains 50 taxa and has 48 dimensions (or axes)

## Showing a list of first and last occurrences data for some fossils
data(BeckLee_ages)
head(BeckLee_ages)

## Plotting a phylogeny
data(BeckLee_tree)
plot(BeckLee_tree, cex = 0.7)
axisPhylo(root = 140)
You can have an even nicer looking tree if you use the strap package!
if(!require(strap)) install.packages("strap")
strap::geoscalePhylo(BeckLee_tree, cex.tip = 0.7, cex.ts = 0.6)
I greatly encourage you to follow along this tutorial with your very own data: it is more exciting and, ultimately, that's probably your objective.
What data can I use?
You can use any type of morphospace in any dataset form ("matrix", "data.frame"). Throughout this tutorial, we assume you are using the (loose) morphospace definition from @Guillerme2020: any matrix where columns are traits and rows are observations (in a distance matrix, columns are still traits, i.e. "distance to species A", etc.).
We won't cover it here but you can also use lists of matrices and lists of trees.
How should I format my data for this tutorial?
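To go through this tutorial you will need a matrix with tip data, a phylogenetic tree, a matrix with tip and node data, and a table of first and last occurrence data (FADLAD).
If you are missing any of these, fear not: here are a couple of functions to simulate the missing data. They will surely make your results look funky but they will let you go through the tutorial.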
WARNING: the functions i.need.a.matrix, i.need.a.tree, i.need.node.data and i.need.FADLAD are used to SIMULATE data for this tutorial. They are not to be used for publications or for analysing real data! If you need a data matrix, a phylogenetic tree or FADLAD data (i.need.a.matrix, i.need.a.tree and i.need.FADLAD), you will actually need to collect data from the literature or the field! If you need node data, you will need to use ancestral states estimations (e.g. using estimate_ancestral_states from the Claddis package).
## Function to simulate a PCO-like matrix from a tree
i.need.a.matrix <- function(tree) {
    matrix <- space.maker(elements = Ntip(tree), dimensions = Ntip(tree),
                          distribution = rnorm,
                          scree = rev(cumsum(rep(1/Ntip(tree), Ntip(tree)))))
    rownames(matrix) <- tree$tip.label
    return(matrix)
}

## Function to simulate a tree
i.need.a.tree <- function(matrix) {
    tree <- rtree(nrow(matrix))
    tree$root.time <- max(tree.age(tree)$age)
    tree$tip.label <- rownames(matrix)
    tree$node.label <- paste0("n", 1:(nrow(matrix)-1))
    return(tree)
}

## Function to simulate some "node" data
i.need.node.data <- function(matrix, tree) {
    matrix_node <- space.maker(elements = Nnode(tree), dimensions = ncol(matrix),
                               distribution = rnorm, scree = apply(matrix, 2, var))
    if(!is.null(tree$node.label)) {
        rownames(matrix_node) <- tree$node.label
    } else {
        rownames(matrix_node) <- paste0("n", 1:(nrow(matrix)-1))
    }
    return(rbind(matrix, matrix_node))
}

## Function to simulate some "FADLAD" data
i.need.FADLAD <- function(tree) {
    tree_ages <- tree.age(tree)[1:Ntip(tree),]
    return(data.frame(FAD = tree_ages[,1], LAD = tree_ages[,1], row.names = tree_ages[,2]))
}
You can use these functions to generate the data you need. For example:
## Aaaaah I don't have FADLAD data!
my_FADLAD <- i.need.FADLAD(tree)
## Sorted.
In the end this is what your data should be named to facilitate the rest of this tutorial (fill in yours here):
## A matrix with tip data
my_matrix <- BeckLee_mat50

## A phylogenetic tree
my_tree <- BeckLee_tree

## A matrix with tip and node data
my_tip_node_matrix <- BeckLee_mat99

## A table of first and last occurrences data (FADLAD)
my_fadlad <- BeckLee_ages
One of the crucial steps in disparity-through-time analysis is to split the full morphospace into smaller time subsets that contain the total number of morphologies at certain points in time (time-slicing) or during certain periods in time (time-binning). Basically, the full morphospace represents the total number of morphologies across all time and will be greater than any of the time subsets of the morphospace.
The dispRity package provides a chrono.subsets function that allows users to split the morphospace into time slices (using method = "continuous") or into time bins (using method = "discrete").
In this example, we are going to split the morphospace into five equal time bins, each 20 million years long, from 100 million years ago to the present.
We will also provide the function with a table containing the first and last occurrence dates for some fossils, to take into account that some fossils might occur in several of our time bins.
## Creating the vector of time bins ages
time_bins <- rev(seq(from = 0, to = 100, by = 20))

## Splitting the morphospace using the chrono.subsets function
binned_morphospace <- chrono.subsets(data = my_matrix, tree = my_tree,
                                     method = "discrete", time = time_bins,
                                     inc.nodes = FALSE, FADLAD = my_fadlad)
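To make the role of the first and last occurrence table more concrete, here is a minimal base R sketch of the binning logic. It is not how chrono.subsets is implemented: the toy_fadlad table, the in.bin helper and all the ages are invented for illustration, and we simply assume that a taxon belongs to every bin that its stratigraphic range overlaps.

```r
## Toy first/last occurrence data (ages in million years ago); all values are invented
toy_fadlad <- data.frame(FAD = c(95, 70, 45, 10),
                         LAD = c(85, 55, 45, 0),
                         row.names = c("taxon_a", "taxon_b", "taxon_c", "taxon_d"))

## Bin boundaries from oldest to youngest, as in the tutorial
toy_bins <- rev(seq(from = 0, to = 100, by = 20))

## Hypothetical helper: a taxon falls in a bin when its range [LAD, FAD]
## overlaps the bin [bin_end, bin_start]
in.bin <- function(fadlad, bin_start, bin_end) {
    rownames(fadlad)[fadlad$FAD >= bin_end & fadlad$LAD <= bin_start]
}

## Listing the taxa present in each of the five bins
toy_subsets <- lapply(seq_len(length(toy_bins) - 1), function(i) {
    in.bin(toy_fadlad, bin_start = toy_bins[i], bin_end = toy_bins[i + 1])
})
names(toy_subsets) <- paste(toy_bins[-length(toy_bins)], toy_bins[-1], sep = " - ")
toy_subsets
```

In this toy example, taxon_b ranges from 70 to 55 million years ago and therefore appears in both the "80 - 60" and the "60 - 40" bins, while the "40 - 20" bin stays empty.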
The output object is a dispRity object (see more about that here).
In brief, dispRity objects are lists of different elements (i.e. disparity results, morphospace time subsets, morphospace attributes, etc.) that display only a summary of the object when calling the object, to avoid filling the R console with superfluous output.
It also allows easy plotting/summarising/analysing for repeatability down the line but we will not go into this right now.
## Printing the class of the object
class(binned_morphospace)

## Printing the content of the object
str(binned_morphospace)
names(binned_morphospace)

## Printing the object as a dispRity class
binned_morphospace
These objects will gradually contain more information when completing the following steps in the disparity-through-time analysis.
Once we obtain our different time subsets, we can bootstrap and rarefy them (i.e. pseudo-replicating the data).
The bootstrapping allows us to make each subset more robust to outliers and the rarefaction allows us to compare subsets with the same number of taxa to remove sampling biases (i.e. more taxa in one subset than the others).
The boot.matrix function bootstraps the dispRity object and its rarefaction option performs the rarefaction.
## Getting the minimum number of rows (i.e. taxa) in the time subsets
minimum_size <- min(size.subsets(binned_morphospace))

## Bootstrapping each time subset 100 times and rarefying them
rare_bin_morphospace <- boot.matrix(binned_morphospace, bootstraps = 100,
                                    rarefaction = minimum_size)
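If you are wondering what bootstrapping and rarefying a subset actually mean, the following base R sketch shows the general idea on a simulated subset. The bootstrap.rows helper and the toy data are hypothetical and only illustrate row resampling; boot.matrix offers more options than this.

```r
set.seed(42)

## A toy subset: 10 taxa (rows) in a 3-dimensional morphospace (simulated values)
toy_subset <- matrix(rnorm(30), nrow = 10, ncol = 3,
                     dimnames = list(paste0("t", 1:10), NULL))

## Hypothetical helper for one bootstrap draw: resample rows with replacement.
## With rarefaction, only n_rare rows are drawn instead of all of them.
bootstrap.rows <- function(data, n_rare = nrow(data)) {
    data[sample(nrow(data), size = n_rare, replace = TRUE), , drop = FALSE]
}

## 100 bootstrap pseudo-replicates, each rarefied to 6 taxa
toy_replicates <- replicate(100, bootstrap.rows(toy_subset, n_rare = 6),
                            simplify = FALSE)

## Each pseudo-replicate keeps all dimensions but only 6 (possibly repeated) taxa
dim(toy_replicates[[1]])
```

Drawing the same number of rows from every subset is what makes subsets of different sizes comparable.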
Note how information is adding up to the dispRity object.
We can now calculate the disparity within each time subset along with some confidence intervals generated by the pseudoreplication step above (bootstraps/rarefaction).
Disparity can be calculated in many ways and this package allows users to come up with their own disparity metrics.
For more details, please refer to the dispRity metric section (or directly use moms).
In this example, we are going to look at how the spread of the data in the morphospace changes through time. For that we are going to use the sum of the variances of each dimension of the morphospace. We highly recommend using a metric that makes sense for your specific analysis and for your specific dataset and not just because everyone uses it [@moms, @Guillerme2020]!
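As a quick illustration of what the sum of variances measures, here is a small base R sketch on two simulated spaces. The sum.of.variances helper and the toy matrices are made up for this example; in the package the same quantity is obtained by combining sum and variances.

```r
set.seed(1)

## Two toy morphospaces with 20 taxa and 3 dimensions (simulated values);
## the second one is the first one stretched by a factor of 2
narrow_space <- matrix(rnorm(60), nrow = 20, ncol = 3)
wide_space <- narrow_space * 2

## Sum of variances: the variance of each column (dimension), then their sum
sum.of.variances <- function(space) sum(apply(space, 2, var))

sum.of.variances(narrow_space)
## Doubling every coordinate quadruples each variance, and thus their sum
sum.of.variances(wide_space)
```

The more spread out the observations are along the axes of the space, the higher the score.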
How can I be sure that the metric is the most appropriate for my morphospace and question?
This is not a straightforward question but you can use the test.metric function to check your assumptions (more details here): basically, what test.metric does is modify your morphospace using a null process of interest (e.g. changes in size) and check whether your metric does indeed pick up that change.
For example here, let's see if the sum of variances picks up changes in size but not random changes:
my_test <- test.metric(my_matrix, metric = c(sum, dispRity::variances),
                       shifts = c("random", "size"))
summary(my_test)
plot(my_test)
We see that changes in the inner size (see @moms for more details) are actually picked up by the sum of variances but random changes or outer changes are not. Which is a good thing!
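The logic behind this kind of check can be sketched in base R. This is a crude stand-in and not what test.metric does internally: we simulate a space, keep either the half of the elements closest to the centre (a reduction in size) or a random half, and compare the sum of variances. All object names and values are invented for the illustration.

```r
set.seed(2)

## A simulated 2-dimensional space with 200 elements
toy_space <- matrix(rnorm(400), nrow = 200, ncol = 2)
sum.of.variances <- function(space) sum(apply(space, 2, var))

## Distance of each element from the centre of the space
centre <- colMeans(toy_space)
radius <- sqrt(rowSums(sweep(toy_space, 2, centre)^2))

## Size reduction: keep only the half of the elements closest to the centre
inner_half <- toy_space[radius <= median(radius), , drop = FALSE]

## Random reduction: keep a random half of the elements
random_half <- toy_space[sample(nrow(toy_space), size = nrow(toy_space) / 2), ,
                         drop = FALSE]

## Comparing the metric on the full and the reduced spaces
toy_scores <- c(full = sum.of.variances(toy_space),
                inner = sum.of.variances(inner_half),
                random = sum.of.variances(random_half))
toy_scores
```

A metric that captures size should drop clearly for the inner half while staying close to the full value for the random half.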
As you may have noticed, the sum of variances is defined in test.metric as c(sum, variances). This is a core bit of the dispRity package where you can define your own metric as a function or a set of functions.
You can find more info about this in the dispRity metric section but in brief, the dispRity package considers metrics by their "dimensions" level which corresponds to what they output. For example, the function sum is a dimension level 1 function because no matter the input it outputs a single value (the sum), variances on the other hand is a dimension level 2 function because it will output the variance of each column in a matrix (an example of a dimensions level 3 would be the function var that outputs a matrix).
The dispRity package always automatically sorts the dimensions levels: it will always run dimensions level 3 > dimensions level 2 > and dimensions level 1. In this case both c(sum, variances) and c(variances, sum) will result in actually running sum(variances(matrix)).
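The three dimensions levels are easy to see from the shape of what each function returns. The sketch below uses only base R: column.variances is a hypothetical stand-in for the package's variances function, and the toy matrix is invented.

```r
## A toy space with 4 observations (rows) and 3 dimensions (columns)
toy_space <- matrix(c(1, 2, 3, 4,
                      2, 4, 6, 8,
                      0, 1, 0, 1), nrow = 4, ncol = 3)

## Hypothetical stand-in for a dimensions level 2 function (one value per column)
column.variances <- function(space) apply(space, 2, var)

level3 <- var(toy_space)              # a 3 x 3 variance-covariance matrix
level2 <- column.variances(toy_space) # a vector of 3 variances
level1 <- sum(toy_space)              # a single value

dim(level3)
length(level2)
length(level1)

## Running the levels from the highest to the lowest, as in sum(variances(matrix))
sum(column.variances(toy_space))
```

Whatever the order in which the functions are supplied, only the level 2 then level 1 order returns a single disparity value here.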
Anyways, let's calculate the sum of variances on our bootstrapped and rarefied morphospaces:
## Calculating disparity for the bootstrapped and rarefied data
disparity <- dispRity(rare_bin_morphospace, metric = c(sum, dispRity::variances))
To display the actual calculated scores, we need to summarise the disparity object using the S3 method summary that is applied to a dispRity object (see ?summary.dispRity for more details).
By the way, as for any R package, you can refer to the help files for each individual function for more details.
## Summarising the disparity results
summary(disparity)
The summary.dispRity function comes with many options on which values to calculate (central tendency and quantiles) and on how many digits to display. Refer to the function's manual for more details.
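To give an idea of what such a summary boils down to, here is a base R sketch that computes a central tendency and some quantiles from a set of bootstrapped scores. The scores are simulated and the choice of the median and of the 50% and 95% interval bounds is an assumption made for this example, not a description of the function's internals.

```r
set.seed(3)

## Hypothetical bootstrapped disparity scores for one subset (100 pseudo-replicates)
toy_scores <- rnorm(100, mean = 1.5, sd = 0.1)

## Central tendency (median) and the bounds of the 50% and 95% intervals
toy_summary <- c(bs.median = median(toy_scores),
                 quantile(toy_scores, probs = c(0.025, 0.25, 0.75, 0.975)))

## Displaying the values with 3 digits
round(toy_summary, digits = 3)
```

Swapping median for mean, or changing the probs vector, is the kind of choice the options mentioned above let you make.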
It is sometimes easier to visualise the results in a plot than in a table.
For that we can use the plot S3 function to plot the dispRity objects (see ?plot.dispRity for more details).
## Graphical options (dev.new opens a plotting device on any platform)
dev.new(width = 10, height = 5)
par(mfrow = c(1,2), bty = "n")

## Plotting the bootstrapped and rarefied results
plot(disparity, type = "continuous", main = "bootstrapped results")
plot(disparity, type = "continuous", main = "rarefied results", rarefaction = minimum_size)
Nice. The curves look pretty similar.
Same as for the summary.dispRity function, check out the plot.dispRity manual for the many, many options available.
Finally, to draw some valid conclusions from these results, we can apply some statistical tests.
We can test, for example, if mammalian disparity changed significantly through time over the last 100 million years.
To do so, we can compare the means of each time-bin in a sequential manner to see whether the disparity in bin n is equal to the disparity in bin n+1, and whether this is in turn equal to the disparity in bin n+2, etc.
Because our data is temporally autocorrelated (i.e. what happens in bin n+1 depends on what happened in bin n) and pseudoreplicated (i.e. each bootstrap draw creates non-independent time subsets because they are all based on the same time subsets), we apply a non-parametric mean comparison: the wilcox.test.
Also, we need to apply a p-value correction (e.g. Bonferroni correction) to correct for multiple testing (see ?p.adjust for more details).
## Testing the differences between bins in the bootstrapped dataset.
test.dispRity(disparity, test = wilcox.test, comparison = "sequential",
              correction = "bonferroni")

## Testing the differences between bins in the rarefied dataset.
test.dispRity(disparity, test = wilcox.test, comparison = "sequential",
              correction = "bonferroni", rarefaction = minimum_size)
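For readers who want to see the sequential comparison and the correction spelled out, here is a base R sketch using only wilcox.test and p.adjust. The disparity scores are simulated and the object names are invented; test.dispRity handles these steps for you on real dispRity objects.

```r
set.seed(4)

## Hypothetical bootstrapped disparity scores for three consecutive time bins
toy_disparity <- list("100 - 80" = rnorm(100, mean = 1.0, sd = 0.1),
                      "80 - 60" = rnorm(100, mean = 1.5, sd = 0.1),
                      "60 - 40" = rnorm(100, mean = 1.5, sd = 0.1))

## Sequential comparisons: bin 1 vs bin 2, then bin 2 vs bin 3
n_bins <- length(toy_disparity)
comparisons <- cbind(first = seq_len(n_bins - 1), second = seq_len(n_bins)[-1])

## Wilcoxon rank sum test for each pair of consecutive bins
raw_p <- apply(comparisons, 1, function(pair) {
    wilcox.test(toy_disparity[[pair[1]]], toy_disparity[[pair[2]]])$p.value
})
names(raw_p) <- paste(names(toy_disparity)[comparisons[, "first"]],
                      names(toy_disparity)[comparisons[, "second"]], sep = " : ")

## Bonferroni correction for the number of comparisons
adjusted_p <- p.adjust(raw_p, method = "bonferroni")
adjusted_p
```

With two comparisons, the Bonferroni correction simply multiplies each p-value by two (capped at one).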
Here our results show significant changes in disparity through time between all time bins (all p-values < 0.05). However, when looking at the rarefied results, there is no significant difference between the time bins in the Palaeogene (60-40 to 40-20 Mya), suggesting that the differences detected in the first test might just be due to the differences in number of taxa sampled (13 or 6 taxa) in each time bin.
The previous section detailed some of the basic functionalities in the dispRity package but of course, you can do much more advanced analyses. Here is a list of some specific tutorials from this manual that you might be interested in:
You can even come up with your own ideas, implementations and modifications of the package: the dispRity package is a modular and collaborative package and I encourage you to contact me (guillert@tcd.e) for any ideas you have about adding new features to the package (whether you have them already implemented or not)!