knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Morphometry is the study of shape and its variations, allowing quantitative comparisons of shape among groups. Shape analysis typically also includes a consideration of size, since shape often varies with size (allometric shape variation). Morphometrics can therefore be a good way to examine growth and scaling.
The `borealis` package and this vignette are designed for those beginning work and analysis in morphometry. Advanced undergraduate and graduate students, as well as investigators new to morphometry, will hopefully find this tutorial helpful.
A typical workflow for geometric morphometrics (GMM) looks like this:
Morphometry in general and GMM in particular is a rich discipline in which people have developed a number of excellent analytical tools.
However, many of these tools can be difficult to use.
The `borealis` package aims to provide R users with a set of relatively simple functions to perform GMM analysis, while maintaining open-ended flexibility.
Importantly, GMM typically involves a number of data processing steps.
The `borealis` package embraces the concept of data provenance. Each time a dataset is modified, functions in the package add to a list element recording important details about the execution and results of that step. This information can then be written to a markdown file as a report of the data's history and handling throughout the workflow.
`borealis` also includes functions that produce publication-quality figures with minimal user effort.
If you've previously loaded the package, it is helpful to "unload" it before installing a newer version.
detach("package:borealis", unload = TRUE)
If you've never installed a package from GitHub before, it may be necessary for you to install the package `devtools`.
install.packages("devtools")
Install `borealis` from GitHub as shown below.
devtools::install_github("aphanotus/borealis")
Load the package as any other.
library(borealis)
For landmark-based geometric morphometrics (GMM), the first step is to "digitize" coordinate positions in digital images of each specimen. ImageJ provides the easiest way to do this.
Before you begin, think carefully about the landmarks you should use to represent the shape of the structure you're interested in. Areas with a greater density of landmarks will capture more shape variation among specimens. However, you don't necessarily need dozens of landmarks. Important insights can be obtained from a simple set of landmarks.
Often, it's helpful to try a number of different landmark configurations on a small group of diverse specimens to see how well they perform. Importantly, you want landmarks that are present in all specimens and that can be placed with good confidence by everyone involved in the digitizing effort. I suggest writing up a good map of all the landmarks before proceeding with the digitization.
Use cmd-M (the Measure command) to record data in a table in ImageJ. The first row of the spreadsheet should provide column names.
Enter the X and Y values under columns named "x" and "y".
There must be a column giving specimen IDs, such as "specimen.ID", and one for scale.
Optionally, add other columns to record metadata.
The ID names, scale, and other metadata should be entered on the row for the first landmark of the specimen.
Any linear measurements accompanying the specimen should be treated as metadata.
Coordinate positions and metadata for each specimen should appear in a consecutive block of rows. In other words, if you have 3 landmarks: row 1 is the header, rows 2-4 are the data for specimen 1, rows 5-7 are the data for specimen 2, etc.
This spreadsheet format is convenient, since it allows rapid copy-and-paste of data from ImageJ with minimal formatting. It will also facilitate converting the coordinate positions and encoding the metadata into the `tps` format in the next step.
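As a concrete sketch, the layout described above might look like the following toy data frame, assuming 3 landmarks per specimen. All IDs, coordinate values, and scale numbers here are hypothetical; your data would live in a .csv file.

```r
# Toy sketch of the spreadsheet layout: 3 landmarks per specimen,
# metadata entered only on the first landmark's row of each block.
wing_data <- data.frame(
  x           = c(1942,  1650,  1690,  1875,  1602,  1644),
  y           = c(354.7, 336.0, 361.3, 348.2, 330.1, 355.9),
  specimen.ID = c("DRA-001", "", "", "DRA-002", "", ""),
  scale       = c(0.0070,  NA,  NA,   0.0071,  NA,  NA)
)
# Rows 1-3 hold specimen 1; rows 4-6 hold specimen 2.
wing_data
```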
![](https://i.imgur.com/hVU207q.png =400x)
The `tps` ("thin plate spline") file format is one of the most commonly used formats among different GMM software (Rohlf 2015).
Creating a large `tps` file manually can be difficult. A stray space or tab can prevent downstream software from handling it properly, and those issues can be very hard to track down.
For that reason, `borealis` includes a simple function, `create.tps`, to convert spreadsheet data, as described above, into a `tps` file. It will also begin recording data provenance and embed metadata for each specimen.
The scale value can also be inverted by setting `invert.scale = TRUE`. The default for downstream functions is to apply scale values by multiplication. Typically this is appropriate when scale is recorded as unit distance (e.g. mm) per pixel. However, if scale is recorded in pixels per unit distance (e.g. pixels/mm), it will be appropriate to first invert the scaling factor before importing coordinate data.
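To see why inversion matters, here is a minimal base-R sketch, assuming a hypothetical calibration of 142.3 pixels/mm:

```r
# Suppose scale was recorded in pixels per mm (e.g. from an ImageJ calibration)
pixels_per_mm <- 142.3
coords_px <- c(x = 1942, y = 354.667)  # raw pixel coordinates

# Downstream functions apply scale by multiplication, which is only
# correct if scale is in mm per pixel. Inverting first fixes this:
scale_mm_per_px <- 1 / pixels_per_mm   # what invert.scale = TRUE computes
coords_mm <- coords_px * scale_mm_per_px
coords_mm
```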
create.tps(
  input.filename = "wings.csv",
  output.filename = "wings.tps",
  id.factors = c('species','caste','digitizer'),
  include.scale = TRUE,
  invert.scale = TRUE
)
The file that's created will look like this.
# ## TPS file creation
#
# Created by user `drangeli` with `borealis::create.tps` on Monday, 22 June 2020, 15:11:47
#
# Input file: ~/Documents/3.research/2.Bombus.mouthparts/Bombus wing GMM/Bombus Wings - forewings.csv
#
# The dataset is 20 x 2 x 99 (*p* landmarks x *k* dimensions x *n* specimens)
#
# Metadata are encoded in specimen ID lines from the following factors:
# - species
# - caste
# - digitizer
#
# Metadata separator: __
#
# **Scale** included and **inverted** from the original dataset.
#
LM=20
1942 354.667
1650 336
1690 361.333
1747.333 377.333
1851.333 400
1867.333 440.667
1832.667 503.333
1819.333 507.333
1743.333 479.333
1674 486
1608.667 419.333
1488.667 387.333
1231.333 467.333
1471.333 518
1478 584.667
1478 594
1724.667 590
1447.333 650
1192.667 539.333
1479.333 679
ID=DRA190718-001__vag__W__JL
SCALE=0.00702814773166532
To start a workflow, you will typically begin from a `tps` file.
The function `read.tps` can do several early steps, but let's start by looking at a simple usage.
shapes <- read.tps("wings.tps")
This creates the object `shapes`. It is what R knows as a `list`, a type of data object that can include many different elements, all accessible together. (Note that the name "shapes" is pretty generic. If you will be doing multiple GMM analyses, it's a good idea to modify your code to use a more descriptive name.)
You can reveal the components of the list using the `names` function.
names(shapes)
[1] "coords" "landmark.number" "specimen.number" "scaled" [5] "metadata" "provenance"
The `coords` element is a 3-dimensional array of the coordinate positions. `landmark.number` and `specimen.number` are obvious. The element `scaled` is a logical value stating whether the coordinates have already been scaled. If this is true (it is here), then you should avoid applying any scale a second time!
The element `metadata` is a data frame with metadata. Each specimen is a row in this table, and ID names are the same in the `metadata` row and `coords` element. Specimens are also in the same order in those two elements. Finally, the `provenance` element contains data provenance, which will be added to as the workflow continues.
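For orientation, here is a toy p × k × n array in base R, showing how specimens are indexed in the third dimension (the landmark counts, values, and specimen names are hypothetical):

```r
# A toy coords-style array: 3 landmarks x 2 dimensions x 2 specimens
coords <- array(
  c(0, 1, 2,  0, 0, 1,    # specimen 1: x values, then y values
    0, 1, 2,  1, 0, 1),   # specimen 2
  dim = c(3, 2, 2),
  dimnames = list(NULL, c("X", "Y"), c("spec-A", "spec-B"))
)
coords[ , , "spec-A"]  # one specimen's 3 x 2 landmark matrix
dim(coords)            # p, k, n
```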
names(shapes$provenance)
[1] "create.tps" "read.tps"
You can look at the contents of the data provenance elements this way.
cat(unlist(shapes$provenance$read.tps))
## TPS data import

Performed by user `drangeli` with `borealis::read.tps` on Tuesday, 23 June 2020, 17:40:40

Metadata were extracted from specimen ID lines for the following factors:
- specimen.id
- species
- caste
- digitizer
The `borealis` package has a simple function for viewing the relative positions of landmarks. The function is designed to be user-friendly. It will automatically detect whether you're giving it coordinates for one specimen, an array with coordinates for multiple specimens, or a `list` object with such an array as one component. So all of the commands below will produce the same plot.
landmark.plot(shapes$coords[,,1])
landmark.plot(shapes$coords)
landmark.plot(shapes)
landmark.plot(shapes, specimen.number = 1)
The function defaults to looking at the first specimen in the array, but you can also specify others.
landmark.plot(shapes, specimen.number = 2)
Often it's convenient to look at shape data with the landmarks connected in a way that reflects their biological meaning.
You can ask `landmark.plot` to include these connections if you first define them as a matrix. In the example below, each connection is indicated by adjacent landmark numbers.
fw.links <- matrix(
  c(1,2,   1,5,   5,4,   4,3,   3,2,   5,6,   6,7,   7,8,   8,9,
    9,4,   3,11,  11,12, 11,10, 9,10,  10,14, 14,15, 15,16, 16,18,
    18,20, 16,17, 17,8,  12,13, 13,19, 14,13, 18,19, 2,12),
  ncol = 2, byrow = TRUE
)
landmark.plot(shapes, links = fw.links)
It's also frequently helpful to look at more than one specimen side-by-side. If you provide a vector to the `specimen.number` argument, the plotting function will display multiple panels for those specimens.
landmark.plot(shapes, specimen.number = 1:4, links = fw.links)
The data as they exist in our `shapes` object right now are included in the `borealis` package. You could choose to start the tutorial at this point using the following commands.
data("Bombus.forewings", package = "borealis") shapes <- Bombus.forewings
Occasionally, a few specimens in a dataset are entered upside-down or "reflected" relative to the others. Or perhaps ImageJ recorded XY coordinates with the origin at the lower left, instead of the upper left. In these instances, it's helpful to go through all the specimens and orient them to have certain landmarks up or to the left.
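Conceptually, a reflection is just a sign flip on one coordinate axis; it changes orientation but not inter-landmark distances. A minimal base-R sketch with a toy configuration (not the `align.reflect` implementation):

```r
# A toy 3-landmark configuration (x, y columns)
coords <- matrix(c(0, 0,
                   2, 1,
                   1, 3),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(NULL, c("x", "y")))

reflected <- coords
reflected[, "y"] <- -reflected[, "y"]  # flip about the x axis

# Inter-landmark distances are unchanged by a reflection:
all.equal(dist(coords), dist(reflected))
```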
The function `align.reflect` will accomplish this, and update the data provenance to list all the reflections that are made.
shapes <- align.reflect(shapes, top.pt = 2, left.pt = 1, links = fw.links )
To save time, the reflection step can be included in the initial `read.tps` statement.
shapes <- read.tps("wings.tps", reflect.specimens = TRUE, top.pt = 2, left.pt = 1, links = fw.links)
Joints can introduce nuisance variation in landmark-based geometric morphometrics. The `borealis` function `align.angle` rotates a subset of landmarks such that they come to a consistent angle. This step should be run before Procrustes alignment, and it is robust to differences in the relative position, orientation, and size of specimens.
However, it's not always necessary to run this step. If the structure you're analyzing isn't jointed, then there's no need for it. So far in this vignette, we've been working with a bumblebee wing dataset that doesn't have obvious joints. So, let's switch to another sample dataset, `mantis`, which includes simulated shapes for the femur and tibia of 100 mantis forelegs.
data("mantis", package = "borealis") # Define mantis.lines { x <- 1:16 mantis.lines <- matrix(c(x[-length(x)],x[-1]), ncol = 2) mantis.lines[10,] <- c(10,1) mantis.lines[15,] <- c(15,6) mantis.lines <- rbind(mantis.lines, matrix(c(5,11, 6,11, 13,16, 14,16), ncol = 2, byrow = TRUE)) } landmark.plot(mantis, specimen.number = 1:3, panels = c(1,3), links = mantis.lines)
As you can see, the angle of the femur and the tibia (the smaller segment) varies for each specimen. If you were to proceed into GPA with these data as they are, that angle would dominate the shape variance and obscure any signal from more interesting biological factors.
These angles can be standardized using the `align.angle` function. It takes several arguments: `angle.pts.1`, `art.pt`, and `angle.pts.2`, which define the angle on which we want to focus. `art.pt` is the vertex of the angle and the articulation point for rotation. `rot.pts` defines the landmarks to actually rotate. Each of these arguments can be the number of a single landmark or a vector of several landmark numbers. If multiple landmarks are provided, their mean position is used. Using multiple landmarks helps mitigate any variance individual landmarks may have. However, you should use your knowledge of anatomy to choose landmarks that make sense for your subject.
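The mean-position idea can be sketched in base R. With toy coordinates for one specimen, the averaged location of two landmarks given to one of these arguments would be:

```r
# A toy p x 2 coordinate matrix for one specimen
coords <- matrix(c(0, 0,
                   1, 0,
                   2, 1,
                   3, 3),
                 ncol = 2, byrow = TRUE)

angle.pts.1 <- c(1, 2)  # landmarks defining one side of the angle
mean.pos <- colMeans(coords[angle.pts.1, ])  # their average x and y
mean.pos
```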
mant.2 <- align.angle(mantis,
                      art.pt = 11,
                      angle.pts.1 = c(1:10),
                      angle.pts.2 = c(12:15),
                      rot.pts = c(12:16))
Angle alignment reduced variance in centroid size-scaled landmark distances by 20.4%.
By default, the function will plot overlaid histograms showing the distribution of variance in inter-landmark distances, scaled by each specimen's centroid size. This can be used to determine how well the alignment improves the consistency of landmark positions, with lower values indicating greater consistency.
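A base-R sketch of the metric being compared, for a single toy configuration (an illustration of the idea, not the `borealis` code):

```r
# A toy 3-landmark configuration
coords <- matrix(c(0, 0,
                   3, 0,
                   0, 4),
                 ncol = 2, byrow = TRUE)

centroid <- colMeans(coords)
# Centroid size = sqrt of summed squared distances from the centroid
csize <- sqrt(sum(sweep(coords, 2, centroid)^2))

# Inter-landmark distances scaled by centroid size
scaled.dists <- dist(coords) / csize
scaled.dists
```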
landmark.plot(mant.2, specimen.number = 1:3, panels = c(1,3), links = mantis.lines)
By default, the angle that is enforced on all specimens is based on the mean angle of all specimens. However, the `reference.specimen` argument can indicate a subset of specimens on which to base the reference angle. You can also add an angular offset to all specimens using the `angle` argument.
Generalized Procrustes analysis (GPA) minimizes the variance in distances between the landmarks of different specimens. It does this by translating, rotating, and re-sizing each specimen's coordinates, but not changing their relative distances to one another. In this way, GPA makes shapes comparable and removes trivial differences introduced by their original imaging. GPA also retains size information as centroid size.
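Two of the three GPA operations, translation and scaling, can be sketched for a single specimen in base R. The rotation step, which requires fitting to a reference shape, is omitted; this is an illustration, not the `geomorph` implementation:

```r
# A toy 3-landmark configuration
coords <- matrix(c(0, 0,
                   3, 0,
                   0, 4),
                 ncol = 2, byrow = TRUE)

centered <- sweep(coords, 2, colMeans(coords))  # translate centroid to origin
csize <- sqrt(sum(centered^2))                  # centroid size
aligned <- centered / csize                     # scale to unit centroid size

sqrt(sum(aligned^2))  # centroid size of the aligned shape is now exactly 1
```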
wing.gpa <- align.procrustes(shapes)
Behind the scenes, `borealis` relies on the function `gpagen` in the R package `geomorph`. The output of `align.procrustes` contains the same list elements produced by `geomorph::gpagen`. However, it also retains any list elements from the input and adds to the data provenance element.
True landmarks are based on reliable anatomical features with unambiguous positions (Zelditch et al. 2012). However, it's also possible to include "semilandmarks", positions that are defined by a line or curve. For example, the edge of a limb does not have any true, fixed anatomical landmarks, but one or more semilandmarks can be defined along the edge to capture its shape. Semilandmarks are defined relative to their neighbors (which can also be semilandmarks). During GPA, semilandmarks are allowed to "slide" along the axis determined by their defining neighbors, so that they are positioned equidistant from those neighbors along that axis. But their deviation from that axis is retained, in order to capture that shape variation. Ideally, semilandmarks are initially placed very close to their ultimate position. But sliding them during GPA helps optimize the alignment of specimens, and reduces variance in shapes that is not biologically meaningful (Zelditch et al. 2012).
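The sliding idea can be sketched for a single semilandmark in base R. This toy example projects the point onto the axis between its neighbors, slides it along that axis, and retains its deviation; it is not the `gpagen` sliding algorithm:

```r
p1   <- c(0, 0)    # neighbor 1
p2   <- c(4, 0)    # neighbor 2
semi <- c(1, 0.5)  # a semilandmark near the p1-p2 axis

axis  <- (p2 - p1) / sqrt(sum((p2 - p1)^2))  # unit vector along the axis
along <- sum((semi - p1) * axis)             # position along the axis
dev   <- semi - (p1 + along * axis)          # deviation from the axis

# Sliding moves the point along the axis (here, to the midpoint between
# neighbors) while its deviation from the axis is retained:
slid <- p1 + 0.5 * sqrt(sum((p2 - p1)^2)) * axis + dev
slid
```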
In the bumblebee wing dataset, all landmarks are true landmarks, defined by the intersections of veins.
However, to provide an example, let's consider landmarks 6 and 15 to be sliding semilandmarks. To define the semilandmarks, `gpagen` and `align.procrustes` will take a matrix with 3 columns. Each row defines one semilandmark. The semilandmark itself is in the middle column. The flanking neighbors, which define the axis for sliding, are in columns 1 and 3.
semiLMs <- matrix(
  c(5,  6,  7,
    14, 15, 16),
  ncol = 3, byrow = TRUE
)
semiLMs
     [,1] [,2] [,3]
[1,]    5    6    7
[2,]   14   15   16
The matrix defining semilandmarks is then included in the call to `align.procrustes` with the `curves` argument.
wing.gpa <- align.procrustes(shapes, curves = semiLMs)
The use of semilandmarks and their definitions is recorded in the data provenance.
After initial GPA, it's typically important to remove any obvious outliers. These may be specimens where coordinate positions were recorded incorrectly. Their presence in the dataset can skew signals that are actually of interest.
`align.procrustes` facilitates interactive outlier detection and removal. Just include the argument `outlier.analysis = TRUE`. The function will run `geomorph::plotOutliers`, then display warp grids for the most extreme shapes and prompt the user to remove some number of them. This can be done iteratively.
wing.gpa <- align.procrustes(shapes, outlier.analysis = TRUE)
In this case, removing the two most extreme outliers looks prudent.
The final GPA plot looks good: The individual specimen landmarks in gray are very close to the mean landmarks shown in black.
The `geomorph` package (Adams et al. 2020) provides a powerful set of tools for GMM analysis. Many functions require shape, size, and other metadata to be in a data structure called a `geomorph.data.frame`. In order to retain data provenance and other elements of our data object, we can use the function `listed.gdf` to convert the `gpagen` and `metadata` elements of our list into a `geomorph.data.frame` while still keeping it within a list with the other elements.
wing.gpa <- listed.gdf(wing.gpa)
We can then take a look at the new data structure. This can be done using the `names` function, as above. However, for complex data structures, the `str` function is often convenient. The output shows the nested relationships among list elements and previews the contents of each one.
str(wing.gpa)
List of 5
 $ gdf            :List of 6
  ..$ specimen.id: chr [1:97] "DRA190815-001" "DRA190815-002" "DRA190815-003" "DRA190815-004" ...
  ..$ species    : chr [1:97] "vag" "imp" "imp" "imp" ...
  ..$ caste      : chr [1:97] "W" "W" "W" "W" ...
  ..$ digitizer  : chr [1:97] "JL" "JL" "JL" "JL" ...
  ..$ coords     : num [1:20, 1:2, 1:97] -0.405 -0.0319 -0.0621 -0.1388 -0.2011 ...
  .. ..- attr(*, "dimnames")=List of 3
  .. .. ..$ : chr [1:20] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:2] "X" "Y"
  .. .. ..$ : chr [1:97] "DRA190815-001" "DRA190815-002" "DRA190815-003" "DRA190815-004" ...
  ..$ Csize      : Named num [1:97] 7.02 6.91 6.54 7.2 6.93 ...
  .. ..- attr(*, "names")= chr [1:97] "DRA190815-001" "DRA190815-002" "DRA190815-003" "DRA190815-004" ...
  ..- attr(*, "class")= chr "geomorph.data.frame"
 $ landmark.number: int 20
 $ specimen.number: int 97
 $ scaled         : logi TRUE
 $ provenance     :List of 8
  ..$ create.tps      : chr "## TPS file creation \n\nCreated by user `drangeli` with `borealis::create.tps` on Tuesday, 23 June 2020, 23:50"| __truncated__
  ..$ read.tps        : chr "## TPS data import\n\nPerformed by user `drangeli` with `borealis::read.tps` on Tuesday, 23 June 2020, 23:51:33"| __truncated__
  ..$ align.reflect   : chr "## Reflection alignment\n\nPerformed by user `drangeli` with `borealis::align.reflect` on Saturday, 10 October "| __truncated__
  ..$ align.procrustes: chr "## Generalized Procrustes analysis\n\nPerformed by user `drangeli` with `borealis::align.procrustes` on Saturda"| __truncated__
  ..$ listed.gdf      : chr "## Geomorph data frame conversion\n\nPerformed by user `drangeli` with `borealis::listed.gdf` on Saturday, 10 O"| __truncated__
Later in the analysis, we will have reasons to work with a subset of the full dataset. As data structures become complex, subsetting them becomes difficult. For example, there is no convenient way to subset a `geomorph.data.frame` using traditional R bracketing.
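The awkwardness comes from each element needing its own subsetting syntax: the coordinates are a 3-D array while the metadata are vectors. A base-R sketch with toy data (not the `subsetGMM` implementation):

```r
# Toy pieces: a p x k x n array of coordinates plus a metadata vector
coords <- array(rnorm(3 * 2 * 5), dim = c(3, 2, 5))
caste  <- c("W", "Q", "W", "W", "Q")

keep <- which(caste == "W")
coords.sub <- coords[ , , keep, drop = FALSE]  # keep the array 3-dimensional
caste.sub  <- caste[keep]                      # subset metadata separately
dim(coords.sub)
```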
The function `subsetGMM` provides a shortcut for subsetting `geomorph.data.frame` objects and data structures produced by the `borealis` functions. As always, data provenance entries record these changes. In the example below, we will subset only the bumblebee wing specimens that represent workers.
x <- which(wing.gpa$gdf$caste == "W")
workers <- subsetGMM(wing.gpa, specimens = x)
It's also possible to use `subsetGMM` to select a subset of the landmarks in a data structure.
distal.wing <- subsetGMM(wing.gpa, landmarks = 1:10)
landmark.plot(distal.wing)
It is possible you may need to make custom modifications to your shape dataset. In order to maintain data provenance in these circumstances, the function `add.provenance` allows you to document any custom data manipulation.
wing.gpa$gdf$digitizer[1] <- "FJ"

wing.gpa <- add.provenance(
  wing.gpa,
  name = "error.correction",
  title = "Corrected error in the metadata",
  text = "Upon checking lab notes, an error needed to be corrected. The digitizer of specimen 1 appears to have been FJ, not JL."
)

cat(wing.gpa$provenance$error.correction)
## Corrected error in the metadata

Performed by user `drangeli` on Saturday, 10 October 2020, 10:15:50

Upon checking lab notes, an error needed to be corrected. The digitizer of specimen 1 appears to have been FJ, not JL.
After creation of a `geomorph.data.frame`, a dataset is usually done with pre-processing and is ready for analysis. That means that the data will not change from this point forward. If so, then a final report on data provenance can be generated from the records kept in our data object.
write.provenance(wing.gpa,
                 output.filename = "~/Desktop/wing.shape.provenance.md",
                 title = "Preliminary wing shape data provenance")
This creates a file in markdown format. These files can be treated as plain text and viewed directly, as below.
# Preliminary wing shape data provenance

## TPS file creation

Created by user `drangeli` with `borealis::create.tps` on Tuesday, 23 June 2020, 23:50:46

Input file: /Users/drangeli/Documents/3.research/2.Bombus.mouthparts/Bombus wing GMM/Bombus Wings - forewings.csv

The dataset is 20 x 2 x 99 (*p* landmarks x *k* dimensions x *n* specimens)

Metadata are encoded in specimen ID lines from the following factors:
- species
- caste
- digitizer
Several free markdown viewers and editors also exist, such as Typora.
To visualize variance in shapes, we perform a form of ordination known as Kendall's tangent space projection. Mathematically, this is similar to principal component analysis (PCA).
wing.pca <- gm.prcomp(wing.gpa$gdf$coords)
To examine the eigenvalues and proportional variance for each dimension in the analysis, run `summary`.
summary(wing.pca)
Ordination type: Principal Component Analysis
Centering and projection: OLS
Number of observations 97
Number of vectors 40

Importance of Components:
                              Comp1        Comp2        Comp3        Comp4        Comp5
Eigenvalues            0.0002302972 0.0001681885 8.952373e-05 6.013661e-05 4.486361e-05
Proportion of Variance 0.2694561165 0.1967867258 1.047460e-01 7.036203e-02 5.249206e-02
Cumulative Proportion  0.2694561165 0.4662428423 5.709889e-01 6.413509e-01 6.938430e-01
While it can be useful to know how to work with the raw output from PCA, often people are most interested in plotting the first two PC axes. PCA is particularly helpful for understanding differences in shapes among groups, and group membership can be mapped onto the shape space plot.
plot(wing.pca, col = as.factor(wing.gpa$gdf$species))
sp <- as.factor(wing.gpa$gdf$species)
legend("topright", levels(sp), col = c(1:length(levels(sp))),
       cex = 0.9, pch = 1, box.lwd = 0, bg = "transparent")
This is okay, but it's not very pretty. The `borealis` package has a function to plot these values using `ggplot2` with colorblind-friendly palettes.
ggGMMplot(wing.pca, group = wing.gpa$gdf$species, group.title = 'species', convex.hulls = TRUE, include.legend = TRUE)
It can also be helpful to examine a version of the morphospace plot that provides example shapes. This is possible too, using `ggGMMplot` with the argument `backtransform.examples = TRUE`. It then requires a reference shape, which can be supplied with the `geomorph` function `mshape`, which calculates a mean shape. Since shape differences can be subtle, it can be helpful to apply a magnification factor using the `bt.shape.mag` argument.
ggGMMplot(wing.pca, group = wing.gpa$gdf$species, group.title = 'species',
          convex.hulls = TRUE,
          backtransform.examples = TRUE,
          ref.shape = mshape(wing.gpa$gdf$coords),
          shape.method = "TPS",
          bt.shape.mag = 3,
          bt.links = fw.links)
If different groups occupy similar areas of morphospace, an important follow-up question is whether that similarity is due to a shared evolutionary history or convergence (perhaps due to similar ecological influences). One way to examine this question is to look at ordination of shape data along with some representation of group relationships. Let's look at two approaches.
We have a good phylogeny for species of bumblebees in our sample dataset. The tree is the consensus from a RAxML analysis of five genes, sequenced by Cameron et al. (2007), for 26 species found in northeastern North America. This tree is included as data in the `borealis` package.
First, it's necessary for us to find mean shapes for each species. If you have a tree with tips that represent each specimen in your shape dataset, then this step won't be necessary.
Because shapes may differ by caste, we'll limit this analysis to workers. We'll subset the shape data using the function `subsetGMM`, as shown above.
x <- which(wing.gpa$gdf$caste == "W")
workers <- subsetGMM(wing.gpa, specimens = x)
Next, the function `coords.subset` will split the coordinate shape data into groups; in this example, by species.
coords.by.species <- coords.subset(workers$gdf$coords, group = workers$gdf$species)
This produces a `list` with elements named for each species. Each element is an array of shape data for the specimens in that species.
names(coords.by.species)
[1] "bimac" "bor" "ferv" "imp" "tern" "terri" "vag"
Next, we use the base R function `lapply` to apply a function to each element of the list. The function `mshape` can find the mean shape for each element.
mshape.by.species <- lapply(coords.by.species, mshape)
The output is a new list, with elements named for each species, but now each element contains just one shape representing the average for that species.
Ideally, we'd be all set at this point. However, this object still exists as a `list`. The functions we'll use next will expect their input to be a 3-dimensional `array`. So we'll use some base R commands to reformat the data without substantively changing it.
species.names <- names(mshape.by.species)

mshape.by.species <- array(
  data = unlist(mshape.by.species),
  dim = c(dim(mshape.by.species[[1]])[1], 2, length(species.names)),
  dimnames = list(NULL, NULL, species.names)
)
Now that the shape data are ready, we need to prepare the phylogeny. Start by making a copy of the tree object that's included with `borealis`, and plotting it out.
data("Bombus.tree", package = "borealis") btree <- Bombus.tree plot(btree)
We'll need to pare it down to only the taxa included in our shape analysis. The names will also need to be changed to match the abbreviated species names we've used in the shape dataset.
Doing some of these things will require tools from the `phytools` package. (If necessary, run `install.packages("phytools")`.)
The `Bombus.tree` dataset includes species name abbreviations in the element `code.name`, which we can copy to the `tip.label` element, where functions will look for taxon names.
btree$tip.label <- btree$code.name

# Check that species.names (our abbreviations in the shape data) are all present in the tree
species.names %in% btree$tip.label

library(phytools)
btree <- keep.tip(btree, species.names)
plot(btree)
Now we have a phylogeny and a set of mean shapes that can be combined for phylogenetic PCA.
pca.w.phylo <- gm.prcomp(mshape.by.species, phy = btree)
plot(pca.w.phylo, phylo = TRUE, main = "PCA with phylogeny")
The space plotted here is the first two principal component axes. It's different from the plot we saw above, because each species is represented by only one point. The points are connected by branches that reflect their evolutionary history, as described by the phylogenetic tree above. Based on this result, we might conclude that "bor" and "ferv" (Bombus borealis and B. fervidus) have wing shapes different from the other species in the dataset, and that those shape differences are coincident with a shared ancestry of those two species. In contrast, while "imp" and "terri" (B. impatiens and B. terricola) have similar wing shapes, they are relatively distant in their relatedness. An important caveat here is that the output of the PCA here is exactly the same as before we considered the phylogeny. In this method, the phylogeny is used for plotting only.
`geomorph` provides a more sophisticated way to represent shape and relatedness in a plot like this. As described by its authors, phylogenetically-aligned PCA (PaCA) "...provides an ordination that aligns phenotypic data with phylogenetic signal, by maximizing variation in directions that describe phylogenetic signal, while simultaneously preserving the Euclidean distances among observations in the data space." (Kaliontzopoulou 2020)
paca <- gm.prcomp(mshape.by.species, phy = btree, align.to.phy = TRUE)
plot(paca, phylo = TRUE, main = "Phylogenetically-aligned PCA")
In this case, the two methods provide very similar results. However, it is wise to try both and compare the results.
Before we discuss modeling, I want to offer a disclaimer. This subject is complex. I urge everyone to take a good statistics course; preferably several. That said, let's briefly review what modeling is and then discuss how it can be done with GMM data.
All statistical models ask some version of a simple question: If a set of measurements has some variance, to what degree do certain factors explain that variance? Many different statistical methods exist to ask this question with different kinds of data.
Morphometric data sets are multivariate. In a traditional ANOVA, one dependent variable is measured and modeled based on one or more factors. The analysis of shape data however is complicated by the fact that the dependent variable (shape) consists of a constellation of coordinate positions and their relative distances from one another. Therefore, specialized multivariate analysis methods are necessary.
`geomorph` provides functions for model construction and comparison (Collyer 2020).
In arguments to the function `procD.lm`, you describe a model using the `~` (tilde) character. I think of this as meaning "y is a function of x".
Below we will model shapes (`coords`) as a function of the log of centroid size (`log(Csize)`). These data all come from the `wing.gpa$gdf` geomorph data frame.
i <- 1e3 - 1  # number of iterations
size.model <- procD.lm(coords ~ log(Csize), data = wing.gpa$gdf, iter = i)
The model object can then be examined. By passing it to the `anova` function, we get an ANOVA table where we can examine the statistics and p-values for individual factors in the model.
size.model
anova(size.model)
Linear Model fit with lm.rrpp

Number of observations: 97
Number of dependent variables: 40
Data space dimensions: 40
Sums of Squares and Cross-products: Type I
Number of permutations: 1000
Analysis of Variance, using Residual Randomization
Permutation procedure: Randomization of null model residuals
Number of permutations: 1000
Estimation method: Ordinary Least Squares
Sums of Squares and Cross-products: Type I
Effect sizes (Z) based on F distributions

           Df       SS        MS    Rsq      F     Z Pr(>F)
log(Csize)  1 0.005193 0.0051933 0.0633 6.4194 4.142  0.001 **
Residuals  95 0.076855 0.0008090 0.9367
Total      96 0.082049
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Call: procD.lm(f1 = coords ~ log(Csize), iter = i, data = wing.gpa$gdf)
From these results, we can conclude that wing size is a strong predictor of wing shape.
Let's try some more complex models.
species.model <- procD.lm(coords ~ log(Csize) + species, data = wing.gpa$gdf, iter = i)
sp.caste.model <- procD.lm(coords ~ log(Csize) + species + caste, data = wing.gpa$gdf, iter = i)
anova(species.model)
anova(sp.caste.model)
           Df       SS        MS     Rsq      F       Z Pr(>F)
log(Csize)  1 0.005193 0.0051933 0.06330 10.665  5.1665  0.001 **
species     7 0.034002 0.0048575 0.41442  9.975 10.6790  0.001 **
Residuals  88 0.042853 0.0004870 0.52229
Total      96 0.082049
           Df       SS        MS     Rsq       F       Z Pr(>F)
log(Csize)  1 0.005193 0.0051933 0.06330 10.8000  5.1911  0.001 **
species     7 0.034002 0.0048575 0.41442 10.1016 10.7222  0.001 **
caste       1 0.001018 0.0010181 0.01241  2.1173  2.1400  0.025 *
Residuals  87 0.041835 0.0004809 0.50988
Total      96 0.082049
So far, all these models have factors that seem to be strong predictors of wing shape. But are the more complex models really better than a simple one? In modeling, there is a danger of false confidence in highly complex models. A model with many factors can partition variance so finely that it appears to fit the data very well. Every factor may have a very low p-value. But is a more complex model a meaningful improvement over a simple one?
`geomorph` allows comparisons among models by simply passing multiple model objects to the `anova` function. (As an aside, this is not the same `anova` function that is used by some other R packages. When you load `geomorph`, it will overwrite those others for the duration of your R session, or until you load a package with another function called `anova`.)
The first model passed to `anova` is taken as the null model. Each of the others is compared to the null.
```r
anova(size.model, species.model, sp.caste.model)
```

```
Analysis of Variance, using Residual Randomization
Permutation procedure: Randomization of null model residuals
Number of permutations: 1000
Estimation method: Ordinary Least Squares
Effect sizes (Z) based on F distributions

                                      ResDf Df      RSS       SS        MS     Rsq      F      Z     P Pr(>F)
coords ~ log(Csize) (Null)               95  1 0.076855  0.00000                                            
coords ~ log(Csize) + species            88  7 0.042853 0.034002 0.0048575 0.41442 9.9750 10.679 0.001      
coords ~ log(Csize) + species + caste    87  8 0.041835 0.035021 0.0043776 0.42683 9.1036 10.899 0.001      
Total                                    96    0.082049                                                     
```
This result suggests that both of the more complex models are improvements over the simple null model that explains wing shape by size alone. The two more complex models can then be compared directly to one another.
```r
anova(species.model, sp.caste.model)
```

```
                                      ResDf Df      RSS        SS        MS      Rsq      F    Z     P Pr(>F)
coords ~ log(Csize) + species (Null)     88  1 0.042853  0.000000                                           
coords ~ log(Csize) + species + caste    87  1 0.041835 0.0010181 0.0010181 0.012409 2.1173 2.14 0.025      
Total                                    96    0.082049                                                     
```
In this case, the model that adds "caste" as a predictive factor does not appear to be much better than the model with only size and species as factors.
In the examples above, we compared models that included multiple factors. So our "null" model in each comparison simply needed to omit the factor of interest (e.g. `caste` in the last example). What if we wanted to make a comparison and our model only had one factor? What would be our null model then? We can actually define a silly null model that contains no factors at all: `coords ~ 1`

```r
null.model <- procD.lm(coords ~ 1, data = wing.gpa$gdf, iter = i)
```
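With this intercept-only null in hand, a one-factor model can then be compared against it in the usual way. A minimal sketch, using the `null.model` and `size.model` objects defined earlier in this vignette:

```r
# Compare the size-only model against the intercept-only null model.
# Both models were fitted above with procD.lm on wing.gpa$gdf.
anova(null.model, size.model)
```

The first argument is again treated as the null, so the resulting table reports whether adding `log(Csize)` explains significantly more shape variance than the intercept alone.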
As with any ANOVA, these tests tell us that species is a factor that explains wing shape, at least in part. But they do not tell us which species differ from one another.
Answering that question requires what are broadly called post hoc tests.
There are many ways to do post hoc tests, but the simplest and one of the most common is to examine all pairwise contrasts. This can be done using a function provided in the `RRPP` package, which underlies much of `geomorph`.
Right now our dataset includes some species with too few samples for the `pairwise` function to operate. You can examine sample sizes like this.

```r
c(with(wing.gpa$gdf, by(species, species, length)))
```

```
bimac   bor  ferv   imp   san  tern terri   vag 
   23     3     2    28     1    26     2    12 
```
So, first we'll subset the data to just the four species that have good representation. Then we'll create a model similar to the one we built earlier, with size and species as predictive factors.
```r
common.sp <- subsetGMM(wing.gpa, specimens = (wing.gpa$gdf$species %in% c("bimac","imp","tern","vag")))
common.sp.model <- procD.lm(coords ~ log(Csize) + species, data = common.sp$gdf, iter = i)
anova(common.sp.model)
```

```
           Df       SS        MS     Rsq       F      Z Pr(>F)   
log(Csize)  1 0.004258 0.0042576 0.06657  9.4733 4.6087  0.001 **
species     3 0.021944 0.0073146 0.34312 16.2753 9.3920  0.001 **
Residuals  84 0.037752 0.0004494 0.59031                         
Total      88 0.063953                                           
```
As before, both factors are significant predictors of wing shape.
We should also construct a null model that omits the factor we're interested in, which in this case is `species`. The `pairwise` function we'll use later can guess at this, but it's good practice to be explicit about defining the null model. It can help avoid incorrect assumptions.

```r
common.size.model <- procD.lm(coords ~ log(Csize), data = common.sp$gdf, iter = i)
```
Now we can conduct the post hoc pairwise comparisons.
```r
common.sp.pw <- pairwise(fit = common.sp.model,
                         fit.null = common.size.model,
                         groups = common.sp$gdf$species)
summary(common.sp.pw)
```

```
Pairwise comparisons

Groups: bimac imp tern vag 

RRPP: 1000 permutations

LS means:
Vectors hidden (use show.vectors = TRUE to view)

Pairwise distances between means, plus statistics
                    d  UCL (95%)         Z Pr > d
bimac:imp  0.02644867 0.01059573 10.404643  0.001
bimac:tern 0.02364860 0.01426434  5.828100  0.001
bimac:vag  0.02433783 0.01368575  6.638058  0.001
imp:tern   0.03171487 0.01286605 10.051005  0.001
imp:vag    0.04044576 0.01301518 13.426703  0.001
tern:vag   0.02323543 0.01431901  5.549254  0.001
```
In this example, all of the species have significant differences from one another in their contribution to wing shape. Importantly, these effects are independent of the influence of wing size on shape. Because the model includes a size factor (`log(Csize)`), that influence is already accounted for. We've asked the `pairwise` function to make comparisons among species, using the argument `groups = common.sp$gdf$species`.
If we wanted to compare the influence of species, controlling for the influence of caste as well, we could do so this way.
```r
common.caste.model <- procD.lm(coords ~ log(Csize) + caste, data = common.sp$gdf, iter = i)
common.sp.caste.model <- procD.lm(coords ~ log(Csize) + species + caste, data = common.sp$gdf, iter = i)
common.sp.caste.pw <- pairwise(fit = common.sp.caste.model,
                               fit.null = common.caste.model,
                               groups = common.sp$gdf$species)
summary(common.sp.caste.pw)
```

```
Pairwise distances between means, plus statistics
                    d  UCL (95%)         Z Pr > d
bimac:imp  0.02626333 0.01297581  8.191782  0.001
bimac:tern 0.02437228 0.01764947  4.328084  0.001
bimac:vag  0.02480890 0.01364798  6.801108  0.001
imp:tern   0.03298336 0.01334316 10.156108  0.001
imp:vag    0.04079689 0.01377951 12.566232  0.001
tern:vag   0.02291126 0.01554668  4.736486  0.003
```
Again, all species appear significantly different from one another in their contributions to wing shape. And these effects are independent of the influences of wing size and caste, which have already been accounted for in the model.
This example absolutely requires you to define a null model for the `pairwise` function.
If you try just passing the full model, the function will guess at the null model and may be wrong.

```r
summary(pairwise(common.sp.caste.model, groups = common.sp$gdf$species))
```

```
                    d  UCL (95%)          Z Pr > d
bimac:imp  0.02626333 0.02980956 -0.4516418  0.665
bimac:tern 0.02437228 0.02801701 -0.4260401  0.653
bimac:vag  0.02480890 0.02921824 -0.2716658  0.606
imp:tern   0.03298336 0.03572867  0.2165722  0.410
imp:vag    0.04079689 0.04448370 -0.1000897  0.541
tern:vag   0.02291126 0.02770863 -0.7676849  0.784
```
What? How can there be no significant pairwise differences? The problem here is that `pairwise` has assumed the wrong null model. If you ever do want to use the short-cut of not specifying a null model, you can check what model the function will assume as its null.

```r
reveal.model.designs(common.sp.caste.model)
```

```
                        Reduced                           Full                                   
log(Csize)                    1                     log(Csize)                                   
species              log(Csize)           log(Csize) + species                                   
caste      log(Csize) + species  log(Csize) + species + caste  <- Null/Full inherent in pairwise
```
As you can see, the "Reduced" model that is assumed here is `log(Csize) + species`.
Since we want to test differences among species, it's essential that we take as our null model one that does not have species in it as a factor. So using this model is what leads to a different (and in this case inappropriate) result.
And while we're talking about ways that these statistical functions can be misused, there's another important issue we should cover in the next section!
Remember, modeling is about assigning variance in the dependent variable (or shape coordinates) to the factors in the model. There are different ways to do this. The modeling functions in the `geomorph` and `RRPP` packages use what's called a sequential sum of squares by default. This method examines the main effect each factor has on the response variable (or shapes), in the order that they're named in the model, before considering any interaction terms. This is generally considered to be the most robust and versatile method. (Other types are used in more specialized situations.)
In practical terms however, this means that building your model with the terms in a different order can sometimes produce different results. For example, in the last section we modeled wing shape as a factor of wing size, species and caste. Let's look at that model again, and compare it against a model that places `caste` before species.

```r
anova(common.sp.caste.model)
```

```
           Df       SS        MS     Rsq       F      Z Pr(>F)   
log(Csize)  1 0.004258 0.0042576 0.06657  9.7017 4.6556  0.001 **
species     3 0.021944 0.0073146 0.34312 16.6678 9.4651  0.001 **
caste       1 0.001328 0.0013277 0.02076  3.0254 3.0695  0.003 **
Residuals  83 0.036424 0.0004388 0.56954                         
Total      88 0.063953                                           
```

```r
common.caste.sp.model <- procD.lm(coords ~ log(Csize) + caste + species, data = common.sp$gdf, iter = i)
anova(common.caste.sp.model)
```

```
           Df       SS        MS     Rsq       F      Z Pr(>F)   
log(Csize)  1 0.004258 0.0042576 0.06657  9.7017 4.6556  0.001 **
caste       1 0.002269 0.0022685 0.03547  5.1693 3.7851  0.001 **
species     3 0.021003 0.0070010 0.32841 15.9531 9.3146  0.001 **
Residuals  83 0.036424 0.0004388 0.56954                         
Total      88 0.063953                                           
```
Same factors; different order. Notice that the F statistics and p values (`Pr(>F)` in these tables) are slightly different. In this example, it wouldn't radically change your interpretation, but with a different dataset the difference might be more extreme.
So what order should you use? In general, you should place what you consider to be the most influential factors first. If you're not sure, it's worth examining both orders. If the outcomes differ a lot, then that may be a sign the factors have a strong interaction (e.g. caste influences wing shape, but differently in different species). If so, you should test a model that includes such an interaction, and see how it stacks up against the simpler model.
We can also model the influence of interactions. For example, we already have support for the hypotheses that species and caste influence shape independently. But what if certain combinations of those factors have a shape that is unique and different from the influences of each factor alone? We can examine that by comparing models with and without an interaction term.
```r
common.spXcaste.model <- procD.lm(coords ~ log(Csize) + species * caste, data = common.sp$gdf, iter = i)
anova(common.sp.caste.model, common.spXcaste.model)
```

```
              Df       SS        MS     Rsq       F      Z Pr(>F)   
log(Csize)     1 0.004258 0.0042576 0.06657 10.0677 4.7297  0.001 **
species        3 0.021944 0.0073146 0.34312 17.2965 9.5719  0.001 **
caste          1 0.001328 0.0013277 0.02076  3.1396 3.1617  0.001 **
species:caste  2 0.002170 0.0010849 0.03393  2.5655 3.4042  0.002 **
Residuals     81 0.034254 0.0004229 0.53562                         
Total         88 0.063953                                           
```

```
                                      ResDf Df      RSS        SS        MS      Rsq      F      Z     P Pr(>F)
coords ~ log(Csize) + species + caste (Null)    83  1 0.036424  0.000000                                      
coords ~ log(Csize) + species * caste    81  2 0.034254 0.0021699 0.0010849 0.033929 2.5655 3.4042 0.002      
Total                                    88    0.063953                                                       
```
As we suspected, in this case there is a significant interaction of `species` and `caste` that significantly improves the model fit. (Careful readers will note that this might not be a solid conclusion, since the sample sizes for non-worker castes are not nearly as large as the workers. This lack of balanced sampling could impact the analysis.)
You can specify an interaction in a model using the `*` symbol, as in the example above, or in general terms as `Y ~ A * B`. However, you can also name the interaction term specifically using `:`. In general, this looks like `Y ~ A + B + A:B`. Or we could modify the call above to become...

```r
common.spXcaste.model <- procD.lm(coords ~ log(Csize) + species + caste + species:caste, data = common.sp$gdf, iter = i)
```
...it will accomplish the same thing! This `:` notation can be useful if you have a complex model with many factors where you want to be very explicit about the interactions you test!
Size and shape are often correlated. This means that as specimens get larger, their shape changes in a predictable way. However, we may want to ask if that relationship is consistent for all of the groups in our dataset. Do species all share a common allometry or do some species have unique allometries? The way to approach this statistically is to compare a model, where size and species independently influence shape, to one that allows for an interaction between those two terms. The function `plotAllometry` also provides a way to visualize these relationships.

```r
species.unique.model <- procD.lm(coords ~ log(Csize) * species, data = wing.gpa$gdf, iter = i)
plotAllometry(fit = species.unique.model,
              size = log10(wing.gpa$gdf$Csize),
              col = as.factor(wing.gpa$gdf$species),
              xlab = "log10 centroid size")
```
We can also use the `borealis` function `gg.scaling.plot` to look at the relationship between size and wing shape (represented by PC1 values).

```r
gg.scaling.plot(
  x = log10(wing.gpa$gdf$Csize),
  y = wing.pca$x[,1],
  group = wing.gpa$gdf$species,
  group.title = "species",
  xlab = "log10 wing centroid size",
  ylab = "wing shape PC1",
  include.legend = TRUE,
  groups.trendlines = TRUE,
  fixed.aspect = FALSE
)
```
Note that the two plots above are fairly different. This is because we used PC1 values in the `gg.scaling.plot`, while `plotAllometry` is displaying fitted values based on the model of unique allometries.
Regardless of which plot we focus on, the differences in slopes among different species indicate that there might be some variation in allometries. We can test that hypothesis statistically, by comparing models.
```r
anova(species.model, species.unique.model)
```

```
                               ResDf Df      RSS        SS         MS      Rsq      F      Z     P Pr(>F)
coords ~ log(Csize) + species (Null)    88  1 0.042853  0.000000                                        
coords ~ log(Csize) * species     82  6 0.038875 0.0039784 0.00066307 0.048489 1.3987 1.7219 0.043      
Total                             96    0.082049                                                        
```
The results here are inconclusive. The benefit in model fit gained by adding unique influences of size for each species produces only a marginal difference.
However, it's worth noting that this dataset is not ideal to answer this question. The sampling is very uneven across species. Ideally, each species would be represented by equal numbers of specimens spanning similar size ranges.
Size often has a dominant effect on shape. Therefore it can be helpful to examine aspects of shape that remain when the effects of allometry are subtracted out.
```r
allometry.corrected.pca <- gm.prcomp(size.model$residuals)
ggGMMplot(allometry.corrected.pca,
          group = wing.gpa$gdf$species,
          group.title = 'species',
          convex.hulls = TRUE,
          include.legend = TRUE)
```
After this correction, we see more separation of the groups in the center, which may reveal aspects of their shape that are different, independent of differences in their size.
One general feature of ANOVA is that it assumes the levels of each factor are independent of one another.
For example, if the `species` factor has 8 different values, ANOVA assumes that there's no connection between them.
However, we know that some species share ancestors more recently than others.
Therefore, this assumption of independence doesn't hold when comparing groups with an evolutionary history.
Worse perhaps is the fact that the degree of relatedness (and therefore the extent of non-independence) will vary among individual species.
Thankfully, methods have been developed to incorporate knowledge of species relationships into an ANOVA. `geomorph` includes a function, `procD.pgls`, which will build models similar to `procD.lm` while also accounting for phylogeny. What's happening under the hood here is that the tree is used in the permutations that generate the test statistics. So while the output is similar for `procD.pgls` and `procD.lm`, the former is more reliable if you have a good phylogeny available.
One limitation of this approach is that the model can only include one representative from each species (or each tip of the tree, at whatever taxonomic level you're working with).
For this example, let's examine the influence of wing size on shape, in a phylogenetic context. We can use the same shape data we created earlier, `mshape.by.species`, which contains mean worker wing shapes for the 7 most heavily sampled species. The object `btree` also contains the phylogeny for those species. We can calculate the mean centroid wing size from the larger dataset, then combine these elements into a new `geomorph.data.frame`.

```r
species.gdf <- geomorph.data.frame(
  coords = mshape.by.species,
  Csize = c(by(workers$gdf$Csize, workers$gdf$species, mean)),
  tree = btree
)
size.pgls <- procD.pgls(coords ~ log(Csize), phy = tree, data = species.gdf, iter = i)
anova(size.pgls)
```

```
           Df       SS        MS     Rsq     F      Z Pr(>F)
log(Csize)  1 0.016836 0.0168363 0.25317 1.695 1.1565  0.135
Residuals   5 0.049665 0.0099331 0.74683                    
Total       6 0.066502                                      
```
In this analysis size does not have a significant influence on wing shape. If you recall, when we looked at the output of `anova(size.model)`, size was a significant influence on shape. To interpret these results, it's important to remember that earlier we had a larger dataset, and including size as a factor tested whether individual size differences influenced individual wing shapes. Here we're asking a slightly different question: whether species-level size influences the mean worker wing shape of the species.
You can build multiple models with `procD.pgls` and compare them in the same way as with `procD.lm`, by passing both models to the `anova` function. (With our example dataset here, we don't have another variable that makes sense to add to the model.)
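To illustrate the pattern anyway, here is a hypothetical sketch of such a comparison. It assumes a second species-level variable, an imaginary `habitat` column in `species.gdf`, which this vignette's dataset does not actually include, so the code is shown commented out.

```r
# Hypothetical sketch only: `habitat` is an imaginary species-level factor,
# not part of this vignette's dataset.
# habitat.pgls <- procD.pgls(coords ~ log(Csize) + habitat, phy = tree,
#                            data = species.gdf, iter = i)
# anova(size.pgls, habitat.pgls)
```

As with `procD.lm`, the first model passed to `anova` would be treated as the null.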
Organisms vary. The extent of that variation (disparity) may also be different among different groups. Are these differences greater than expected by chance? That can be tested, as shown below.
```r
morphol.disparity(coords ~ Csize, groups = ~ species, data = wing.gpa$gdf)
```
The output is a table of pairwise mean Procrustes distances and p-values.
We don't have balanced sampling by species or caste in this dataset, making this a rather inappropriate comparison. We could either increase sampling for the underrepresented species, or we could subset the data to only those species with good sample sizes.
```r
c(with(wing.gpa$gdf, by(species, species, length)))
c(with(workers$gdf, by(species, species, length)))
```

```
bimac   bor  ferv   imp   san  tern terri   vag 
   23     3     2    28     1    26     2    12 

bimac   bor  ferv   imp  tern terri   vag 
   12     3     2    26    26     2     8 
```
Let's just compare workers from the three most heavily sampled species.
```r
x <- which(workers$gdf$species %in% c("bimac","imp","tern"))
well.sampled.workers <- subsetGMM(workers, specimens = x)
morphol.disparity(coords ~ Csize, groups = ~ species, data = well.sampled.workers$gdf)
```

```
Procrustes variances for defined groups
       bimac          imp         tern 
0.0007636383 0.0005756033 0.0005171270 

Pairwise absolute differences between variances
             bimac          imp         tern
bimac 0.0000000000 1.880350e-04 2.465113e-04
imp   0.0001880350 0.000000e+00 5.847636e-05
tern  0.0002465113 5.847636e-05 0.000000e+00

P-Values
      bimac   imp  tern
bimac 1.000 0.050 0.008
imp   0.050 1.000 0.426
tern  0.008 0.426 1.000
```
From these results we can conclude that workers of `bimac` (B. bimaculatus) have significantly more variation in wing shapes than workers of the other two species.
An anatomical module is a group of structures, such as landmarks, that vary less among one another within the module than they do relative to structures outside the module. In other words, if we examine the variance in distances between all landmarks in our dataset, and we consider some group of landmarks to be a module, then within-module variance should be lower (and between-module variance higher) than predicted by chance (Adams & Collyer 2019).
The `geomorph` function `modularity.test` implements such a test for two or more modules. Modules are defined by a character vector describing the membership of landmarks in each module.
The most sensible modularity hypotheses will be suggested by anatomical or developmental evidence of greater independence among the hypothetical modules. There are often separate developmental mechanisms governing the proximal and distal regions of appendages, such as the wing. So as an example, we'll test the hypothesis that the wing exhibits modularity in proximal and distal regions, defined by a division just distal of the pterostigma.
```r
modularity.hypothesis1 <- rep("proximal", wing.gpa$landmark.number)
modularity.hypothesis1[c(1,4,5,6:9,17)] <- "distal"
```
We're ready to test whether the two regions exhibit modularity. The function `modularity.test` requires an array of shape data in the argument `A`. The modularity hypothesis is passed into the argument `partition.gp`. The argument `CI` is a logical parameter that says whether the function should calculate a confidence interval on the test statistic. (That's usually useful.) Finally, `iter` specifies the number of iterations.
This function is very computationally demanding. Use a low value of `iter` to explore hypotheses early on, remembering that the lowest possible p-value is 1/`iter`. Then for confirmation, invest compute time in a run with a higher `iter` value for greater precision.
```r
mt1 <- modularity.test(wing.gpa$gdf$coords, modularity.hypothesis1, CI = TRUE, iter = 99)
summary(mt1)
```

```
CR: 0.8787

P-value: 0.01

Effect Size: -3.1206

Based on 1000 random permutations

Confidence Intervals 0.8171 0.9708
```
These results support this modularity hypothesis. However, testing several variations on the hypothesis can be helpful to identify the bounds of a module.
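For example, one might probe the boundary of the module by moving a landmark between partitions. The sketch below follows the same pattern as the test above; the specific index choice (keeping landmark 17 in the proximal set) is illustrative only, not a recommendation for this dataset.

```r
# A sketch of an alternative partition: keep landmark 17 proximal.
# Compare the resulting CR statistic and p-value to those from mt1.
modularity.hypothesis2 <- rep("proximal", wing.gpa$landmark.number)
modularity.hypothesis2[c(1,4,5,6:9)] <- "distal"
mt2 <- modularity.test(wing.gpa$gdf$coords, modularity.hypothesis2, CI = TRUE, iter = 99)
summary(mt2)
```

Comparing the CR effect sizes of competing partitions can help show where the boundary between modules most plausibly lies.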