To measure the effect of our inapplicable algorithm, compared to treating inapplicable tokens as missing data or as an extra state, we ran a standard parsimony search on 22 matrices from 10 taxonomic classes (plants and animals) with variable numbers of characters, taxa and inapplicable data.
|              |group         | taxa| characters| % NA characters| % NA cells|
|:-------------|:-------------|----:|----------:|---------------:|----------:|
|Aguado2009    |Annelids      |   76|        107|            40.2|        5.0|
|Asher2005     |Mammals       |   23|        126|            31.7|        3.9|
|Baker2009     |Angiosperms   |  205|        105|            35.2|        8.3|
|Bouchenak2010 |Angiosperms   |  818|        417|            81.5|       21.1|
|Dikow2009     |Insects       |   88|        220|             8.2|        1.6|
|Eklund2004    |Angiosperms   |   54|        131|            31.3|       10.0|
|Geisler2001   |Mammals       |   68|        186|            46.8|        2.0|
|Giles2015     |Fishes        |   78|        236|            74.2|       17.3|
|Griswold1999  |Spiders       |   43|        137|            43.8|        9.6|
|Liljeblad2008 |Insects       |   68|        308|            31.5|        6.4|
|Loconte1991   |Angiosperms   |   56|        104|            63.5|        4.1|
|Longrich2010  |Dinosaurs     |   19|        119|            26.1|        5.7|
|OMeara2013    |Mammals       |   63|        317|            44.5|        5.4|
|Rougier2012   |Mammals       |   58|        317|             2.8|        0.4|
|Sharkey2011   |Insects       |  105|        392|           100.0|       20.1|
|Sundue2010    |Pteridophytes |  177|        237|            81.9|       26.5|
|Vinther2008   |Annelids      |   23|         57|            43.9|       20.9|
|Wiens2010     |Squamates     |   64|        363|            41.9|        7.7|
|Wilson2003    |Crustaceans   |   61|        165|            43.0|       11.1|
|Wortley2006   |Angiosperms   |   37|        105|            20.0|        3.7|
|Zanol2012     |Annelids      |   80|        230|            69.1|       17.3|
|Zhu2013       |Fishes        |   75|        253|            70.8|       19.3|
For each of these matrices, we ran a heuristic search in PAUP* treating all inapplicable data as missing, and saved the island of n trees (default PAUP* options). The length of the trees on this island, the overall most parsimonious score, is called S~MP~; all trees on this island have the same length S~MP~.
The lengths of the trees from this island are then recalculated by treating inapplicable data as an extra state (or "new state") or by using our inapplicable algorithm. This results in two new lists of n scores (for the n trees of each island), with S~NS~ and S~NA~ being the variable tree lengths for each topology under the "new state" and inapplicable counting algorithms respectively.
We then calculated how many trees were not the shortest in S~NS~ and S~NA~ (i.e. how many scores in these distributions were greater than the minimum) to measure how many trees would not have been discarded when treating inapplicable data as missing. Finally, we calculated how much longer (or shorter!) the S~NS~ and S~NA~ scores were compared to S~MP~, to see how both the "new state" and inapplicable algorithms modify tree length in comparison to treating inapplicable data as missing data.
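As a toy illustration of these two measures (using made-up scores, not the real data), the counts of non-shortest trees and the relative extra lengths can be computed as:

```r
## Hypothetical tree lengths for one island under each counting scheme
S_MP <- 100                    # length of the MP trees (inapplicable as missing)
S_NS <- c(104, 104, 106, 109)  # lengths when inapplicable is an extra ("new") state
S_NA <- c(102, 102, 102, 105)  # lengths under the inapplicable algorithm

## How many trees are longer than the shortest one in each distribution
## (i.e. trees that would be discarded by a default search)
sum(S_NS > min(S_NS))  # 2 trees discarded under "new state"
sum(S_NA > min(S_NA))  # 1 tree discarded under the inapplicable algorithm

## Proportional extra length relative to the most parsimonious score
(S_NS - S_MP) / S_MP   # 0.04 0.04 0.06 0.09
(S_NA - S_MP) / S_MP   # 0.02 0.02 0.02 0.05
```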
Getting the functions
```r
source("Functions/read.tree.score.R")
source("Functions/standardise.score.R")
source("Functions/plot.scores.R")
source("Functions/test.R")
```
The most parsimonious islands treating inapplicable data as missing were obtained using PAUP* (v 4.0a151) with random sequence addition replicated 100 times and a maximum of 5 x 10^6 rearrangements per replicate (`hsearch addseq=random nreps=100 rearrlimit=5000000 limitperrep=yes` in PAUP*).
The lengths of the resulting trees within the islands were then measured using PAWM, treating inapplicable data either as a new state or using our inapplicable algorithm.
We can load these results using the following:
```r
## Setting the path
path <- "Data/Scores/"

## Getting the chain names
chains <- list.files(path, pattern = ".morphy.inapplicable.count")
chains <- unlist(strsplit(chains, split = ".morphy.inapplicable.count"))

## Remove Sutton2012 (no NA)
chains <- chains[-which(chains == "Sutton2012")]

## Loading the tree scores
scores <- sapply(chains, read.tree.score, path, simplify = FALSE)

## Loading the matrices
matrices <- lapply(chains, read.matrices, path = "Data/Matrices")
```
```r
## Get the characteristics of each matrix
## Get my Claddis version (install_github is from the devtools package)
devtools::install_github("TGuillerme/Claddis")
library(Claddis)

## Empty matrix
matrix_characteristics <- matrix(NA, ncol = 5, nrow = length(chains))

for(matrix in 1:length(chains)) {
    ## Get the matrix
    mat_tmp <- ReadMorphNexus(paste0("Data/Matrices/", chains[[matrix]], ".nex"),
                              gap.as.missing = FALSE)$matrix
    ## Number of taxa
    matrix_characteristics[matrix, 2] <- nrow(mat_tmp)
    ## Number of characters
    matrix_characteristics[matrix, 3] <- ncol(mat_tmp)
    ## Proportion of characters with inapplicable data
    matrix_characteristics[matrix, 4] <- length(which(apply(mat_tmp, 2,
        function(X) any(na.omit(X) == "-")))) / ncol(mat_tmp) * 100
    ## Proportion of cells with inapplicable data
    matrix_characteristics[matrix, 5] <- length(which(mat_tmp == "-")) /
        (ncol(mat_tmp) * nrow(mat_tmp)) * 100
}

## Convert into a data.frame (coercing each numeric column individually)
matrix_characteristics <- data.frame(matrix_characteristics)
matrix_characteristics[, 2:5] <- lapply(matrix_characteristics[, 2:5], as.numeric)

## Groups
matrix_characteristics[, 1] <- c("Annelids", "Mammals", "Angiosperms",
    "Angiosperms", "Insects", "Angiosperms", "Mammals", "Fishes", "Spiders",
    "Insects", "Angiosperms", "Dinosaurs", "Mammals", "Mammals", "Insects",
    "Pteridophytes", "Annelids", "Squamates", "Crustaceans", "Angiosperms",
    "Annelids", "Fishes")

## Add the rownames and colnames
colnames(matrix_characteristics) <- c("group", "taxa", "characters",
    "% NA characters", "% NA cells")
rownames(matrix_characteristics) <- chains

## markdown:
knitr::kable(matrix_characteristics, digits = 1)

## LaTeX:
xtable::xtable(matrix_characteristics, digits = 1)
```
The scores are then standardised to reflect the proportional length of each tree relative to the shortest tree, for both S~NS~ and S~NA~.
```r
## Normalising the scores (dividing by the shortest score from each individual search)
scores_short <- mapply(standardise.score, scores, std = "short", SIMPLIFY = FALSE)
```
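`standardise.score` is sourced from `Functions/standardise.score.R` and not shown here; a minimal sketch of the division-by-the-shortest-score standardisation (the `std = "short"` case, as we understand it) could look like:

```r
## Sketch only (not the sourced implementation): divide each score
## distribution by its own shortest score, so that the shortest tree
## gets a standardised score of exactly 1.
standardise.score.sketch <- function(score_list) {
    lapply(score_list, function(scores) scores / min(scores))
}

## Example: a distribution of 1050, 1050, 1071 becomes 1, 1, 1.02
standardise.score.sketch(list(inapp = c(1050, 1050, 1071)))
```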
We can then measure, from these standardised scores, how many trees would have been discarded from a default PAUP* search (i.e. how many values in S~NS~ and S~NA~ are higher than their minimum). Here, the proportion of discarded trees is the number of trees in the most parsimonious island (under each algorithm) that are one step or more longer than the shortest tree.
```r
## Getting the proportion of discarded trees
proportions <- lapply(scores_short, get.proportion)

## Only selecting these proportions for inapplicable or extra state
inapp_proportions <- unlist(lapply(proportions, `[[`, 1))
extra_proportions <- unlist(lapply(proportions, `[[`, 3))

## Combining the data into a data frame
proportions_combined <- data.frame(inapp_proportions, extra_proportions)
colnames(proportions_combined) <- c("Inapplicable", "New state")
```
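`get.proportion` is also sourced from an external file; assuming it simply counts the fraction of standardised scores strictly above the minimum, a sketch might be:

```r
## Sketch (assumed behaviour, not the sourced function): for each score
## distribution, the proportion of trees in an island that are strictly
## longer than the shortest tree of that island.
get.proportion.sketch <- function(scores) {
    lapply(scores, function(x) sum(x > min(x)) / length(x))
}

## Example: 3 out of 4 trees are longer than the shortest one
get.proportion.sketch(list(inapp = c(1, 1.01, 1.01, 1.02)))  # inapp = 0.75
```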
We can then visualise this as follows:
```r
op <- par(bty = "n")
boxplot(proportions_combined, ylab = "Proportion of discarded (longer) trees")
par(op)
```
```r
## Summary table
knitr::kable(t(apply(proportions_combined, 2, summary)), digits = 2)
```
We see here a clear effect of the counting algorithm used, with nearly 100% of trees discarded (the results are rounded) when not treating inapplicable data as missing data. All these trees would still have been saved in a default analysis. Note that, although their quantiles overlap, our inapplicable algorithm produces, on average, slightly fewer trees that are longer than the shortest than treating inapplicable data as a new state does.
These results, however, only indicate how many longer trees are generated under the two counting methods, not how much longer each of them is. We can compare each tree length S~NS~ and S~NA~ to the length S~MP~ obtained when treating inapplicable data as missing. This information can be added to the previous graph in a two-dimensional way: the vertical axis shows the same information as above (the proportion of discarded trees), while the horizontal axis shows, when trees are discarded, their minimum and maximum length relative to S~MP~.
```r
## Normalising the scores (based on the MP island)
scores_MP <- mapply(standardise.score, scores, std = "MP", SIMPLIFY = FALSE)

## Getting the proportion of discarded trees
proportions_MP <- lapply(scores_MP, get.proportion)

## Only selecting these proportions for inapplicable or extra state
inapp_proportions_MP <- unlist(lapply(proportions_MP, `[[`, 1))
extra_proportions_MP <- unlist(lapply(proportions_MP, `[[`, 3))

## Combining the data into a data frame
proportions_MP <- data.frame(inapp_proportions_MP, extra_proportions_MP)
colnames(proportions_MP) <- c("Inapplicable", "New state")

## Getting the relative minimum and maximum tree length from both distributions
max_scores <- lapply(scores_MP, lapply, max)
min_scores <- lapply(scores_MP, lapply, min)

## Only selecting these values for inapplicable or extra state
inapp_max <- unlist(lapply(max_scores, `[[`, 1))
extra_max <- unlist(lapply(max_scores, `[[`, 3))
inapp_min <- unlist(lapply(min_scores, `[[`, 1))
extra_min <- unlist(lapply(min_scores, `[[`, 3))

## Updating the data frame (using the inapp and extra proportions from before)
proportions_combined <- data.frame(inapp_proportions, inapp_min, inapp_max,
                                   extra_proportions, extra_min, extra_max)
```
```r
op <- par(bty = "n")
sauronplot(proportions_combined, names = c("Inapplicable", "New state"),
           xlab = "Max/min proportional extra length to MP trees",
           ylab = "Proportion of discarded trees")
par(op)
```
Again, the y axis shows the same information as above but represented as a line: the black dot is the median, and the solid and dashed lines are the 50% and 95% confidence intervals respectively. The x axis shows the additional tree length relative to the overall most parsimonious tree: each deviation represents the proportional extra length of the shortest tree (to the left of the 0 line) and of the longest tree (to the right).
```r
## Summary table
colnames(proportions_combined)[c(2, 3, 5, 6)] <- c("Minimum extra length (NA)",
    "Maximum extra length (NA)", "Minimum extra length (NS)",
    "Maximum extra length (NS)")
knitr::kable(t(apply(proportions_combined[, c(2, 3, 5, 6)], 2, summary)), digits = 2)
```
Although both algorithms can generate longer trees (even for the minimum extra length), our inapplicable algorithm always generates trees less than 50% longer than the overall most parsimonious trees (apart from a few matrices where nearly every tree would have been discarded in the search).