Effect of character correlation

Before starting

Loading the functions

Getting the functions

library(dispRity)
library(violplot)
source("Functions/read.tree.score.R")
source("Functions/standardise.score.R")
source("Functions/plot.scores.R")
source("Functions/test.R")

Loading the data

The different tree scores are calculated by first running an heuristic search, treating inapplicable data as missing, and saving the island. The length of the trees on this island is then measured by treating inapplicable as an extra state (new state) or by using our inapplicable algorithm. This results in three list of tree length of n elements (where n is the number of trees on the most parsimonious island): the most parsimonious scores when treating inapplicable tokens as missing; the scores when treating inapplicable as a new state and the scores when treating inapplicable using our algorithm.

## Setting the path
path <- "Data/Scores/"

## Getting the chain names
chains <- list.files(path, pattern = ".morphy.inapplicable.count")
chains <- unlist(strsplit(chains, split = ".morphy.inapplicable.count"))

## Loading the tree scores
scores <- sapply(chains, read.tree.score, path, simplify = FALSE)

## Loading the matrices
matrices <- lapply(chains, read.matrices, path = "Data/Matrices")

The scores are then standardise to reflect the proportional length of the trees compared to the most parsimonious tree.

## Normalising the score
scores_std <- mapply(standardise.score, scores, matrices, std = "MP",
                     SIMPLIFY = FALSE)
## Getting the proportion of trees that are not the shorter
proportions <- lapply(scores, get.proportion)
inapp_proportions <- unlist(lapply(proportions, `[[`, 1))
extra_proportions <- unlist(lapply(proportions, `[[`, 3))
proportions_combined <- data.frame(inapp_proportions, extra_proportions)
colnames(proportions_combined) <- c("Inapplicable", "New state")
boxplot(proportions_combined, ylab = "Proportion of rejected (longer) trees")

Supplementary (useless) analysis

Here we want to test whether using our algorithm actually affects the tree length compared to treating the inapplicable tokens as missing or as a new state.

## Extracting the score using the inapplicable token or a new state
scores_inapp <- lapply(scores_std, `[[`, 1)
scores_newstate <- lapply(scores_std, `[[`, 3)

## Defining the plot limits
ylim <- c(min(c(unlist(scores_inapp), unlist(scores_newstate))),
          max(c(unlist(scores_inapp), unlist(scores_newstate))))
cols <- c("orange", "blue")

## Plotting the relative new state score
boxplot(scores_newstate, las = 2, ylab = "relative score", col = cols[1],
        outline = FALSE, border = cols[1], ylim = ylim, main = "")
boxplot(scores_inapp, xlab = "", ylab = "", main = "", col = cols[2],
        outline = FALSE, border = cols[2], add = TRUE, xaxt = "n", yaxt = "n",
        ylim = ylim)
legend("topleft", legend = c("new state", "inapplicable"), col = cols[1:2],
        pch = 15, cex = 1)

The following plots shows the increase in tree length from treating the inapplicable tokens as missing data (resulting in a relative tree length of 0 after normalisation). It appears that our algorithm algorithm counts the length of the most parsimonious trees systematically longer but also systematically smaller than when treating inapplicable tokens as a new state (which is the expected behaviour).

Here we thus test whether the scores obtained from our algorithm are different from 0 (i.e. there is a change is tree length). The difference is tested using a ranked Wilcoxon test. The p-values are adjusted for the number of tests using the Bonferonni-Holm correction.

## Testing the difference between each pair of methods
test.differences(scores_std, test = wilcox.test, correction = TRUE, what = 1)

For a couple of datasets, the results of using our inapplicable algorithm are not different than 0.

TG: need to correct for number in inapplicable characters among the datasets

Let's now measure whether the algorithm is still different that using inapplicable data as a new state.

We can first measure the probability of overlap between both distributions (inapplicable algorithm vs. new state) using the Bhattacharrya Coefficient:

## Measuring the probability of distribution overlap
test.differences(scores_std, test = bhatt.coeff)

Seems pretty clear...

Second, we can actually compare the means of the distribution similarly as above to compare whether the two algorithms (inapplicable vs. new state) yields significant different results:

## Testing the difference between each pair of methods
test.differences(scores_std, test = wilcox.test, correction = TRUE)

Apart from the couple ones that have low number of inapplicable data (see comment above), there is a difference in using our algorithm (producing shorter trees than treating inapplicable data as new states).



TGuillerme/Inapp documentation built on Feb. 4, 2024, 7:26 a.m.