R/safe_taxonomic_reinsertion.R
In Claddis: Measuring Morphological Diversity and Evolutionary Tempo

Documented in safe_taxonomic_reinsertion

#' Reinsert Safely Removed Taxa Into A Tree
#'
#' @description
#'
#' Safely reinsert taxa in a tree after they were removed from a matrix by Safe Taxonomic Reduction.
#'
#' @param input_filename A Newick-formatted tree file containing tree(s) without safely removed taxa.
#' @param output_filename A file name where the newly generated trees will be written out to (required).
#' @param str_taxa The safe taxonomic reduction table as generated by \link{safe_taxonomic_reduction}.
#' @param multiple_placement_option What to do with taxa that have more than one possible reinsertion position. Options are \code{"exclude"} (does not reinsert them; the default) or \code{"random"} (picks one of the possible positions and uses that - will vary stochastically if multiple trees exist).
#'
#' @details
#'
#' The problem with Safe Taxonomic Reduction (\link{safe_taxonomic_reduction}) is that it generates trees without the safely removed taxa, but typically the user will ultimately want to include these taxa and thus there is also a need to perform "Safe Taxonomic Reinsertion".
#'
#' This function performs that task, given a Newick-formatted tree file and a list of the taxa that were safely removed and the senior taxon and rule used to do so (i.e., the \code{$str_taxa} part of the output from \link{safe_taxonomic_reduction}).
#'
#' Note that this function operates on tree files rather than reading the trees directly into R (e.g., with \link[ape]{ape}'s \link[ape]{read.tree} or \link[ape]{read.nexus} functions) as in practice this turned out to be impractically slow for the types of data sets this function is intended for (supertrees or metatrees). Importantly this means the function operates on raw Newick text strings and hence will only work on data where there is no extraneous information encoded in the Newick string, such as node labels or branch lengths.
#'
#' Furthermore, in some cases safely removed taxa will have multiple taxa with which they can be safely placed. These come in two forms. Firstly, the multiple taxa can already form a clade, in which case the safely removed taxon will be reinserted in a polytomy with these taxa. In other words, the user should be aware that the function can result in non-bifurcating trees even if the input trees are all fully bifurcating. Secondly, the safely removed taxon can have multiple positions on the tree where it can be safely reinserted. As this generates ambiguity, by default (\code{multiple_placement_option = "exclude"}) these taxa will simply not be reinserted. However, the user may wish to still incorporate these taxa and so an additional option (\code{multiple_placement_option = "random"}) allows these taxa to be inserted at any of its' possible positions, chosen at random for each input topology (to give a realistic sense of phylognetic uncertainty. (Note that an exhaustive list of all possible combinations of positions is not implemented as, again, in practice this turned out to generate unfeasibly large numbers of topologies for the types of applications this function is intended for.)
#'
#' @return
#'
#' A vector of taxa which were not reinserted is returned (will be empty if all taxa have been reinserted) and a file is written to (\code{output_filename}).
#'
#' @author Graeme T. Lloyd \email{graemetlloyd@@gmail.com}
#'
#' @seealso
#'
#' \link{safe_taxonomic_reduction}
#'
#' @examples
#'
#' # Generate dummy four taxon trees (where taxa B, D and F were
#' # previously safely excluded):
#' trees <- ape::read.tree(text = c("(A,(C,(E,G)));", "(A,(E,(C,G)));"))
#'
#' # Write trees to file:
#' ape::write.tree(phy = trees, file = "test_in.tre")
#'
#' # Make dummy safe taxonomic reduction taxon list:
#' str_taxa <- matrix(data = c("B", "A", "rule_2b", "D", "C", "rule_2b",
#'   "F", "A", "rule_2b", "F", "C", "rule_2b"), byrow = TRUE, ncol = 3,
#'   dimnames = list(c(), c("junior", "senior", "rule")))
#'
#' # Show that taxa B and D have a single possible resinsertion position,
#' # but that taxon F has two possible positions (with A or with C):
#' str_taxa
#'
#' # Resinsert taxa safely (F will be excluded due to the ambiguity of
#' # its' position - multiple_placement_option = "exclude"):
#' safe_taxonomic_reinsertion(input_filename = "test_in.tre",
#'   output_filename = "test_out.tre", str_taxa = str_taxa,
#'   multiple_placement_option = "exclude")
#'
#' # Read in trees with F excluded:
#' exclude_str_trees <- ape::read.tree(file = "test_out.tre")
#'
#' # Show first tree with B and D reinserted:
#' ape::plot.phylo(x = exclude_str_trees[[1]])
#'
#' # Repeat, but now with F also reinserted with its' position (with
#' # A or with C) chosen at random:
#' safe_taxonomic_reinsertion(input_filename = "test_in.tre",
#'   output_filename = "test_out.tre", str_taxa = str_taxa,
#'   multiple_placement_option = "random")
#'
#' # Read in trees with F included:
#' random_str_trees <- ape::read.tree(file = "test_out.tre")
#'
#' # Confirm F has now also been reinserted:
#' ape::plot.phylo(x = random_str_trees[[1]])
#'
#' # Clean up example files:
#' file.remove(file1 = "test_in.tre", file2 = "test_out.tre")
#'
#' @export safe_taxonomic_reinsertion
safe_taxonomic_reinsertion <- function(input_filename, output_filename, str_taxa, multiple_placement_option = "exclude") {

  # Add some data checks!!!!!

  # Ensure str taxa is formatted as characters:
  str_taxa <- cbind(junior = as.character(str_taxa[, "junior"]), senior = as.character(str_taxa[, "senior"]), rule = as.character(str_taxa[, "rule"]))

  # Read in tree file as raw_text:
  raw_text <- readLines(input_filename)

  # Remove spaces to ensure grep/gsub works properly later:
  raw_text <- gsub(pattern = " ", replacement = "", x = raw_text)

  # Re-sort str list by Junior taxon:
  str_taxa <- str_taxa[order(str_taxa[, "junior"]), ]

  # Find number of seniors for each junior:
  junior_names_and_numbers <- rle(as.character(str_taxa[, "junior"]))

  # Make list of taxa that have a single senior:
  single_position_taxa <- junior_names_and_numbers$values[junior_names_and_numbers$lengths == 1]

  # Collapse to just those not in a polytomy with other taxa in the str list:
  single_position_taxa <- single_position_taxa[is.na(match(single_position_taxa, str_taxa[, "senior"]))]

  # For taxa that have a single replacement:
  if (length(x = single_position_taxa) > 0) {

    # For each single replacement taxon:
    for (i in 1:length(x = single_position_taxa)) {

      # Reinsert into tree next to its senior:
      raw_text <- gsub(pattern = str_taxa[match(single_position_taxa[i], str_taxa[, "junior"]), "senior"], replacement = paste("(", paste(str_taxa[match(single_position_taxa[i], str_taxa[, "junior"]), c("junior", "senior")], collapse = ","), ")", sep = ""), x = raw_text)
    }

    # Remove single replacement taxa from str list:
    str_taxa <- str_taxa[-match(single_position_taxa, str_taxa[, "junior"]), ]

    # Update names and numbers now taxa have been removed:
    junior_names_and_numbers <- rle(as.character(sort(x = str_taxa[, "junior"])))
  }

  # Only worth continuing if there are names still to reinsert:
  if (length(x = junior_names_and_numbers$values) > 0) {

    # Vector to store taxa that only occur in a single polytomy:
    multiple_position_taxa <- vector(mode = "character")

    # For each taxon:
    for (i in 1:length(x = junior_names_and_numbers$values)) {

      # Get taxon name:
      taxon_name <- junior_names_and_numbers$values[i]

      # Find its seniors:
      taxon_seniors <- str_taxa[str_taxa[, "junior"] == taxon_name, "senior"]

      # If its seniors, except for one, are all also juniors then record it (this finds taxa that only exist in a single polytomy in the original tree):
      if (length(x = grep(TRUE, is.na(match(taxon_seniors, str_taxa[, "junior"])))) <= 1) multiple_position_taxa <- c(multiple_position_taxa, taxon_name)
    }

    # If there are taxa that only occur in a single polytomy:
    if (length(x = multiple_position_taxa) > 0) {

      # Reorder from most senior to least:
      taxa_to_delete <- multiple_position_taxa <- multiple_position_taxa[order(junior_names_and_numbers$lengths[match(multiple_position_taxa, junior_names_and_numbers$values)], decreasing = TRUE)]

      # Whilst there are still polytomous taxa in the list:
      while (length(x = multiple_position_taxa) > 0) {

        # Get taxon name with most juniors:
        taxon_name <- multiple_position_taxa[1]

        # Find all of its seniors:
        taxon_seniors <- str_taxa[grep(TRUE, str_taxa[, "junior"] == taxon_name), "senior"]

        # Find senior taxon (already in tree):
        senior_taxon <- taxon_seniors[grep(TRUE, is.na(match(taxon_seniors, str_taxa[, "junior"])))]

        # Replace senior with all juniors in polytomy:
        raw_text <- gsub(pattern = senior_taxon, replacement = paste("(", paste(sort(x = c(taxon_name, taxon_seniors)), collapse = ","), ")", sep = ""), x = raw_text)

        # Remove taxa just dealt with from multiple_position_taxa:
        multiple_position_taxa <- multiple_position_taxa[-sort(x = match(c(taxon_name, taxon_seniors), multiple_position_taxa))]
      }

      # Trims str list down to remaining taxa:
      for (i in 1:length(x = taxa_to_delete)) str_taxa <- str_taxa[-grep(TRUE, str_taxa[, "junior"] == taxa_to_delete[i]), , drop = FALSE]
    }
  }

  # Only keep going if there are taxa still to reinsert:
  if (length(x = str_taxa[, 1]) > 0) {

    # If the user wishes to reinsert remaining taxa at random:
    if (multiple_placement_option == "random") {

      # List unique juniors:
      unique_junior_taxa <- rle(str_taxa[, "junior"])$values[order(rle(str_taxa[, "junior"])$lengths)]

      # For each junior taxon remaining:
      for (i in 1:length(x = unique_junior_taxa)) {

        # Isolate junior taxon name:
        junior_taxon <- unique_junior_taxa[i]

        # Get a senior for each tree:
        taxon_seniors <- sample(str_taxa[grep(TRUE, str_taxa[, "junior"] == junior_taxon), "senior"], length(x = raw_text), replace = TRUE)

        # Remove junior from str list:
        str_taxa <- str_taxa[-grep(TRUE, str_taxa[, "junior"] == junior_taxon), ]

        # Make replacements:
        replacements <- paste("(", paste(junior_taxon, taxon_seniors, sep = ","), ")", sep = "")

        # For each tree:
        for (j in 1:length(x = raw_text)) {

          # Isolate tree:
          newick_string <- raw_text[j]

          # Case if senior ends in a parenthesis:
          if (length(x = grep(paste(taxon_seniors[j], ")", sep = ""), newick_string)) > 0) {

            # Replace senior with combination of junior and senior:
            newick_string <- gsub(pattern = paste(taxon_seniors[j], ")", sep = ""), replacement = paste(replacements[j], ")", sep = ""), x = newick_string)
          }

          # Case if senior ends in a comma:
          if (length(x = grep(paste(taxon_seniors[j], ",", sep = ""), newick_string)) > 0) {

            # Replace senior with combination of junior and senior:
            newick_string <- gsub(pattern = paste(taxon_seniors[j], ",", sep = ""), replacement = paste(replacements[j], ",", sep = ""), x = newick_string)
          }

          # Update tree raw_text:
          raw_text[j] <- newick_string
        }
      }

      # Set uninserted list to null (i.e. empty):
      non_reinserted_taxa <- NULL
    }

    # If the user wishes to exclude remaining taxa:
    if (multiple_placement_option == "exclude") {

      # Set uninserted list to remaining juniors:
      non_reinserted_taxa <- unique(x = str_taxa[, "junior"])
    }

    # If taxa all have a single reinsertion position:
  } else {

    # Set uninserted list to null (i.e. empty):
    non_reinserted_taxa <- NULL
  }

  # Write out tree file:
  write(raw_text, output_filename)

  # Return non_reinserted_taxa:
  non_reinserted_taxa
}