R/detect_mixed_genomes.R

Defines functions detect_mixed_genomes

Documented in detect_mixed_genomes

# detect mixed genomes
#' @title Detect mixed genomes
#' @description Highlight outliers individual's observed heterozygosity for a quick
#' diagnostic of mixed samples or poor polymorphism discovery due to DNA quality,
#' sequencing effort, etc.

#' @param data (4 options) A file or object generated by radiator:
#' \itemize{
#' \item tidy data
#' \item Genomic Data Structure (GDS)
#' }
#'
#' \emph{How to get GDS and tidy data ?}
#' Look into \code{\link{tidy_genomic_data}},
#' \code{\link{read_vcf}} or
#' \code{\link{tidy_vcf}}.

#' @param detect.mixed.genomes (optional, logical) For use inside radiator pipelines.
#' Default: \code{detect.mixed.genomes = TRUE}.


#' @param ind.heterozygosity.threshold (string, double, optional)
#' Blacklist individuals based on observed heterozygosity (averaged across markers).
#'
#'
#' The string contains 2 thresholds values (min and max).
#' The values are proportions (0 to 1), where 0 turns off the min threshold and
#' 1 turns off the max threshold.
#' Individuals with mean observed heterozygosity higher (>) or lower (<)
#' than the thresholds will be blacklisted.
#'
#' Default: \code{ind.heterozygosity.threshold = NULL} will turn off completely
#' the filter and the function will only output the plots and table of heterozygosity.


#' @param by.strata (optional, logical)
#' Can only be used when \code{interactive.filter = TRUE}, allows to enter
#' heterozygosity thresholds by strata, instead of overall. This is not recommended
#' to use inside \code{\link{filter_rad}} or when doing population discovery.
#' This is a good use when dealing with different species.
#' Default: \code{by.strata = FALSE}.


#' @inheritParams radiator_common_arguments

#' @details To help discard an individual based on his observed heterozygosity
#' (averaged across markers),
#' use the manhanttan plot to:
#' \enumerate{
#' \item contrast the individual with population and overall samples.
#' \item visualize the impact of missigness information (based on population or
#' overall number of markers) and the individual observed heterozygosity. The
#' larger the point, the more missing genotypes.
#' }
#' \strong{Outlier above average:}
#' \itemize{
#' \item potentially represent two samples mixed together (action: blacklist), or...
#' \item a sample with more sequecing effort (point size small): did you merge your replicates fq files ? (action : keep and monitor)
#' \item a sample with poor sequencing effort (point size large) where the genotyped markers are
#' all heterozygotes, verify this with missingness (action: discard)
#' }
#' In all cases, if there is no bias in the population sequencing effort,
#' the size of the point will usually be "average" based on the population or
#' overall number of markers.
#'
#'
#' You can visualize individual observed heterozygosity, choose thresholds and
#' then visualize, choose thresholds and filter markers based on observed
#' heterozygosity in one run with: \pkg{radiator} \code{\link{filter_het}}.

#'
#' \strong{Outlier below average:}
#' \itemize{
#' \item A point with a size larger than the population or overall average (= lots of missing):
#' the poor polymorphism discovery of the sample is probably the result of bad
#' DNA quality, a bias in sequencing effort, etc. (action: blacklist)
#' \item A point with a size that looks average (not much missing):
#' this sample requires more attention (action: blacklist) and conduct more tests.
#' e.g. for biallelic data, look for coverage imbalance between ALT/REF allele.
#' At this point you need to distinguish between an artifact of poor polymorphism discovery
#' or a biological reason (highly inbred individual, etc.).
#' }
#'
#' \strong{heterozygosity and missing data:}
#' If you see a pattern with the heterozygosity and missing data,
#' try changing the genotyping rate required to keep markers and/or individuals.


#' @return The function returns inside the global environment a list with
#' 5 objects:
#'
#' \enumerate{
#' \item the individual's heterozigosity (\code{$individual.heterozygosity})
#' a dataframe containing for each individual, the population id, the number of
#' genotyped markers, the number of missing genotypes (based on the number of
#' markers of the population and overall), the number of markers genotyped as heterozygote
#' and it's proportion based on the number of genotyped markers.
#' \item the heterozygosity statistics per populations and overall:\code{$heterozygosity.statistics}
#' \item the blacklisted individuals if \code{ind.heterozygosity.threshold} was selected: \code{$blacklist.ind.het}
#' \item the boxplot of individual heterozygosity:\code{$individual.heterozygosity.boxplot}
#' \item the manhattan plot of individual heterozygosity (\code{$individual.heterozygosity.manhattan.plot})
#' contrasted with missingness proportion based on the number of markers (population or overall).
#' The 2 facets will be identical when the dataset as common markers between
#' the populations. The dotted lines are the mean hetegozygosities.
#' }

#' @rdname detect_mixed_genomes
#' @export
#' @examples
#' \dontrun{
#' #Step1: highlight outlier individuals, the simplest way to run:
#'
#' data <- radiator::detect_mixed_genomes(data = "wombat_tidy.rad")
#' # where the .rad file is a tidy data set generated by radiator.
#'
#' # Or if you don't have a tidy df:
#'
#' data <- radiator::tidy_genomic_data(
#'     data = "wombat.vcf",
#'     strata = "strata.wombat.tsv"
#'     ) %>%
#' radiator::detect_mixed_genomes(.)
#'
#' }

#' @note Thanks to Peter Grewe for very useful comments
#' on previous version of this function.

#' @author Thierry Gosselin \email{thierrygosselin@@icloud.com}

detect_mixed_genomes <- function(
  data,
  interactive.filter = TRUE,
  detect.mixed.genomes = TRUE,
  ind.heterozygosity.threshold = NULL,
  by.strata = FALSE,
  verbose = TRUE,
  parallel.core = parallel::detectCores() - 1,
  ...
) {

  ## Test
  # interactive.filter = TRUE
  # detect.mixed.genomes = TRUE
  # ind.heterozygosity.threshold = NULL
  # verbose = TRUE
  # path.folder <- NULL
  # parameters = NULL
  # parallel.core = parallel::detectCores() - 1
  # by.strata = FALSE
  # internal <- FALSE

  if (interactive.filter || detect.mixed.genomes) {
    if (interactive.filter) verbose <- TRUE

    # Cleanup-------------------------------------------------------------------
    radiator_function_header(f.name = "detect_mixed_genomes", verbose = verbose)
    file.date <- format(Sys.time(), "%Y%m%d@%H%M")
    if (verbose) message("Execution date@time: ", file.date)
    old.dir <- getwd()
    opt.change <- getOption("width")
    options(width = 70)
    timing <- radiator_tic()
    #back to the original directory and options
    on.exit(setwd(old.dir), add = TRUE)
    on.exit(options(width = opt.change), add = TRUE)
    on.exit(radiator_toc(timing, verbose = verbose), add = TRUE)
    on.exit(radiator_function_header(f.name = "detect_mixed_genomes", start = FALSE, verbose = verbose), add = TRUE)

    # Function call and dotslist -------------------------------------------------
    rad.dots <- radiator_dots(
      func.name = as.list(sys.call())[[1]],
      fd = rlang::fn_fmls_names(),
      args.list = as.list(environment()),
      dotslist = rlang::dots_list(..., .homonyms = "error", .check_assign = TRUE),
      keepers = c("path.folder", "parameters", "internal"),
      verbose = FALSE
    )

    # Checking for missing and/or default arguments ------------------------------
    if (missing(data)) rlang::abort("data is missing")

    # Folders---------------------------------------------------------------------
    path.folder <- generate_folder(
      f = path.folder,
      rad.folder = "detect_mixed_genomes",
      internal = internal,
      file.date = file.date,
      verbose = verbose)

    # write the dots file
    write_rad(
      data = rad.dots,
      path = path.folder,
      filename = stringi::stri_join("radiator_detect_mixed_genomes_args_", file.date, ".tsv"),
      tsv = TRUE,
      internal = internal,
      verbose = verbose
    )

    # Detect format --------------------------------------------------------------
    data.type <- radiator::detect_genomic_format(data)

    if (!data.type %in% c("SeqVarGDSClass", "gds.file")) {

      # Tidy data
      data <- radiator::tidy_wide(data = data, import.metadata = TRUE)
      n.pop <- dplyr::n_distinct(data$STRATA)

      # Filter parameter file: generate and initiate -------------------------------
      filters.parameters <- radiator_parameters(
        generate = TRUE,
        initiate = TRUE,
        update = FALSE,
        parameter.obj = parameters,
        data = data,
        path.folder = path.folder,
        file.date = file.date,
        internal = internal,
        verbose = verbose)

      # highlight heterozygote and missing (optimized for speed depending on data)
      # you see the difference with > 30K SNP

      if (tibble::has_name(data, "GT_BIN")) {
        n.markers.pop <- dplyr::filter(data, !is.na(GT_BIN))
        n.markers.overall <- dplyr::n_distinct(data$MARKERS[!is.na(data$GT_BIN)])
      } else {
        n.markers.pop <- dplyr::filter(data, GT != "000000")
        n.markers.overall <- dplyr::n_distinct(data$MARKERS[data$GT != "000000"])
      }
      n.markers.pop <- dplyr::distinct(n.markers.pop, MARKERS, STRATA) %>%
        dplyr::count(x = ., STRATA)

      # For figure het and missingness:
      # 1 or 2 facets that include missingness computed by pop or overall
      missing.group <- "1"
      if (length(unique(n.markers.pop$n)) == 1 && unique(n.markers.pop$n) == n.markers.overall) {
        missing.group <- "1"
      } else {
        missing.group <- "2"
      }

      if (tibble::has_name(data, "GT_BIN")) {
        het.summary <- dplyr::mutate(
          .data = data,
          HET = dplyr::if_else(GT_BIN == 1, 1, 0, missing = 0)) %>%
          dplyr::group_by(INDIVIDUALS) %>%
          dplyr::mutate(GENOTYPED = length(GT_BIN[!is.na(GT_BIN)])) %>%
          dplyr::ungroup(.)
      } else {
        het.summary <- dplyr::mutate(
          .data = data,
          HET = dplyr::if_else(
            stringi::stri_sub(GT, 1, 3) != stringi::stri_sub(GT, 4, 6), 1, 0)) %>%
          dplyr::group_by(INDIVIDUALS) %>%
          dplyr::mutate(GENOTYPED = length(INDIVIDUALS[GT != "000000"])) %>%
          dplyr::ungroup(.)
      }

      # Step 1. Highlight individual's heterozygosity  -----------------------------
      # Heterozygosity at the individual level before looking at the markers level per population
      # It's a good way to do outlier diagnostic ... mixed individuals

      # Create a new df with heterozygote info

      if (is.factor(het.summary$STRATA)) {
        pop.levels <- levels(het.summary$STRATA)
      } else {
        pop.levels <- unique(het.summary$STRATA)
      }

      het.ind <- dplyr::select(.data = het.summary, STRATA, INDIVIDUALS, HET, GENOTYPED) %>%
        dplyr::full_join(n.markers.pop, by = "STRATA") %>%
        dplyr::group_by(INDIVIDUALS) %>%
        dplyr::mutate(
          MISSING_PROP_POP = (n - GENOTYPED) / n,
          MISSING_PROP_OVERALL = (n.markers.overall - GENOTYPED) / n.markers.overall
        ) %>%
        dplyr::group_by(INDIVIDUALS, STRATA) %>%
        dplyr::summarise(
          GENOTYPED = unique(GENOTYPED),
          MISSING_PROP_POP = unique(MISSING_PROP_POP),
          MISSING_PROP_OVERALL = unique(MISSING_PROP_OVERALL),
          HET_NUMBER = length(HET[HET == 1]),
          HET_PROP = HET_NUMBER / GENOTYPED,
          .groups = "keep"
        ) %>%
        dplyr::arrange(STRATA, HET_PROP) %>%
        dplyr::ungroup(.) %>%
        readr::write_tsv(
          x = .,
          file = file.path(path.folder, "individual.heterozygosity.tsv"),
          col_names = TRUE)

      het.ind.overall <- dplyr::mutate(.data = het.ind, STRATA = as.character(STRATA)) %>%
        dplyr::bind_rows(dplyr::mutate(.data = het.ind, STRATA = rep("OVERALL", n()))) %>%
        dplyr::mutate(STRATA = factor(STRATA, levels = c(pop.levels, "OVERALL"))) %>%
        radiator::rad_long(
          x = .,
          cols = c("STRATA", "INDIVIDUALS", "GENOTYPED", "HET_NUMBER", "HET_PROP"),
          names_to = "MISSING_GROUP",
          values_to = "MISSING_PROP",
          variable_factor = FALSE
        ) %>%
        dplyr::mutate(MISSING_GROUP = factor(MISSING_GROUP, levels = c("MISSING_PROP_POP", "MISSING_PROP_OVERALL"))) %>%
        dplyr::arrange(STRATA, MISSING_GROUP)

    } else {
      if (data.type == "gds.file") {
        data <- radiator::read_rad(data, verbose = verbose)
        data.type <- "SeqVarGDSClass"
      }
      radiator_packages_dep(package = "SeqArray", cran = FALSE, bioc = TRUE)


      # Filter parameter file: generate and initiate -------------------------------
      filters.parameters <- radiator_parameters(
        generate = TRUE,
        initiate = TRUE,
        update = FALSE,
        parameter.obj = parameters,
        data = data,
        path.folder = path.folder,
        file.date = file.date,
        internal = internal,
        verbose = verbose)

      # GDS....
      missing.group <- "1"

      id.stats <- generate_stats(
        gds = data,
        markers = FALSE,
        missing = TRUE,
        heterozygosity = TRUE,
        coverage = FALSE,
        allele.coverage = FALSE,
        exhaustive = FALSE,
        path.folder = path.folder,
        plot = FALSE,
        file.date = file.date,
        parallel.core = parallel.core,
        verbose = FALSE)

      het.ind <- id.stats$i.info %>%
        dplyr::select(INDIVIDUALS, STRATA, HET_PROP = HETEROZYGOSITY, MISSING_PROP_OVERALL = MISSING_PROP) %>%
        dplyr::mutate(MISSING_PROP_POP = MISSING_PROP_OVERALL)

      # Need the equivalent for this:
      if (is.factor(het.ind$STRATA)) {
        pop.levels <- levels(het.ind$STRATA)
      } else {
        pop.levels <- unique(het.ind$STRATA)
      }
      n.pop <- length(pop.levels)
      id.stats <- NULL

      het.ind.overall <- dplyr::mutate(.data = het.ind, STRATA = as.character(STRATA)) %>%
        dplyr::bind_rows(dplyr::mutate(.data = het.ind, STRATA = rep("OVERALL", n()))) %>%
        dplyr::mutate(STRATA = factor(STRATA, levels = c(pop.levels, "OVERALL"), ordered = TRUE)) %>%
        radiator::rad_long(
          x = .,
          cols = c("STRATA", "INDIVIDUALS", "HET_PROP"),
          names_to = "MISSING_GROUP",
          values_to = "MISSING_PROP"
        ) %>%
        dplyr::mutate(
          MISSING_GROUP = factor(MISSING_GROUP,
                                 levels = c("MISSING_PROP_POP", "MISSING_PROP_OVERALL")),
          MISSING_PROP = as.numeric(MISSING_PROP)) %>%
        dplyr::arrange(STRATA, MISSING_GROUP)
    }

    #Stats------------------------------------------------------------------------
    message("Calculating heterozygosity statistics")

    het.ind.stats <- dplyr::filter(het.ind.overall,
                                   MISSING_GROUP != "MISSING_PROP_POP") %>%
      dplyr::group_by(STRATA) %>%
      dplyr::summarise(
        MEAN = mean(HET_PROP, na.rm = TRUE),
        MEDIAN = stats::median(HET_PROP, na.rm = TRUE),
        SD = stats::sd(HET_PROP, na.rm = TRUE),
        Q25 = stats::quantile(HET_PROP, 0.25, na.rm = TRUE),
        Q75 = stats::quantile(HET_PROP, 0.75, na.rm = TRUE),
        IQR = stats::IQR(HET_PROP, na.rm = TRUE),
        MIN = min(HET_PROP, na.rm = TRUE),
        MAX = max(HET_PROP, na.rm = TRUE),
        OUTLIERS_LOW = Q25 - (1.5 * IQR),
        OUTLIERS_HIGH = Q75 + (1.5 * IQR),
        OUTLIERS_LOW_N = length(HET_PROP[HET_PROP < OUTLIERS_LOW]),
        OUTLIERS_HIGH_N = length(HET_PROP[HET_PROP > OUTLIERS_HIGH]),
        OUTLIERS_TOTAL = OUTLIERS_HIGH_N + OUTLIERS_LOW_N,
        OUTLIERS_PROP = round(OUTLIERS_TOTAL / length(HET_PROP), 3)) %>%
      dplyr::mutate(dplyr::across(where(is.numeric), .fns = round, digits = 6)) %>%
      tidyr::unite(data = ., HET_RANGE, MIN, MAX, sep = " - ") %>%
      dplyr::arrange(STRATA) %>%
      readr::write_tsv(
        x = .,
        file = file.path(path.folder, "heterozygosity.statistics.tsv"),
        col_names = TRUE)


    # Plots ----------------------------------------------------------------------
    message("Generating plots")

    # outlier info
    overall.outlier.low <- het.ind.stats$OUTLIERS_LOW[het.ind.stats$STRATA == "OVERALL"]
    overall.outlier.high <- het.ind.stats$OUTLIERS_HIGH[het.ind.stats$STRATA == "OVERALL"]

    rounder <- function(x, accuracy, f = round) {
      f(x / accuracy) * accuracy
    }
    y.breaks.by <- rounder(max(het.ind$HET_PROP, na.rm = TRUE)/10, 0.001, ceiling)
    y.breaks.max <- rounder(max(het.ind$HET_PROP, na.rm = TRUE), 0.001, ceiling) + (y.breaks.by / 2)
    y.breaks.min <- rounder(min(het.ind$HET_PROP, na.rm = TRUE), 0.001, ceiling) - (y.breaks.by / 2)
    y.breaks <- seq(y.breaks.min, y.breaks.max, by = y.breaks.by)

    # labeller to rename in the facet_grid or facet_wrap call:
    if (missing.group == "2") {
      facet_names <- ggplot2::as_labeller(c(`MISSING_PROP_OVERALL` = "Missing (overall)",
                                            `MISSING_PROP_POP` = "Missing (populations)"))
    } else {
      facet_names <- ggplot2::as_labeller(c(`MISSING_PROP_OVERALL` = "Missing (overall)"))
      het.ind.overall <- dplyr::filter(het.ind.overall, MISSING_GROUP != "MISSING_PROP_POP")
    }
    # manhattan ----------------------------------------------------------------
    het.manhattan <- ggplot2::ggplot(
      data = het.ind.overall,
      ggplot2::aes(x = STRATA, y = HET_PROP, size = as.numeric(MISSING_PROP), colour = STRATA)) +
      ggplot2::geom_jitter(alpha = 0.6) +
      ggplot2::scale_y_continuous(name = "Mean Observed Heterozygosity (proportion)",
                                  breaks = y.breaks, labels = y.breaks, limits = c(y.breaks.min, y.breaks.max)) + #y.breaks
      ggplot2::scale_color_discrete(guide = "none") +
      ggplot2::scale_size_continuous(name = "Missing proportion") +
      ggplot2::labs(
        title = "Genome-wide individual's mean observed heterozygosity",
        subtitle = stringi::stri_join(
          "overall data outlier thresholds (low/high): ", overall.outlier.low, "/",
          overall.outlier.high)
      ) +
      ggplot2::theme(
        panel.grid.major.x = ggplot2::element_blank(),
        plot.title = ggplot2::element_text(size = 12, family = "Helvetica", face = "bold", hjust = 0.5),
        plot.subtitle = ggplot2::element_text(size = 10, family = "Helvetica", hjust = 0.5),
        axis.line.x = ggplot2::element_blank(),
        axis.title.x = ggplot2::element_blank(),
        axis.text.x = ggplot2::element_blank(),
        axis.ticks.x = ggplot2::element_blank(),
        axis.title.y = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
        axis.text.y = ggplot2::element_text(size = 8, family = "Helvetica")
      ) +
      ggplot2::geom_hline(mapping = ggplot2::aes(yintercept = OUTLIERS_LOW),
                          het.ind.stats, linetype = "dashed", size = 0.6) + #low
      ggplot2::geom_hline(mapping = ggplot2::aes(yintercept = MEAN),
                          het.ind.stats, linetype = "dotted", size = 0.6) +#mean
      ggplot2::geom_hline(mapping = ggplot2::aes(yintercept = OUTLIERS_HIGH),
                          het.ind.stats, linetype = "dashed", size = 0.6) + #high
      ggplot2::facet_grid(MISSING_GROUP ~ factor(STRATA), switch = "x", scales = "free",
                          labeller = ggplot2::labeller(MISSING_GROUP = facet_names))
    # het.manhattan

    # n.pop <- 20
    ggplot2::ggsave(
      filename = file.path(path.folder, "individual.heterozygosity.manhattan.plot.pdf"),
      plot = het.manhattan,
      width = max(5 * n.pop, 20), height = 10 * as.numeric(missing.group),
      dpi = 600, units = "cm", useDingbats = FALSE, limitsize = FALSE)

    ggplot2::ggsave(
      filename = file.path(path.folder, "individual.heterozygosity.manhattan.plot.png"),
      plot = het.manhattan,
      width = max(5 * n.pop, 20),
      height = 10 * as.numeric(missing.group),
      dpi = 200,
      units = "cm",
      limitsize = FALSE
    )

    het.bp <- ggplot2::ggplot(data = het.ind.overall,
                              ggplot2::aes(x = STRATA, y = HET_PROP, colour = STRATA)) +
      ggplot2::geom_boxplot() +
      ggplot2::labs(
        x = "Populations",
        title = "Genome-wide individual's mean observed heterozygosity",
        subtitle = stringi::stri_join(
          "overall data outlier thresholds (low/high): ", overall.outlier.low, "/",
          overall.outlier.high),
        colour = "Populations") +
      ggplot2::scale_y_continuous(
        name = "Mean Observed Heterozygosity (proportion)",
        breaks = y.breaks, labels = y.breaks, limits = c(y.breaks.min, y.breaks.max)) + #y.breaks
      # breaks = y.breaks, limits = c(0, y.breaks.max), expand = c(0.06, 0)) +
      ggplot2::theme_classic() +
      ggplot2::theme(
        legend.position = "none",
        plot.title = ggplot2::element_text(size = 12, family = "Helvetica", face = "bold", hjust = 0.5),
        plot.subtitle = ggplot2::element_text(size = 10, family = "Helvetica", hjust = 0.5),
        axis.title.x = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
        axis.text.x = ggplot2::element_text(size = 10, family = "Helvetica"),
        axis.title.y = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
        axis.text.y = ggplot2::element_text(size = 8, family = "Helvetica")
      )
    # het.bp
    ggplot2::ggsave(
      filename = file.path(path.folder, "individual.heterozygosity.boxplot.pdf"),
      plot = het.bp,
      width = max(4 * n.pop, 20), height = 10, dpi = 600, units = "cm",
      useDingbats = FALSE, limitsize = FALSE)


    ggplot2::ggsave(
      filename = file.path(path.folder, "individual.heterozygosity.boxplot.png"),
      plot = het.bp,
      width = max(4 * n.pop, 20), height = 10, dpi = 200, units = "cm",
      limitsize = FALSE)


    het.ind.overall <- NULL

    if (interactive.filter) {
      message("\nThe greatest value of a picture is when it forces us
to notice what we never expected to see.
\nJohn W. Tukey. Exploratory Data Analysis. 1977.")

      message("\n\nInspect plots and tables in folder created...")
      bl.id.het <- radiator_question(
        x = "    Do you want to exclude individuals based on heterozygosity ? (y/n): ", answer.opt = c("y", "n"))


      if (bl.id.het == "y") {

        if (by.strata) {
          ind.het.thresholds.by.strata <- het.ind.stats %>%
            dplyr::select(
              STRATA,
              THRESHOLD_OUTLIERS_LOW = OUTLIERS_LOW,
              THRESHOLD_OUTLIERS_HIGH = OUTLIERS_HIGH
            ) %>%
            dplyr::filter(STRATA != "OVERALL")

          readr::write_tsv(
            x = ind.het.thresholds.by.strata,
            file = file.path(path.folder, "ind.het.thresholds.by.strata.tsv"),
            append = FALSE,
            col_names = TRUE
          )
          message("\nFile written: ind.het.thresholds.by.strata.tsv")
          message("In directory: ", path.folder)
          message("\nOpen in a text editor or MS Excel and adjust the thresholds by strata")
          finished <- radiator_question(
            x = "\nWhen finished with the file, save and come back here and type: `y`:",
            answer.opt = c("y", "Y", "yes", "Yes", "YES"))

          ind.het.thresholds.by.strata <- readr::read_tsv(
            file = file.path(path.folder, "ind.het.thresholds.by.strata.tsv"),
            col_types = "cdd")
          ind.heterozygosity.threshold <- ind.het.thresholds.by.strata

        } else {
          mix.text <- "    Enter the min value for ind.heterozygosity.threshold argument (0 turns off): "
          threshold.min <- radiator_question(x = mix.text, minmax = c(0, 1))
          mix.text <- "    Enter the max value for ind.heterozygosity.threshold argument (1 turns off): "
          # threshold.max <- as.numeric(readLines(n = 1))
          threshold.max <- radiator_question(x = mix.text, minmax = c(0, 1))
          ind.heterozygosity.threshold <- as.numeric(c(threshold.min, threshold.max))
        }
      } else {
        threshold.min <- 0
        threshold.max <- 1
        ind.heterozygosity.threshold <- NULL
      }
    } else {
      if (!is.null(ind.heterozygosity.threshold)) {
        threshold.min <- ind.heterozygosity.threshold[1]
        threshold.max <- ind.heterozygosity.threshold[2]
      }
    }

    ## Step 2: Blacklist outlier individuals -------------------------------------
    if (!is.null(ind.heterozygosity.threshold)) {

      if (by.strata) {
        blacklist.ind.het <- dplyr::ungroup(het.ind) %>%
          dplyr::right_join(ind.het.thresholds.by.strata, by = "STRATA") %>%
          dplyr::filter(HET_PROP > THRESHOLD_OUTLIERS_HIGH | HET_PROP < THRESHOLD_OUTLIERS_LOW) %>%
          dplyr::distinct(INDIVIDUALS)
      } else {
        blacklist.ind.het  <- dplyr::ungroup(het.ind) %>%
          dplyr::filter(HET_PROP > threshold.max | HET_PROP < threshold.min) %>%
          dplyr::distinct(INDIVIDUALS)
      }


      message("Filter individual's heterozygosity: ", length(blacklist.ind.het$INDIVIDUALS), " individual(s) blacklisted")
      readr::write_tsv(
        x = blacklist.ind.het,
        file = file.path(path.folder, "blacklist.ind.het.tsv"),
        col_names = TRUE
      )

    } else {
      blacklist.ind.het <- "ind.heterozygosity.threshold is necessary to get a blacklist of individuals"
    }
    check.mono <- FALSE
    if (!is.null(nrow(blacklist.ind.het)) && nrow(blacklist.ind.het) > 0) {
      check.mono <- TRUE
      n.ind.blacklisted <- length(blacklist.ind.het$INDIVIDUALS)

      if (data.type == "SeqVarGDSClass") {

        id.info <- extract_individuals_metadata(gds = data, whitelist = FALSE) %>%
          dplyr::mutate(
            FILTERS = dplyr::if_else(
              INDIVIDUALS %in% blacklist.ind.het$INDIVIDUALS,
              "detect_mixed_genomes", FILTERS
            )
          )

        update_radiator_gds(gds = data, node.name = "individuals.meta", value = id.info, sync = TRUE)

      } else {
        data  %<>% dplyr::filter(!INDIVIDUALS %in% blacklist.ind.het$INDIVIDUALS)
      }
    }#End blacklisted


    # updating parameters
    if (by.strata && !is.null(ind.heterozygosity.threshold)) {
      ind.het.thresholds.by.strata %<>%
        dplyr::mutate(
          THRESHOLDS = stringi::stri_join(THRESHOLD_OUTLIERS_LOW, THRESHOLD_OUTLIERS_HIGH, sep = "/"),
          VALUES = stringi::stri_join(STRATA, THRESHOLDS, sep = ": ")
        )
      filters.parameters <- radiator_parameters(
        generate = FALSE,
        initiate = FALSE,
        update = TRUE,
        parameter.obj = filters.parameters,
        data = data,
        filter.name = "detect mixed genomes",
        param.name = "ind.heterozygosity.threshold (min/max) by strata",
        values = paste(ind.het.thresholds.by.strata$VALUES, collapse = ", "),
        path.folder = path.folder,
        file.date = file.date,
        internal = internal,
        verbose = verbose)
    } else {
      filters.parameters <- radiator_parameters(
        generate = FALSE,
        initiate = FALSE,
        update = TRUE,
        parameter.obj = filters.parameters,
        data = data,
        filter.name = "detect mixed genomes",
        param.name = "ind.heterozygosity.threshold (min/max)",
        values = paste(threshold.min, threshold.max, collapse = " / "),
        path.folder = path.folder,
        file.date = file.date,
        internal = internal,
        verbose = verbose)
    }
    # results ------------------------------------------------------------------
    if (by.strata && !is.null(ind.heterozygosity.threshold)) {
      radiator_results_message(
        rad.message = stringi::stri_join("Detect mixed genomes: ",
                                         paste(ind.het.thresholds.by.strata$VALUES, collapse = ", ")),
        filters.parameters,
        internal,
        verbose
      )
    } else {
      radiator_results_message(
        rad.message = stringi::stri_join("Detect mixed genomes: ",
                                         paste(threshold.min, threshold.max,
                                               collapse = " / ")),
        filters.parameters,
        internal,
        verbose
      )
    }

    # MONOMORPHIC MARKERS --------------------------------------------------
    if (check.mono) {
      data <- filter_monomorphic(
        data = data,
        parallel.core = parallel.core,
        verbose = FALSE,
        parameters = filters.parameters,
        path.folder = path.folder,
        internal = FALSE)
    }


  }#before this one
  return(data)
}#End detect_mixed_genomes
thierrygosselin/radiator documentation built on May 5, 2024, 5:12 a.m.