R/filter_hwe.R

Defines functions filter_hwe

Documented in filter_hwe

#' @name filter_hwe
#' @title Filter markers based on Hardy-Weinberg Equilibrium
#' @description Testing markers for Hardy-Weinberg proportions is a valuable tool
#' for the analysis and quality control of RADseq datasets.
#' HWE can highligh genotyping errors, presence of null alleles,
#' sequence duplication, copy number variation and other sequencing problems
#' related to read depth.
#' This function is designed for
#' \strong{bi-allelic markers} and uses the exact test from the package
#' HardyWeinberg. The function is speedy because it uses the
#' C++ code developed by Christopher Chang and also available in PLINK.
#' The p-value generated by the function is the mid p-value. It's computed
#' as half the probability of the current sample + the probabilities of all
#' samples that are more extreme (see references below). Several output are
#' generated to help users filter the data (see details).
#'
#' \strong{sampling sites vs well defined populations:} be careful what strata
#' you're investigating and adjust filtering threshold accordinghly for
#' downstream analysis. If you're still at the populations discovery steps,
#' markers under HWD are totally normal.
#'
#' Prior to HW filtering, I highly recommend removing outlier individuals,
#' filtering coverage and genotype likelihood (see details).


#' @param data A tidy data frame object in the global environment or
#' a tidy data frame in wide or long format in the working directory.
#' \emph{How to get a tidy data frame ?}
#' Look into \pkg{radiator} \code{\link{tidy_genomic_data}}.

#' @param filter.hwe (optional, logical) Used inside radiator pipeline.
#' Default: \code{filter.hwe = TRUE}.

#' @param hw.pop.threshold (integer, optional)
#' Remove markers that have a certain number of pops in Hardy-Weinberg
#' disequilibrium.
#' With default, all populations in dataset need to be in HWD before discarding
#' the marker.
#' Default: \code{hw.pop.threshold = NULL}.

#' @inheritParams radiator_common_arguments


#' @param midp.threshold (integer, optional)
#' By default the function generates blacklists/whitelists of markers and
#' filtered tidy datasets for the 5 mid p-value.
#' However, to get a final filtered object associated with the output of the
#' function, user need to choose one
#' of the 5 mid p-value:
#' \itemize{
#' \item \code{1} = 0.05 (*)
#' \item \code{2} = 0.001 (**)
#' \item \code{3} = 0.0001 (***)
#' \item \code{4} = 0.0001 (****)
#' \item \code{5} = 0.00001 (*****)
#' }.
#' With default, a very conservative mid p-value threshold is selected.
#' Default: \code{midp.threshold = 4}.

#' @param filename (optional) The function uses \code{\link[fst]{write.fst}},
#' to write the tidy data frame in
#' the folder created in the working directory. The file extension appended to
#' the \code{filename} provided is \code{.rad}.
#' Default: \code{filename = NULL}.

#' @inheritParams tidy_genomic_data
#' @inheritParams read_strata
#' @inheritParams radiator_common_arguments

#' @rdname filter_hwe
#' @export

#' @details
#' \strong{Interactive version}
#'
#' The user is asked to look at figures before choosing filter thresholds.
#'
#'
#' \strong{HWE threshold}
#'
#' I recommend starting with a low threshold.
#' Serious genotyping errors will generate extreme p-values (e.g. 1e-50),
#' which are detected by any reasonable configuration of this test,
#' while various life-history caracteristics will deflate/inflate
#' Hardy-Weinberg equilibrium.
#' Consequently, it's dangerous to choose a threshold that filters out too many
#' markers.
#'
#' \strong{strategies:}
#' Disk space is cheap! Consequently, the function will automatically generate
#' several blacklists/whitelists of markers and
#' filtered tidy data (in the directory)
#' based on the \code{hw.pop.threshold} for 5 groups of mid p-values:
#' \itemize{
#' \item 1: MID_P_VALUE <= 0.00001: *****
#' \item 2: MID_P_VALUE <= 0.0001: ****
#' \item 3: MID_P_VALUE <= 0.001: ***
#' \item 4: MID_P_VALUE <= 0.01: **
#' \item 5: MID_P_VALUE <= 0.05: *
#' }
#'
#' Test the sensitivity of downstream analysis and delete unwanted datasets.
#'
#' \strong{missing data}
#'
#' The mid-p adjustment tends to bring the null rejection rate in line with the
#' nominal p-value, and also reduces the filter's tendency to favor retention
#' of SNPs with missing data (Graffelman and Moreno, 2013, Purcell et al., 2007).
#'
#' If pattern of missing data is present in the dataset, or when missing data
#' accross markers vary by more than 0.10, you should not apply a single mid-p-value
#' threshold accross markers.
#'
#' \strong{Read depth, pooling lanes/chips and weird pattern of individual heterozygosity}
#'
#' Because of read depth, heterozygote deficiency is usually observed in RADseq data,
#' but if sequencing lanes/chips were combined to generate individuals with more
#' coverage, the situation will likely be the reverse: heterozygote excess.
#' If lanes/chips pooling was used or if highly variable sequencing coverage is
#' observed between individuals and/or markers, there's a couple qc and filtering
#' steps you should do before conducting HWE filtering.
#'
#' \itemize{
#' \item run \pkg{radiator} \code{\link{detect_mixed_genomes}} and
#' \code{\link{detect_het_outliers}} to highlight
#' outlier individuals with potential heterozygosity problems and get an idea of
#' the genotyping and heterozygote miscall rate
#' \item I recommend normalizing the data before \emph{de novo} assembly, if this
#' is not possible...
#' \item use radiator::filter_rad or a more chirurgical approach to
#' coverage and genotype likelihood with radiator::filter_rad.
#' }
#'
#'
#' \strong{Permutation test}
#' Hardy-Weinberg equilibrium refers to the statistical independence of alleles
#' within individuals. This independence can also be assessed by permutation test
#' inside the HardyWeinberg package of Jan Graffelman. To filter out markers with
#' genotyping problems the approach provided in this function is enough.
#'
#' \strong{Power test for HWE}
#' Based on allele count/frequency and sample size. This function is longer to
#' generate for each markers and is on my todo list to include it in this filter.
#'
#'
#' \strong{Markers under selection and genome scans:}
#'
#' Scared of deleting those precious markers or that the filter might interfere
#' with genome scan analysis/detection ? Don't be. Your markers or analysis is no
#' good if it's done on bad data... Test the sensitivity of your downstream
#' analysis with the datasets generated with the different thresholds.


#' @note
#' \strong{Hardy-Weinberg assumptions (refresh):}
#' \enumerate{
#' \item Diploid organisms
#' \item Only reproduction sexual occurs
#' \item Generations are non-overlapping
#' \item Mating is random
#' \item Population size is infinitely large
#' \item Allele frequencies are equal in the sexes
#' \item Migration, Mutation and Selection are negligible
#' }
#'


#' @return A list in the global environment objects:
#' \enumerate{
#' \item $path.folder: the path to the folder generated.
#' \item $hw.pop.threshold: the number of populations tolerated to be in HWD before
#' blacklisting the markers.
#' \item $plot.hwd.thresholds: useful figure that highlight the number of markers
#' blacklisted based on the number of populations in HWD and mid p-value thresholds.
#' \item $plot.tern: ternary plot of markers (currently unavailable until ggtern is updated).
#' \item $hw.manhattan: manhattan plot of markers in Hardy-Weinberg disequilibrium.
#' \item $hwe.pop.sum: a summary tibble with populations, number of markers in total,
#' number of markers monomorphic for the populations,
#' number of markers in Hardy-Weinberg Equilibrium (HWE),
#' number of markers in Hardy-Weinberg Dquilibrium (HWD) with all the different
#' mid p-values observed on the data.
#' \item $midp.threshold: the mid p-value threshold chosen for the final dataset (next)
#' \item $tidy.hw.filtered: the final filtered dataset (oter datasets \code{.rad}
#' are generated automatically by the function, check the folder)
#' }
#'
#' Written in the folder:
#' \enumerate{
#' \item genotypes.summary.tsv: A tibble with these columns:
#' \code{MARKERS, STRATA, HET, HOM_ALT, HOM_REF, MISSING, N,
#' FREQ_ALT, FREQ_REF, FREQ_HET, FREQ_HOM_REF_O, FREQ_HET_O, FREQ_HOM_ALT_O,
#' FREQ_HOM_REF_E, FREQ_HET_E, FREQ_HOM_ALT_E, N_HOM_REF_EXP,
#' N_HET_EXP, N_HOM_ALT_EXP, HOM_REF_Z_SCORE, HOM_HET_Z_SCORE,
#' HOM_ALT_Z_SCORE, READ_DEPTH}

#' \item hw.pop.sum.tsv: a summary tibble with populations, number of markers in total,
#' number of markers monomorphic for the populations,
#' number of markers in Hardy-Weinberg Equilibrium (HWE),
#' number of markers in Hardy-Weinberg Dquilibrium (HWD) with all the different
#' mid p-values observed on the data.
#' \item hwd.helper.table.tsv: useful tibble that highlight the number of markers
#' blacklisted based on the number of populations in HWD and mid p-value thresholds.
#' \item hwd.plot.blacklist.markers.pdf: useful figure that highlight the number of markers
#' blacklisted based on the number of populations in HWD and mid p-value thresholds.
#' \item hwe.manhattan.plot.pdf: manhattan plot of markers in Hardy-Weinberg disequilibrium.
# \item hwe.ternary.plots.missing.data.pdf: ternary plot of markers.
#' \item tidy.filtered.hwe.xxx.mid.p.value.xxx.hw.pop.threshold.rad: several
#' tidy dataset filtered with different mid p value and populations in HWD thresholds
#' \item whitelist.markers.hwe.xxx.mid.p.value.xxx.hw.pop.threshold.tsv: several
#' whitelist of  markers with different mid p value and populations in HWD thresholds
#' \item blacklist.markers.hwd.xxx.mid.p.value.xxx.hw.pop.threshold.tsv: several
#' blacklist of markers with different mid p value and populations in HWD thresholds
#' }
#'
#'


#' @examples
#' \dontrun{
#' require(HardyWeinberg)
#' library(radiator)
#' # for the interactive version (recommended)
#' turtle.pop <- radiator::filter_hwe(
#'    data = "turtle.vcf",
#'    strata = "turtle.strata.tsv",
#'    filename = "hwe.turtle"
#' )
#' }

#' @references Weir, B.S. (1996) Genetic data analysis II. Sinauer Associates,
#' Massachusetts. See Chapter3.
#'
#' @references Wigginton, J.E., Cutler, D.J. and Abecasis, G.R. (2005)
#' A note on exact tests of Hardy-Weinberg equilibrium,
#' American Journal of Human Genetics (76) pp. 887-893.
#'
#' @references Purcell et al. (2007) PLINK: A Toolset for Whole-Genome Association
#' and Population-Based Linkage Analysis.
#' American Journal of Human Genetics 81(3) pp. 559-575.
#'
#' @references Graffelman, J. and Moreno, V. (2013) The mid p-value in exact
#' tests for Hardy-Weinberg equilibrium, Statistical Applications in Genetics
#' and Molecular Biology 12(4) pp. 433-448.
#'
#' @references Graffelman J, Jain D, Weir B (2017)
#' A genome-wide study of Hardy-Weinberg equilibrium with next generation
#' sequence data. Human Genetics, 136, 727-741.


filter_hwe <- function(
  data,
  interactive.filter = TRUE,
  filter.hwe = TRUE,
  strata = NULL,
  hw.pop.threshold = NULL,
  midp.threshold = 4L,
  filename = NULL,
  parallel.core = parallel::detectCores() - 1,
  verbose = TRUE,
  ...
) {
  # obj.keeper <- c(ls(envir = globalenv()), "data")

  if (interactive.filter || filter.hwe) {
    if (interactive.filter) verbose <- TRUE
    ##Testing
    # interactive.filter = TRUE
    # filter.hwe <- TRUE
    # # data <- gds
    # strata = NULL
    # hw.pop.threshold = NULL
    # midp.threshold = 4L
    # filename = NULL
    # parallel.core = parallel::detectCores() - 1
    # verbose = TRUE
    # path.folder <- getwd()
    # parameters = NULL
    # internal <- FALSE

    # required package
    radiator_packages_dep(package = "HardyWeinberg")
    # radiator_packages_dep(package = "ggtern")
    radiator_function_header(f.name = "filter_hwe", verbose = verbose)

    # Cleanup---------------------------------------------------------------------
    file.date <- format(Sys.time(), "%Y%m%d@%H%M")
    if (verbose) message("Execution date@time: ", file.date)
    old.dir <- getwd()
    opt.change <- getOption("width")
    options(width = 70)
    timing <- radiator_tic()
    res <- list()
    #back to the original directory and options
    on.exit(setwd(old.dir), add = TRUE)
    on.exit(options(width = opt.change), add = TRUE)
    on.exit(radiator_toc(timing), add = TRUE)
    on.exit(radiator_function_header(f.name = "filter_hwe", start = FALSE, verbose = verbose), add = TRUE)
    # on.exit(rm(list = setdiff(ls(envir = sys.frame(-1L)), obj.keeper), envir = sys.frame(-1L)))

    # Function call and dotslist -------------------------------------------------
    rad.dots <- radiator_dots(
      func.name = as.list(sys.call())[[1]],
      fd = rlang::fn_fmls_names(),
      args.list = as.list(environment()),
      dotslist = rlang::dots_list(..., .homonyms = "error", .check_assign = TRUE),
      keepers = c("path.folder", "parameters", "internal"),
      verbose = FALSE
    )

    # Checking for missing and/or default arguments ------------------------------
    if (missing(data)) rlang::abort("data is missing")
    if (is.null(midp.threshold)) midp.threshold <- 4L


    # Message about steps taken during the process ---------------------------------
    if (interactive.filter) {
      message("Interactive mode: on")
      if (!is.null(hw.pop.threshold)) {
        message("Disabling hw.pop.threshold value")
        message("   you'll have to enter the value manually after visualization")
      }
    }
    # Folders---------------------------------------------------------------------
    path.folder <- generate_folder(
      f = path.folder,
      rad.folder = "filter_hwe",
      internal = internal,
      file.date = file.date,
      verbose = verbose)

    # write the dots file
    write_rad(
      data = rad.dots,
      path = path.folder,
      filename = stringi::stri_join("radiator_filter_hwe_args_", file.date, ".tsv"),
      tsv = TRUE,
      internal = internal,
      write.message = "Function call and arguments stored in: ",
      verbose = verbose
    )

    # File type detection----------------------------------------------------------
    data.type <- radiator::detect_genomic_format(data)

    if (data.type %in% c("SeqVarGDSClass", "gds.file")) {
      if (data.type == "gds.file") {
        data <- radiator::read_rad(data, verbose = verbose)
      }
      gds.bk <- data
      data <- gds2tidy(gds = data, pop.id = FALSE, parallel.core = parallel.core)
      data %<>% dplyr::rename(STRATA = tidyselect::any_of("POP_ID"))
      data.type <- "tbl_df"
    } else {
      data.bk <- data
      gds.bk <- NULL
    }


    if (data.type %in% c("tbl_df", "fst.file")) {
      message("    using tidy data frame of genotypes as input")
      message("    skipping all filters")

      if (data.type == "fst.file") {
        data <- radiator::read_rad(data)
      }
    } else {
      data <- radiator::tidy_genomic_data(
        data = data,
        vcf.metadata = TRUE,
        blacklist.genotype = NULL,
        strata = strata,
        filename = NULL,
        parallel.core = parallel.core,
        verbose = FALSE
      )
    }

    # Filter parameter file: initiate ------------------------------------------
    filters.parameters <- radiator_parameters(
      generate = TRUE,
      initiate = TRUE,
      update = FALSE,
      parameter.obj = parameters,
      data = data,
      path.folder = path.folder,
      file.date = file.date,
      internal = internal,
      verbose = verbose)

    # create a strata.df
    strata <- radiator::generate_strata(data = data, pop.id = FALSE)

    if (is.factor(strata$STRATA)) {
      pop.id.levels <- levels(strata$STRATA)
    } else {
      pop.id.levels <- unique(strata$STRATA)
    }

    # Check that at least 10 ind/pop ---------------------------------------------
    pop.removed <- dplyr::group_by(strata, STRATA) %>%
      dplyr::tally(.) %>%
      dplyr::filter(n < 10) %>%
      dplyr::distinct(STRATA) %>%
      dplyr::mutate(STRATA = as.character(STRATA)) %>%
      purrr::flatten_chr(.)

    if (length(pop.removed) > 0) {
      message("\n\nStrata removed from analysis because n < 10: ", stringi::stri_join(pop.removed, collapse = ", "))

      if (length(pop.removed) == length(pop.id.levels)) {
        message("\n\nAll strata were removed. Stopping analysis. Returning original data.\n\n")
        run.analysis <- FALSE

        if (!is.null(gds.bk)) {
          return(gds.bk)
        } else {
          return(data.bk)
        }
      } else {
        run.analysis <- TRUE
        message("    Note: removed strata are included back in datasets at the end\n\n")
        data.temp <- dplyr::filter(data, STRATA %in% pop.removed)
        data <- dplyr::filter(data, !STRATA %in% pop.removed)
        if (is.factor(data$STRATA)) data$STRATA <- droplevels(data$STRATA)
        strata <- dplyr::filter(strata, !STRATA %in% pop.removed)
      }

    } else {
      data.temp <- NULL
      run.analysis <- TRUE
    }

    # run.analysis  --------------------------------------------------------------
    if (run.analysis) {
      # Check that data is biallelic
      biallelic <- radiator::detect_biallelic_markers(data, parallel.core = parallel.core)
      if (!biallelic) rlang::abort("This filter requires bi-allelic data")

      # prepare filter, table and figure------------------------------------------
      if (verbose) message("Summarizing data")
      sample.size <- length(unique(data$INDIVIDUALS))
      want <- c("MARKERS", "STRATA", "N", "MISSING", "HOM_REF", "HET", "HOM_ALT", "READ_DEPTH")
      data.sum <- summarise_genotypes(data, path.folder = path.folder) %>%
        dplyr::rename(STRATA = tidyselect::any_of("POP_ID")) %>%
        dplyr::select(tidyselect::any_of(want)) %>%
        dplyr::rename(AA = HOM_REF, AB = HET, BB = HOM_ALT)
      if (is.factor(data.sum$STRATA)) {
        pop.levels <- levels(data.sum$STRATA)
      } else {
        pop.levels <- unique(data.sum$STRATA)
      }
      n.pop <- length(pop.levels)
      if (verbose) message("File written: genotypes.summary.tsv")

      # HWE analysis -------------------------------------------------------------
      data.sum <- hwe_analysis(x = data.sum, parallel.core = 1) %>%
        dplyr::mutate(
          STRATA = factor(STRATA, levels = pop.levels),
          GROUPINGS = factor(
            x = GROUPINGS,
            levels = c("monomorphic", "hwe", "*", "**", "***", "****", "*****"),
            labels = c("monomorphic", "hwe",
                       "midp <= 0.05 (*)",
                       "midp <= 0.01 (**)",
                       "midp <= 0.001 (***)",
                       "midp <= 0.0001 (****)",
                       "midp <= 0.00001 (*****)")),
          MISSING_PROP = MISSING / (MISSING + N)
        )
      # Step 1. Impact of population threshold on marker discovery------------------
      prop_join <- function(x, y) {
        stringi::stri_join(x, " (", round(x / y, 3), ")")
      }

      hwe.pop.sum <- data.sum %>%
        dplyr::group_by(STRATA) %>%
        dplyr::summarise(
          MARKERS_TOTAL = length(MARKERS),
          MONOMORPHIC = length(MARKERS[MONO]),
          HWE = length(MARKERS[HWE & !is.na(HWE)]),
          `HWD*` = length(MARKERS[`*` & !is.na(`*`)]),
          `HWD**` = length(MARKERS[`**` & !is.na(`**`)]),
          `HWD***` = length(MARKERS[`***` & !is.na(`***`)]),
          `HWD****` = length(MARKERS[`****` & !is.na(`****`)]),
          `HWD*****` = length(MARKERS[`****` & !is.na(`*****`)])
        ) %>%
        dplyr::ungroup(.) %>%
        readr::write_tsv(x = ., file = file.path(path.folder, "hw.pop.sum.tsv"))
      if (verbose) message("File written: hw.pop.sum.tsv")

      hwd.markers.pop.sum <- data.sum %>%
        dplyr::filter(!HWE, STRATA != "OVERALL") %>%
        dplyr::select(STRATA, MARKERS, `*`, `**`, `***`, `****`, `*****`) %>%
        radiator::rad_long(
          x = .,
          cols = c("MARKERS", "STRATA"),
          names_to = "SIGNIFICANCE",
          values_to = "VALUE",
          variable_factor = FALSE
        ) %>%
        dplyr::mutate(
          SIGNIFICANCE = factor(SIGNIFICANCE,
                                levels = c("*", "**", "***", "****", "*****"))) %>%
        dplyr::filter(VALUE) %>%
        dplyr::group_by(MARKERS, SIGNIFICANCE) %>%
        dplyr::tally(.) %>%
        dplyr::rename(N_POP_HWD = n) #%>%dplyr::mutate(N_POP_HWD = n.pop -1 - N_POP_HWD)# To get the tolerance to...

      hwd.levels <- sort(unique(hwd.markers.pop.sum$N_POP_HWD))
      n.markers <- dplyr::n_distinct(data$MARKERS)
      `Exact test mid p-value` <- NULL

      overall <- data.sum %>%
        dplyr::filter(!HWE, STRATA == "OVERALL") %>%
        dplyr::select(STRATA, MARKERS, `*`, `**`, `***`, `****`, `*****`) %>%
        radiator::rad_long(
          x = .,
          cols = c("MARKERS", "STRATA"),
          names_to = "SIGNIFICANCE",
          values_to = "VALUE",
          variable_factor = FALSE
        ) %>%
        dplyr::mutate(SIGNIFICANCE = factor(
          x = SIGNIFICANCE,
          levels = c("*", "**", "***", "****", "*****"))) %>%
        dplyr::filter(VALUE) %>%
        dplyr::group_by(MARKERS, SIGNIFICANCE) %>%
        dplyr::tally(.) %>%
        dplyr::rename(N_POP_HWD = n) %>%
        dplyr::group_by(SIGNIFICANCE, N_POP_HWD) %>%
        dplyr::tally(.) %>%
        dplyr::ungroup(.) %>%
        dplyr::mutate(N_POP_HWD = "OVERALL")

      hwd.helper.table.long <- hwd.markers.pop.sum %>%
        dplyr::group_by(SIGNIFICANCE, N_POP_HWD) %>%
        dplyr::tally(.) %>%
        dplyr::ungroup(.) %>%
        dplyr::mutate(N_POP_HWD = as.character(N_POP_HWD)) %>%
        dplyr::bind_rows(overall) %>%
        dplyr::mutate(
          N_POP_HWD = factor(
            x = N_POP_HWD, levels = c("OVERALL", hwd.levels))
        ) %>%
        tidyr::complete(
          data = .,
          SIGNIFICANCE, N_POP_HWD,
          fill = list(n = 0))

      overall <- NULL

      hwd.helper.table <- radiator::rad_wide(
        x = hwd.helper.table.long,
        formula = "N_POP_HWD ~ SIGNIFICANCE",
        values_from = "n"
      ) %>%
        dplyr::filter(N_POP_HWD != 0) %>%
        readr::write_tsv(x = ., file = file.path(path.folder, "hwd.helper.table.tsv"))

      hwd.helper.table.long <- hwd.helper.table.long %>%
        dplyr::rename(`Exact test mid p-value` = SIGNIFICANCE)

      # dot plot with thresholds ---------------------------------------------------
      #Function to replace packageplyr round_any
      rounder <- function(x, accuracy, f = round) {
        f(x / accuracy) * accuracy
      }

      max.markers <- max(hwd.helper.table.long$n)
      if (max.markers >= 1000) {
        y.breaks.by <- rounder(max.markers/10, 100, ceiling)
        y.breaks.max <- rounder(max.markers, 1000, ceiling)
        y.breaks <- seq(0, y.breaks.max, by = y.breaks.by)
      } else {
        y.breaks.by <- rounder(max.markers/10, 10, ceiling)
        y.breaks.max <- rounder(max.markers, 100, ceiling)
        y.breaks <- seq(0, y.breaks.max, by = y.breaks.by)
      }

      plot.hwd.thresholds <- ggplot2::ggplot(
        data = hwd.helper.table.long,
        ggplot2::aes(x = N_POP_HWD, y = n, colour = `Exact test mid p-value`, group = `Exact test mid p-value`)) +
        ggplot2::geom_point(size = 2, shape = 21, fill = "white") +
        ggplot2::geom_line() +
        # ggplot2::scale_x_continuous(name = "Number of populations in HWD", breaks = 1:n.pop - 1) +
        ggplot2::scale_y_continuous(name = "Number of markers blacklisted", breaks = y.breaks, limits = c(0, y.breaks.max)) +
        ggplot2::labs(
          x = "Number of populations in HWD",
          title = "Number of markers blacklisted based on the number of populations in HWD\nand mid p-value thresholds") +
        ggplot2::theme(
          plot.title = ggplot2::element_text(size = 12, face = "bold", hjust = 0.5),
          axis.title.x = ggplot2::element_text(size = 10, face = "bold"),
          axis.title.y = ggplot2::element_text(size = 10, face = "bold"),
          axis.text.x = ggplot2::element_text(size = 8),# , angle = 90, hjust = 1, vjust = 0.5),
          strip.text.x = ggplot2::element_text(size = 10, face = "bold")
        )
      # plot.hwd.thresholds
      if (interactive.filter) print(plot.hwd.thresholds)

      ggplot2::ggsave(
        limitsize = FALSE,
        plot = plot.hwd.thresholds,
        filename = file.path(path.folder, "hwd.plot.blacklist.markers.pdf"),
        width = max(20, n.pop * 5), height = 10,
        dpi = 300, units = "cm", useDingbats = FALSE)
      hwd.helper.table.long <- NULL
      if (verbose) message("Plot written: hwd.plot.blacklist.markers.pdf")

      # if (interactive.filter) {
      #   message("Step 1. Ternary plot visualization")
      # }
      # HardyWeinberg::HWTernaryPlot(
      # X = dplyr::filter(data, STRATA == "ATL") %>%
      # dplyr::select(AA, AB, BB) %>% as.matrix, n = sample.size,
      # region = 1,
      # hwcurve = TRUE,
      # verbose = TRUE)

      # testing with duplicated info removed
      # data.dup <- data2 %>%
      #   dplyr::distinct(MARKERS, AA, AB, BB)

      # ternary plot -----------------------------------------------------------------
      # library(ggtern)
      # num.groups <- dplyr::n_distinct(data.sum$GROUPINGS)
      # if (num.groups == 7) group_colors <- c("grey", "green", "yellow", "orange",
      #                                        "orangered", "red", "darkred")
      # if (num.groups == 6) group_colors <- c("green", "yellow", "orange",
      #                                        "orangered", "red", "darkred")
      # if (num.groups == 5) group_colors <- c("green", "yellow", "orange",
      #                                        "orangered", "red")
      # if (num.groups == 4) group_colors <- c("green", "yellow", "orange",
      #                                        "orangered")
      # if (num.groups == 3) group_colors <- c("green", "yellow", "orange")
      # if (num.groups == 2) group_colors <- c("green", "yellow")
      #
      # # HW Parabola
      # parabola <- tibble::tibble(p = seq(0, 1, by = 0.005)) %>%
      #   dplyr::mutate(AA = p^2, AB = 2 * p * (1 - p), BB = (1 - p)^2, p = NULL)
      # sample.size <- data.sum %>% dplyr::group_by(STRATA) %>%
      #   dplyr::summarise(NN = 2* max(N, na.rm = TRUE))
      #
      # hw_parabola <- function(x, sample.size, parabola) {
      #   pop <- unique(x)
      #   pop.size <- sample.size$NN[sample.size$STRATA == pop]
      #   parabola <- parabola %>%
      #     dplyr::mutate(
      #       STRATA = pop,
      #       NN = pop.size,
      #       AA = AA * NN,
      #       AB = AB * NN,
      #       BB = BB * NN,
      #       NN = NULL,
      #       GROUPINGS = "hwe",
      #       MISSING_PROP = 0)
      #   return(parabola)
      # }

      # hw.parabola <- purrr::map_df(.x = pop.levels, .f = hw_parabola,
      #                              sample.size = sample.size, parabola = parabola) %>%
      #   dplyr::mutate(STRATA = factor(STRATA, pop.levels))
      # parabola <- sample.size <- NULL

      # plot.tern <- ggtern::ggtern(
      #   data = data.sum,
      #   ggtern::aes(AA, AB, BB, color = GROUPINGS, size = MISSING_PROP)) +
      #   ggplot2::scale_color_manual(name = "Exact test mid p-value", values = group_colors) +
      #   ggplot2::scale_size_continuous(name = "Missing genotypes proportion") +
      #   ggplot2::geom_point(alpha = 0.4) +
      #   ggplot2::geom_line(data = hw.parabola, ggplot2::aes(x = AA, y = AB),
      #                      linetype = 2, size = 0.6, colour = "black") +
      #   ggplot2::labs(
      #     x = "AA", y = "AB", z = "BB",
      #     title = "Hardy-Weinberg Equilibrium ternary plots",
      #     subtitle = "genotypes frequencies shown for AA: REF/REF, AB: REF/ALT and BB: ALT/ALT"
      #   ) +
      #   ggplot2::theme(
      #     plot.title = ggplot2::element_text(size = 12, face = "bold", hjust = 0.5),
      #     plot.subtitle = ggplot2::element_text(size = 10, hjust = 0.5)
      #   ) +
      #   ggtern::theme_rgbw() +
      #   ggtern::theme_nogrid_minor() +
      #   ggtern::theme_nogrid_major() +
      #   ggplot2::facet_wrap(~ STRATA)
      # plot.tern
      # ggtern::ggsave(
      #   limitsize = FALSE,
      #   plot = plot.tern,
      #   # filename = file.path(path.folder, "hwe.ternary.plots.read.depth.pdf"),
      #   filename = file.path(path.folder, "hwe.ternary.plots.missing.data.pdf"),
      #   width = max(25, n.pop * 5), height = max(15, n.pop * 4),
      #   dpi = 300, units = "cm", useDingbats = FALSE)
      # hw.parabola <- NULL
      # if (verbose) message("Plot written: hwe.ternary.plots.missing.data.pdf")

      # plot.tern <- "temporarily out of order"

      # Manhattan plot -------------------------------------------------------------
      data.sum.man <- dplyr::mutate(data.sum, X = "x") %>%
        dplyr::filter(MID_P_VALUE < 0.05)
      # rounder <- function(x, accuracy, f = round) {
      #   f(x / accuracy) * accuracy
      # }
      # y.breaks.by <- rounder(max(data.sum$MID_P_VALUE, na.rm = TRUE)/10, 0.001, ceiling)
      # y.breaks.max <- rounder(max(data.sum$MID_P_VALUE, na.rm = TRUE), 0.001, ceiling) + (y.breaks.by / 2)
      # y.breaks.min <- rounder(min(data.sum$MID_P_VALUE, na.rm = TRUE), 0.001, ceiling) - (y.breaks.by / 2)
      # y.breaks <- seq(y.breaks.min, y.breaks.max, by = y.breaks.by)

      num.groups <- dplyr::n_distinct(data.sum.man$GROUPINGS)
      if (num.groups == 5) group_colors <- c("yellow", "orange", "orangered", "red", "darkred")
      if (num.groups == 4) group_colors <- c("yellow", "orange", "orangered", "red")
      if (num.groups == 3) group_colors <- c("yellow", "orange", "orangered")
      if (num.groups == 2) group_colors <- c("yellow", "orange")

      hw.manhattan <- ggplot2::ggplot(
        data = data.sum.man,
        ggplot2::aes(x = X, y = MID_P_VALUE, color = GROUPINGS, size = MISSING_PROP)) +
        ggplot2::geom_jitter(alpha = 0.5) +
        ggplot2::scale_color_manual(name = "Exact test mid p-value", values = group_colors) +
        ggplot2::scale_size_continuous(name = "Missing genotypes proportion") +
        ggplot2::labs(
          x = "Strata",
          y = "Markers mid p-value",
          title = "Manhanttan plot of markers in Hardy-Weinberg disequilibrium"
        ) +
        ggplot2::theme_bw() +
        ggplot2::theme(
          panel.grid.major.x = ggplot2::element_blank(),
          plot.title = ggplot2::element_text(size = 12, face = "bold", hjust = 0.5),
          plot.subtitle = ggplot2::element_text(size = 10, hjust = 0.5),
          axis.line.x = ggplot2::element_blank(),
          # axis.title.x =  ggplot2::element_blank(),
          axis.text.x = ggplot2::element_blank(),
          axis.ticks.x = ggplot2::element_blank(),
          axis.title.y = ggplot2::element_text(size = 10, face = "bold"),
          axis.text.y = ggplot2::element_text(size = 8)
        ) +
        ggplot2::facet_grid(GROUPINGS ~ STRATA, scales = "free")

      # if (interactive.filter) print(hw.manhattan)
      ggplot2::ggsave(
        filename = file.path(path.folder, "hwe.manhattan.plot.pdf"),
        plot = hw.manhattan,
        width = max(20, n.pop * 5), height = 30,
        dpi = 300, units = "cm", useDingbats = FALSE, limitsize = FALSE)
      if (verbose) message("Plot written: hwe.manhattan.plot.pdf")

      # continue with filters or not ------
      if (interactive.filter) {
        hw.q <- "\nDo you want to continue with the filtering ? (y/n):"
        do.hw <- radiator_question(x = hw.q, answer.opt = c("y", "n"))
      } else {
        do.hw <- "y"
      }


      if (do.hw == "y") {

        # generate the blacklist/whitelist -------------------------------------------
        # Generate blacklist of markers with the 4 significance groups
        # threshold (integer) The number of outlier pop you tolerate to deviate from HWE.
        # e.g. if \code{threshold = 2}, blacklist of markers out of HWE in more than (>=)
        # 2 populations will be generated for all significance groupings.

        #leave user with this figure before choosing threshold

        if (is.null(hw.pop.threshold)) hw.pop.threshold <- n.pop - 1
        if (interactive.filter) {
          hw.pop.threshold <- radiator_question(
            x = "\nBased on figures and tables enter the hw.pop.threshold (integer): ",
            minmax = c(0, 100000000))
        }

        # Generating blacklists, whitelists and filtered tidy data -------------------
        if (verbose) message("\nGenerating blacklists, whitelists and filtered tidy data...")
        blacklist_hw(
          x = hwd.markers.pop.sum,
          unfiltered.data = data,
          data.temp = data.temp,
          hw.pop.threshold = hw.pop.threshold,
          path.folder = path.folder,
          pop.id.levels = pop.id.levels)

        # Choosing the last dataset --------------------------------------------
        no.file <- TRUE
        while (no.file) {
          if (interactive.filter) {
            message("\nChoosing the final filtered dataset")
            midp.threshold <- radiator_question(
              x = "   select the mid p-value threshold (5 options):\n1: 0.05 *\n2: 0.01 **\n3: 0.001 ***\n4: 0.0001 ****\n5: 0.00001 *****",
              minmax = c(1, 5)
            )
          }
          midp.threshold <- dplyr::case_when(
            midp.threshold == 5 ~ 0.00001,
            midp.threshold == 4 ~ 0.0001,
            midp.threshold == 3 ~ 0.001,
            midp.threshold == 2 ~ 0.01,
            midp.threshold == 1 ~ 0.05) %>%
            format(., scientific = FALSE)

          # path.folder <- yft.hw$path.folder
          data.filtered.name <- list.files(
            path = path.folder,
            pattern = stringi::stri_join("tidy.filtered.hwe.", midp.threshold),
            full.names = FALSE)
          if (length(data.filtered.name) == 0) {
            message("No file associated with this threshold, choose again")
            no.file <- TRUE
          } else {
            no.file <- FALSE
          }
        }
        message("\nFinal filtered tidy dataset: \n", data.filtered.name)
        message("\nUsing hw.pop.threshold/midp.threshold: ", hw.pop.threshold, "/", midp.threshold)

        data <- list.files(
          path = path.folder,
          pattern = stringi::stri_join("tidy.filtered.hwe.", midp.threshold),
          full.names = TRUE) %>%
          radiator::read_rad(data = .)
      }

      # GDS ----------------------------------------------------------------------
      if (!is.null(gds.bk)) {
        # convert back or filter the gds....

        markers.meta <- extract_markers_metadata(gds = gds.bk)
        bl <- markers.meta %>%
          dplyr::filter(FILTERS == "whitelist") %>%
          dplyr::filter(!MARKERS %in% unique(data$MARKERS))
        markers.meta %<>%
          dplyr::mutate(
            FILTERS = dplyr::if_else(MARKERS %in% bl$MARKERS, "filter.hwe", FILTERS)
          )
        data <- gds.bk
        gds.bk <- bl <- NULL


        update_radiator_gds(
          gds = data,
          node.name = "markers.meta",
          value = markers.meta,
          sync = TRUE)
      }
    }#End run.analysis

    # Update parameters --------------------------------------------------------
    filters.parameters <- radiator_parameters(
      generate = FALSE,
      initiate = FALSE,
      update = TRUE,
      parameter.obj = filters.parameters,
      data = data,
      filter.name = "Filter HWE",
      param.name = paste0("hw.pop.threshold / midp.threshold"),
      values = stringi::stri_join(hw.pop.threshold, midp.threshold,
                                  collapse = " / ", ignore_null = FALSE),
      path.folder = path.folder,
      file.date = file.date,
      internal = internal,
      verbose = verbose)

    # results ------------------------------------------------------------------
    radiator_results_message(
      rad.message = stringi::stri_join("Filter HWE: ", stringi::stri_join(hw.pop.threshold, " / ", midp.threshold,
                                                                          ignore_null = FALSE)),
      filters.parameters,
      internal,
      verbose
    )
  }
  return(data)
}# End filter_hwe

# Internal nested functions ----------------------------------------------------

# hwe analysis
#' @title hwe_analysis
#' @description main hwe function
#' @rdname hwe_analysis
#' @keywords internal
#' @export
#' @author Thierry Gosselin \email{thierrygosselin@@icloud.com}

# Note to myself: will have to enable an option that generate different datasets
# to build confidence interval
# permuting populations


hwe_analysis <- function(x, parallel.core = parallel::detectCores() - 1) {
  hwe_map <- function(x, parallel.core) {
    pop <- as.character(unique(x$STRATA))
    message("HWE analysis for pop: ", pop)
    if (tibble::has_name(x, "STRATA")) x <- dplyr::select(x, -STRATA)
    hwe_radiator <- carrier::crate(function(x, pop = NULL) {
      `%>%` <- magrittr::`%>%`
      `%<>%` <- magrittr::`%<>%`
      mono <- function(x) {
        mono <- length(x$AA[x$AA == 0]) + length(x$AB[x$AB == 0]) + length(x$BB[x$BB == 0])
        if (mono >= 2) {
          mono <- TRUE
        } else {
          mono <- FALSE
        }
        return(mono)
      }

      hw.res <- x %>%
        dplyr::bind_cols(dplyr::rowwise(.) %>% dplyr::do(MONO = mono(.))) %>%
        dplyr::mutate(MONO = unlist(MONO))
      hw.poly <- dplyr::filter(hw.res, !MONO) %>%
        dplyr::select(-MONO)
      markers <- dplyr::select(hw.poly, MARKERS)
      hw.poly <- dplyr::select(hw.poly, AA, AB, BB) #%>% as.matrix

      hw.res <- dplyr::left_join(
        hw.res,
        dplyr::bind_cols(
          markers,
          tibble::tibble(
            MID_P_VALUE = HardyWeinberg::HWExactStats(
              X = hw.poly, plinkcode = TRUE, midp = TRUE)
          )
        )
        , by = "MARKERS") %>%
        dplyr::mutate(
          HWE = MID_P_VALUE >= 0.05,
          SIGNIFICANCE = dplyr::case_when(
            MID_P_VALUE <= 0.00001 ~ "*****",
            MID_P_VALUE <= 0.0001 ~ "****",
            MID_P_VALUE <= 0.001 ~ "***",
            MID_P_VALUE <= 0.01 ~ "**",
            MID_P_VALUE <= 0.05 ~ "*"),
          `*` = MID_P_VALUE <= 0.05,
          `**` = MID_P_VALUE <= 0.01,
          `***` = MID_P_VALUE <= 0.001,
          `****` = MID_P_VALUE <= 0.0001,
          `*****` = MID_P_VALUE <= 0.00001,
          GROUPINGS = SIGNIFICANCE,
          GROUPINGS = dplyr::if_else(
            MONO, "monomorphic",
            dplyr::if_else(HWE, "hwe", GROUPINGS))
        ) %>%
        tibble::add_column(.data = ., STRATA = pop, .after = 1)
      return(hw.res)
    })#hwe_radiator

    x <- radiator_future(
      .x = x,
      .f = hwe_radiator,
      pop = pop,
      flat.future = "dfr",
      split.vec = TRUE,
      split.with = NULL,
      split.chunks = 10L,
      parallel.core = parallel.core,
      forking = TRUE
    )
    return(x)
  }#hwe_map

  x %<>%
    dplyr::group_split(.tbl = ., STRATA, .keep = TRUE) %>%
    purrr::map_df(.x = ., .f = hwe_map, parallel.core = parallel.core)
}#hwe_analysis

#' @title blacklist_hw
#' @description blacklist hw
#' @rdname blacklist_hw
#' @keywords internal
#' @export
#' @author Thierry Gosselin \email{thierrygosselin@@icloud.com}

blacklist_hw <- function(
  x,
  unfiltered.data,
  data.temp,
  hw.pop.threshold,
  path.folder = NULL,
  # filters.parameters.path,
  pop.id.levels
) {
  if (is.null(path.folder)) path.folder <- getwd()

  x <- dplyr::ungroup(x) %>%
    dplyr::filter(N_POP_HWD == hw.pop.threshold)

  bl_map <- function(
    x,
    unfiltered.data,
    hw.pop.threshold,
    path.folder#filters.parameters.path
  ) {

    if (nrow(x) > 0) {
      significance.group <- unique(x$SIGNIFICANCE)
      significance.group <- dplyr::case_when(
        significance.group == "*****" ~ 0.00001,
        significance.group == "****" ~ 0.0001,
        significance.group == "***" ~ 0.001,
        significance.group == "**" ~ 0.01,
        significance.group == "*" ~ 0.05) %>%
        format(., scientific = FALSE)

      blacklist.filename <- file.path(
        path.folder,
        stringi::stri_join(
          "blacklist.markers.hwd", significance.group, "mid.p.value",
          hw.pop.threshold, "hw.pop.threshold", "tsv", sep = "."))

      whitelist.filename <- file.path(
        path.folder,
        stringi::stri_join(
          "whitelist.markers.hwe", significance.group, "mid.p.value",
          hw.pop.threshold, "hw.pop.threshold", "tsv", sep = "."))

      rad.filename <- file.path(
        path.folder,
        stringi::stri_join(
          "tidy.filtered.hwe", significance.group, "mid.p.value",
          hw.pop.threshold, "hw.pop.threshold", "rad", sep = "."))

      # Generate the blacklist
      want <- c("MARKERS", "CHROM", "LOCUS", "POS")
      blacklist <- dplyr::distinct(x, MARKERS) %>%
        dplyr::left_join(
          dplyr::select(unfiltered.data, tidyselect::any_of(want)) %>%
            dplyr::distinct(MARKERS, .keep_all = TRUE)
          , by = "MARKERS") %>%
        readr::write_tsv(
          x = .,
          file = blacklist.filename)

      # Generate the rad data + the whitelist
      if (!is.null(data.temp)) {
        whitelist <- suppressWarnings(
          unfiltered.data %>%
            dplyr::bind_rows(data.temp) %>%
            dplyr::mutate(STRATA = factor(STRATA, levels = pop.id.levels)) %>%
            dplyr::filter(!MARKERS %in% blacklist$MARKERS))
        radiator::write_rad(data = whitelist, path = rad.filename)
        whitelist %>%
          dplyr::select(tidyselect::any_of(want)) %>%
          dplyr::distinct(MARKERS, .keep_all = TRUE) %>%
          readr::write_tsv(x = ., file = whitelist.filename)
      } else {
        whitelist <- suppressWarnings(
          unfiltered.data %>%
            dplyr::filter(!MARKERS %in% blacklist$MARKERS))

        radiator::write_rad(data = whitelist, path = rad.filename)
        whitelist %>%
          dplyr::select(tidyselect::any_of(want)) %>%
          dplyr::distinct(MARKERS, .keep_all = TRUE) %>%
          readr::write_tsv(x = ., file = whitelist.filename)
      }
    }
  }#End bl_map
  split(x = x, f = x$SIGNIFICANCE) %>%
    purrr::walk(
      .x = ., .f = bl_map,
      unfiltered.data = unfiltered.data,
      hw.pop.threshold = hw.pop.threshold,
      path.folder = path.folder)

  return(message("done!"))
}#End blacklist_hw

#' @title update filter parameter file
#' @description update filter parameter file
#' @rdname update_filter_parameter
#' @keywords internal
#' @export
#' @author Thierry Gosselin \email{thierrygosselin@@icloud.com}
update_filter_parameter <- function(filter, unfiltered,
                                    filtered, parameter,
                                    threshold, param.path) {
  # Update filters.parameters SNP ----------------------------------------------
  snp.before <- as.integer(dplyr::n_distinct(unfiltered$MARKERS))
  snp.after <- as.integer(dplyr::n_distinct(filtered$MARKERS))
  snp.blacklist <- snp.before - snp.after

  if (tibble::has_name(unfiltered, "LOCUS")) {
    locus.before <- as.integer(dplyr::n_distinct(unfiltered$LOCUS))
    locus.after <- as.integer(dplyr::n_distinct(filtered$LOCUS))
    locus.blacklist <- locus.before - locus.after
  } else {
    locus.before <- as.character("NA")
    locus.after <- as.character("NA")
    locus.blacklist <- as.character("NA")
  }


  markers.before <- stringi::stri_join(snp.before, locus.before, sep = "/")
  markers.after <- stringi::stri_join(snp.after, locus.after, sep = "/")
  markers.blacklist <- stringi::stri_join(snp.blacklist, locus.blacklist, sep = "/")

  filters.parameters <- tibble::tibble(
    FILTERS = filter,
    PARAMETERS = parameter,
    VALUES = threshold,
    BEFORE = markers.before,
    AFTER = markers.after,
    BLACKLIST = markers.blacklist,
    UNITS = c("SNP/LOCUS"),
    COMMENTS = c("")
  ) %>%
    readr::write_tsv(x = ., file = param.path,
                     append = TRUE, col_names = FALSE)
  return(res = list(filters.parameters, markers.before, markers.blacklist, markers.after))
}#End update_filter_parameter

# testing
# test using nest and purrr map
# test <- data %>% dplyr::group_by(STRATA) %>%
#   tidyr::nest(data = ., .key = "DATA") %>%
#   dplyr::filter(STRATA %in% c("ATL", "MAL")) %>%
#   dplyr::mutate(ANALYSIS = purrr::map(.x = .$DATA, .f = hwe_analysis)) %>%
#   dplyr::select(-DATA) %>%
#   tidyr::unnest(.)

# test using furrr and future
# library(furrr)
# oplan <- future::plan()
# future::plan(multiprocess(workers = parallel.core))
#
# if (dplyr::n_distinct(data$STRATA) > 2 * parallel.core) {
#   future.scheduling <- 2L
# } else {
#   future.scheduling <- 1L
# }
#
# system.time(
#   test <- data %>%
#     dplyr::filter(!STRATA %in% c("OVERALL")) %>%
#     dplyr::group_by(STRATA) %>%
#     tidyr::nest(data = ., .key = "DATA") %>%
#     dplyr::mutate(ANALYSIS = furrr::future_map(
#       .x = .$DATA,
#       .f = hwe_analysis,
#       future.scheduling = future.scheduling)) %>%
#     dplyr::select(-DATA) %>%
#     tidyr::unnest(.)
# )
# on.exit(plan(oplan), add = TRUE)
thierrygosselin/radiator documentation built on May 5, 2024, 5:12 a.m.