R/filter_ld.R

Defines functions filter_ld

Documented in filter_ld

#' @name filter_ld
#' @title GBS/RADseq short and long distance linkage disequilibrium pruning
#' @description SNP short and long distance linkage disequilibrium pruning.
#'
#' What sets appart radiator LD pruning is the RADseq data tailored arguments:
#' \itemize{
#' \item \strong{minimize short linkage disequilibrium (LD)}:
#' 5 values available for \code{filter.short.ld} argument (see below).
#'
#' \item \strong{reduce long distance LD}:
#' Long distance LD pruning is usually advised to avoid capturing the variance LD
#' in PCA analysis.
#'
#' Use the argument \code{filter.long.ld} with values between 0.7 and 0.9 is a
#' good starting point. Ideally, you want to visualize the LD before choosing a threshold.
#'
#'
#' Strategically, run the function with \code{filter.long.ld} argument and
#' refilter the data using the outlier statistic
#' generated by the function (printed on the figure in the output) and using
#' \code{long.ld.missing = TRUE}. This advanced argument will choose the best SNP
#' based on missing data statistics, instead of choosing randomly one SNP
#' (see details).
#'}
#'
#'
#' This function is used internally in \href{https://github.com/thierrygosselin/radiator}{radiator}
#' and might be of interest for users.

#' @inheritParams radiator_common_arguments


#' @param filter.short.ld (character) 5 options (default: \code{filter.short.ld = "mac"}):
#' \enumerate{
#' \item \code{filter.short.ld = "random"} for a random selection of 1 SNP on the read,
#' \item \code{filter.short.ld = "first"} for the first one on the read...,
#' \item \code{filter.short.ld = "last"} for the last SNP on the read and
#' \item \code{filter.short.ld = "middle"} for locus with > 2 SNPs/read the option to select at random
#' one SNP between the first and the last SNP on the read. If the locus as <= 2
#' SNPs on the read, the first one is selected. Note that for that last option,
#' the numbers are reported.
#' \item \code{filter.short.ld = "mac"} will select the SNP on the locus with the maximum global
#' Minor Allele Count (MAC).
#'
#' Using \code{filter.short.ld = NULL}, skip this filter.
#' }
#'
#' @param filter.long.ld (optional, double) The threshold to prune SNP based on
#' Long Distance Linkage Disequilibrium. The argument filter.long.ld is
#' the absolute value of measurement.
#' Default: \code{filter.long.ld = NULL}.

#' @param filename (optional, character) File name prefix for file written in
#' the working directory.
#' Default: \code{filename = NULL}.


#' @export
#' @rdname filter_ld

#' @details The function requires \href{https://github.com/zhengxwen/SNPRelate}{SNPRelate}
#' (see example below on how to install).
#'
#'
#' \strong{Advance mode, using \emph{dots-dots-dots}}
#' \itemize{
#' \item \code{maf.data} (path) this argument is no longer supported.
#' It's a small cost in time in favour of making sure the MAC/MAF fits the actual
#' data.
#'
#' \item \code{long.ld.missing} (logical) With \code{long.ld.missing = TRUE}.
#' The function first generates long distance LD values between markers along
#' the same chromosome or scaffold with \emph{SNPRelate::snpgdsLDMat}.
#' Based on the LD threshold (\code{filter.long.ld}) SNPs in LD will be pruned
#' based on missingness.
#' e.g. if 4 SNPs are in LD, the 1 SNP selected in
#' the end is base on genotyping rate/missingness. If this statistic is equal
#' between the SNPs in LD, 1 SNP is chosen randomly.
#'
#' Using missigness add extra computational time. To speed the analysis when
#' missingness between markers is not an issue, use \code{long.ld.missing = FALSE}.
#' The function will use \emph{SNPRelate::snpgdsLDpruning}
#' to prune the dataset. SNPs in LD are selected randomly.
#' Default: \code{long.ld.missing = FALSE}.
#' \item \code{ld.method}: (optional, character) The values available are
#' \code{"composite"}, for LD composite measure, \code{"r"} for R coefficient
#' (by EM algorithm assuming HWE, it could be negative), \code{"r2"} for r^2,
#' \code{"dprime"} for D',
#' \code{"corr"} for correlation coefficient. The method corr and composite are
#' equivalent when SNPs are coded based on the presence of the alternate allele
#' (\code{0, 1, 2}).
#' Default: \code{ld.method = "r2"}.
#' \item \code{ld.figures}: (logical) Generate long distance LD statistics and
#' figures.
#' Default: \code{ld.figures = TRUE}
#' \item \code{path.folder}: to write ouput in a specific path
#' (used internally in radiator). Default: \code{path.folder = getwd()}.
#' If the supplied directory doesn't exist, it's created.
#' }
#'
#' @return A list in the global environment, with these objects:
#' \enumerate{
#' \item $ld.summary: tibble with LD statistics used for the boxplot
#' \item $ld.boxplot: box plot of LD values
#' \item $whitelist.ld: whitelist of markers kept after filtering for LD.
#' The argument \code{filter.long.ld} must be used to generate the whitelist.
#' \item $blacklist.ld: blacklist of markers prunned during the filtering
#' for LD.
#' The argument \code{filter.long.ld} must be used to generate the blacklist.
#' \item $data: The filtered tidy dataset.
#' \item $gds: the path to the GDS file.
#' }

#' @references Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS.
#' (2012) A high-performance computing toolset for relatedness and principal component
#' analysis of SNP data. Bioinformatics. 28: 3326-3328.
#' doi:10.1093/bioinformatics/bts606

#' @seealso \href{https://github.com/zhengxwen/SNPRelate}{SNPRelate}

#' @examples
#' \dontrun{
#' require(SNPRelate)
#' #To install SNPRelate:
#' install.packages("BiocManager")
#' BiocManager::install ("SNPRelate")
#' library(radiator)
#' data <- radiator::read_vcf(data = "my.vcf", strata = "my.strata.tsv", verbose = TRUE)
#'
#' # short distance LD, no long distance LD:
#' check.short.ld <- radiator::filter_ld(data = data, filter.short.ld = "mac")
#'
#' # short distance LD and long distance LD:
#' pruned.ld <- radiator::filter_ld(
#'         data = data,
#'         filter.short.ld = "mac",
#'         filter.long.ld = 0.8)
#'
#' # short distance LD and long distance LD, incorporating missing data:
#' pruned.ld <- radiator::filter_ld(
#'          data = data, # a GDS object generated by radiator
#'          filter.short.ld = "mac",
#'          filter.long.ld = 0.8,
#'          long.ld.missing = TRUE)
#'
#'
#' }


#' @author Thierry Gosselin \email{thierrygosselin@@icloud.com}


filter_ld <- function(
  data,
  interactive.filter = TRUE,
  filter.short.ld = "mac",
  filter.long.ld = NULL,
  parallel.core = parallel::detectCores() - 1,
  filename = NULL,
  verbose = TRUE,
  ...
) {


  # testing
  # interactive.filter = TRUE
  # data <- gds
  # filter.short.ld = "mac"
  # filter.long.ld = 0.3
  # filename = NULL
  # parallel.core = parallel::detectCores() - 1
  # ld.figures <- TRUE
  # long.ld.missing = TRUE
  # ld.method = "r2"
  # verbose = TRUE
  # path.folder <- NULL
  # parameters <- NULL
  # internal <- FALSE
  # subsample.markers.stats <- NULL
  # obj.keeper <- c(ls(envir = globalenv()), "data")

  if (!interactive.filter && is.null(filter.short.ld) && is.null(filter.long.ld)) {
    return(data)
  }

  if (interactive.filter) verbose <- TRUE
  radiator_function_header(f.name = "filter_ld", verbose = verbose)
  # Cleanup-------------------------------------------------------------------
  file.date <- format(Sys.time(), "%Y%m%d@%H%M")
  if (verbose) message("Execution date@time: ", file.date)
  old.dir <- getwd()
  opt.change <- getOption("width")
  options(width = 70)
  timing <- radiator_tic()
  #back to the original directory and options
  on.exit(setwd(old.dir), add = TRUE)
  on.exit(options(width = opt.change), add = TRUE)
  on.exit(radiator_toc(timing), add = TRUE)
  on.exit(radiator_function_header(f.name = "filter_ld", start = FALSE, verbose = verbose), add = TRUE)
  # on.exit(rm(list = setdiff(ls(envir = sys.frame(-1L)), obj.keeper), envir = sys.frame(-1L)))

  # Function call and dotslist -------------------------------------------------
  rad.dots <- radiator_dots(
    func.name = as.list(sys.call())[[1]],
    fd = rlang::fn_fmls_names(),
    args.list = as.list(environment()),
    dotslist = rlang::dots_list(..., .homonyms = "error", .check_assign = TRUE),
    keepers = c("long.ld.missing", "ld.method", "ld.figures",
                "subsample.markers.stats",
                "path.folder", "parameters", "internal"),
    verbose = FALSE
  )

  # Checking for missing and/or default arguments ------------------------------
  if (missing(data)) rlang::abort("data is missing")

  # Folders---------------------------------------------------------------------
  path.folder <- generate_folder(
    f = path.folder,
    rad.folder = "filter_ld",
    internal = internal,
    file.date = file.date,
    verbose = verbose)

  # write the dots file
  write_rad(
    data = rad.dots,
    path = path.folder,
    filename = stringi::stri_join("radiator_filter_ld_args_", file.date, ".tsv"),
    tsv = TRUE,
    write.message = "Function call and arguments stored in: ",
    internal = internal,
    verbose = verbose
  )

  # interactive.filter steps ---------------------------------------------------
  if (interactive.filter) {
    message("\nInteractive mode: on\n")
    message("Step 1. Short distance LD threshold selection")
    message("Step 2. Filtering markers based on short distance LD")
    message("Step 3. Long distance LD pruning selection")
    message("Step 4. Threshold selection")
    message("Step 5. Filtering markers based on long distance LD\n\n")
  }

  # match arg ------------------------------------------------------------------
  if (!is.null(filter.short.ld)) {
    if (filter.short.ld == "maf") {
      message("\n\nPlease update your code to use filter.short.ld = 'mac'\n\n")
      filter.short.ld <- "mac"
    }
    filter.short.ld <- match.arg(filter.short.ld, c("first", "random", "last", "middle", "mac"))
  }
  if (!is.null(ld.method)) {
    ld.method <- match.arg(ld.method, c("composite", "r", "r2","dprime", "corr"))
  }

  # Filename -------------------------------------------------------------------
  if (is.null(filename)) {
    write.ld <- FALSE
    filename <- stringi::stri_join("radiator_", file.date, ".ld")
  } else {
    write.ld <- TRUE
    filename.problem <- file.exists(filename)
    if (filename.problem) {
      filename <- stringi::stri_join(filename, "_", file.date, ".ld")
    } else {
      filename <- stringi::stri_join(filename, ".ld")
    }
  }

  filename.gds <- stringi::stri_join(filename, ".gds")
  filename.gds <- file.path(path.folder, filename.gds)
  filename <- file.path(path.folder, filename)

  # for tbl_df with long.ld filtering
  filename.gds.rad <- stringi::stri_join(filename.gds, ".rad")

  # Detect format --------------------------------------------------------------
  data.type <- radiator::detect_genomic_format(data)
  if (!data.type %in% c("tbl_df", "fst.file", "SeqVarGDSClass", "gds.file")) {
    rlang::abort("Input not supported for this function: read function documentation")
  }

  # Import data ---------------------------------------------------------------
  if (data.type %in% c("tbl_df", "fst.file")) {
    if (is.vector(data)) data <- radiator::tidy_wide(data = data, import.metadata = FALSE)
    data.type <- "tbl_df"

    wl <- bl <- dplyr::select(data, MARKERS, CHROM, LOCUS, POS) %>%
      dplyr::distinct(MARKERS, .keep_all = TRUE) %>%
      dplyr::arrange(LOCUS, MARKERS)
  }

  # GDS
  if (data.type %in% c("SeqVarGDSClass", "gds.file")) {
    radiator_packages_dep(package = "SeqArray", cran = FALSE, bioc = TRUE)

    if (data.type == "gds.file") {
      data <- radiator::read_rad(data, verbose = verbose)
      data.type <- "SeqVarGDSClass"
    }

    wl <- bl <- extract_markers_metadata(data, whitelist = TRUE) %>%
      dplyr::arrange(LOCUS, MARKERS)
  }

  # radiator_parameters: initiate ------------------------------------------
  filters.parameters <- radiator_parameters(
    generate = TRUE,
    initiate = TRUE,
    update = FALSE,
    parameter.obj = parameters,
    data = data,
    path.folder = path.folder,
    file.date = file.date,
    internal = internal,
    verbose = verbose)

  # Short LD -------------------------------------------------------------------
  if (interactive.filter || !is.null(filter.short.ld)) {
    if (verbose) message("Minimizing short distance LD...")
    locus.stats <- dplyr::group_by(.data = wl, LOCUS) %>%
      dplyr::summarise(SNP_N = n()) %>%
      dplyr::count(SNP_N) %>%
      readr::write_tsv(
        x = ., file = file.path(path.folder, "short.ld.locus.stats.tsv"),
        append = FALSE, col_names = TRUE)

    if (nrow(locus.stats) > 1) {
      range.number.snp.locus <- range(locus.stats$SNP_N, na.rm = TRUE)
      if (verbose) message("    The range in the number of SNP/locus is: ", stringi::stri_join(range.number.snp.locus, collapse = "-"))

      if (interactive.filter) {
        if (verbose) message("\nStep 1. Short distance LD threshold selection")
        if (verbose) message("the goal is to keep only 1 SNP per read/locus")
        # short.ld.fig <- tibble::tibble(
        #   GROUP = c("read", "read", "read", "first", "middle", "last"),
        #   SNP = c(15, 60, 90, 15, 60, 90)) %>%
        #   dplyr::mutate(
        #     GROUP = factor(
        #       x = GROUP,
        #       levels = c("last", "middle", "first", "read"),
        #       ordered = TRUE)
        #   ) %>%
        #   ggplot2::ggplot(
        #     data = .,
        #     ggplot2::aes(x = SNP, y = GROUP, colour = GROUP)) +
        #   ggplot2::geom_point(size = 5, shape = 19) +
        #   ggplot2::geom_segment(ggplot2::aes(x = 1, y = 1, xend = 100, yend = 1), colour = "black", size = 0.3) +
        #   ggplot2::geom_segment(ggplot2::aes(x = 1, y = 2, xend = 100, yend = 2), colour = "black", size = 0.3) +
        #   ggplot2::geom_segment(ggplot2::aes(x = 1, y = 3, xend = 100, yend = 3), colour = "black", size = 0.3) +
        #   ggplot2::geom_segment(ggplot2::aes(x = 1, y = 4, xend = 100, yend = 4), colour = "black", size = 1) +
        #   ggplot2::scale_x_continuous(name = "Read length (bp)",
        #                               breaks = 1:100, limits = c(1, 100)) +
        #   ggplot2::labs(
        #     title = "RAD short linkage disequilibrium filtering options",
        #     y = "filter.short.ld values") +
        #   ggplot2::theme_minimal() +
        #   ggplot2::theme(
        #     # panel.grid.major.x = ggplot2::element_blank(),
        #     plot.title = ggplot2::element_text(size = 12, family = "Helvetica", face = "bold", hjust = 0.5),
        #     # axis.line.x = ggplot2::element_line(colour = "black", size = 2),
        #     axis.title.x = ggplot2::element_text(size = 12, family = "Helvetica", face = "bold"),
        #     axis.text.x = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold", hjust = 1, vjust = 0.5, angle = 90),
        #     axis.title.y = ggplot2::element_blank(),
        #     # axis.title.y = ggplot2::element_text(size = 12, family = "Helvetica", face = "bold"),
        #     # axis.ticks.y = ggplot2::element_blank(),
        #     panel.grid.minor.x = ggplot2::element_blank(),
        #     panel.grid.major.y = ggplot2::element_blank(),
        #     legend.position = "none"
        #     # axis.text.y = ggplot2::element_text(size = 8, family = "Helvetica")
        #   )
        # print(short.ld.fig)
      }
      # Short LD threshold selection -----------------------------------------
      if (interactive.filter) {
        filter.short.ld <- radiator_question(
          x = "Choose the filter.short.ld threshold\nOptions include:\n1: mac (Not sure ? use mac...)\n2: random\n3: first\n4: middle\n5: last",
          answer.opt = c("1", "2", "3", "4", "5"))
        filter.short.ld <- stringi::stri_replace_all_fixed(
          str = filter.short.ld,
          pattern = c("1", "2", "3", "4", "5"),
          replacement = c("mac", "random", "first", "middle", "last"),
          vectorize_all = FALSE)
        filter.short.ld <- match.arg(filter.short.ld, c("mac", "random", "first", "middle", "last"))
      }


      # Random selection ---------------------------------------------------------
      if (verbose) message("\nStep 2. Filtering markers based on short distance LD")
      if (verbose) message("filter.short.ld = ", filter.short.ld)

      if (filter.short.ld == "random") {
        wl %<>%
          dplyr::group_by(LOCUS) %>%
          dplyr::sample_n(tbl = ., size = 1, replace = FALSE) %>%
          dplyr::ungroup(.)
      }#End snp random

      # Fist SNP on the read -----------------------------------------------------
      if (filter.short.ld == "first") {
        wl %<>%
          dplyr::group_by(LOCUS) %>%
          dplyr::summarise(POS = min(POS)) %>%
          dplyr::ungroup(.)
      }#End snp first

      # Last SNP on the read -----------------------------------------------------
      if (filter.short.ld == "last") {
        wl %<>%
          dplyr::group_by(LOCUS) %>%
          dplyr::summarise(POS = max(POS)) %>%
          dplyr::ungroup(.)
      }#End snp last

      # Middle SNP on the read -----------------------------------------------------
      if (filter.short.ld == "middle") {
        snp.locus.prep <- dplyr::group_by(.data = wl, LOCUS) %>%
          dplyr::tally(.) %>%
          dplyr::ungroup(.)

        pick.middle <- snp.locus.prep %>%
          dplyr::filter(n > 2) %>%
          dplyr::select(LOCUS)

        if (nrow(pick.middle) == 0) {
          if (verbose) message("IMPORTANT: the data doesn't have more than 3 SNPs per locus")
          if (verbose) message("    First SNP will be selected instead...")
          wl %<>%
            dplyr::group_by(LOCUS) %>%
            dplyr::summarise(POS = min(POS)) %>%
            dplyr::ungroup(.)

        } else {

          # For locus with <= 2 SNPs/read just keep the first one.
          keep.first <- snp.locus.prep %>%
            dplyr::filter(n <= 2) %>%
            dplyr::select(LOCUS)
          if (verbose) message("    Number of locus with first SNP selected: ", nrow(keep.first))
          keep.first.select <- wl %>%
            dplyr::filter(LOCUS %in% keep.first$LOCUS) %>%
            dplyr::group_by(LOCUS) %>%
            dplyr::summarise(POS = min(POS)) %>%
            dplyr::ungroup(.)

          pick.middle.select <- wl %>%
            dplyr::filter(LOCUS %in% pick.middle$LOCUS) %>%
            dplyr::group_by(LOCUS) %>%
            dplyr::filter(POS != min(POS)) %>% # remove the first SNP
            dplyr::filter(POS != max(POS)) %>% # remove the last SNP
            dplyr::sample_n(tbl = ., size = 1, replace = FALSE) # pick one at random

          if (verbose) message("    Number of locus with random middle SNP selected: ", nrow(pick.middle))
          wl <- dplyr::bind_rows(keep.first.select, pick.middle.select) %>%
            dplyr::arrange(LOCUS, POS)
        }
        pick.middle <- snp.locus.prep <- keep.first.select <- pick.middle.select <- keep.first <- NULL
      }#End snp middle

      # SNP with max MAC on the read -----------------------------------------------
      if (filter.short.ld == "mac") {

        # n.markers <- dplyr::n_distinct(data$MARKERS)
        if (data.type == "tbl_df") {
          # one.snp: markers that doesnt require filtering for MAF/MAC
          # because only 1 SNP/locus
          one.snp <- dplyr::group_by(.data = wl, LOCUS) %>%
            dplyr::tally(.) %>%
            dplyr::filter(n == 1) %>%
            dplyr::left_join(wl, by = "LOCUS") %>%
            dplyr::select(-n) %>%
            dplyr::distinct(MARKERS, .keep_all = TRUE)


          # calculate GLOBAL mac per SNP/LOCUS
          if (rlang::has_name(data, "GT_BIN")) {
            global_mac <- carrier::crate(function(x) {
              `%>%` <- magrittr::`%>%`
              `%<>%` <- magrittr::`%<>%`

              mac.data <- dplyr::group_by(x, MARKERS) %>%
                dplyr::summarise(
                  PP = as.numeric(2 * length(GT_BIN[GT_BIN == 0])),
                  PQ = as.numeric(length(GT_BIN[GT_BIN == 1])),
                  QQ = as.numeric(2 * length(GT_BIN[GT_BIN == 2]))
                ) %>%
                # need this step because seen cases where 2 is not the minor allele...
                dplyr::mutate(
                  PP = PP + PQ,
                  QQ = QQ + PQ,
                  PQ = NULL,
                  MAC_GLOBAL = dplyr::if_else(PP < QQ, PP, QQ),
                  PP = NULL, QQ = NULL) %>%
                dplyr::ungroup(.)
              return(mac.data)
            })#End global_maf

            mac.data <- dplyr::filter(data, !is.na(GT_BIN)) %>%
              dplyr::filter(!MARKERS %in% one.snp$MARKERS)

            if (length(unique(mac.data$MARKERS)) > 30000) {
              mac.data <- radiator_future(
                .x = mac.data,
                .f = global_mac,
                flat.future = "dfr",
                split.vec = TRUE,
                split.with = "MARKERS",
                split.chunks = NULL,
                max.chunks = 10L,
                parallel.core = parallel.core
              )
            } else {
              mac.data <- global_mac(x = mac.data)
            }
          } else {
            mac.data <- data %>%
              dplyr::filter(GT != "000000") %>%
              dplyr::filter(!MARKERS %in% one.snp$MARKERS) %>%
              dplyr::select(MARKERS, INDIVIDUALS, GT) %>%
              dplyr::mutate(
                A1 = stringi::stri_sub(GT, 1, 3),
                A2 = stringi::stri_sub(GT, 4, 6),
                GT = NULL
              ) %>%
              radiator::rad_long(
                x = .,
                cols = c("MARKERS", "INDIVIDUALS"),
                names_to = "ALLELES",
                values_to = "GT",
                variable_factor = FALSE
                ) %>%
              dplyr::count(MARKERS, GT) %>%
              dplyr::group_by(MARKERS) %>%
              dplyr::filter(n == min(n)) %>%
              dplyr::distinct(MARKERS, .keep_all = TRUE) %>%
              dplyr::summarise(MAC_GLOBAL = n, .groups = "drop") %>%
              dplyr::select(MARKERS, MAC_GLOBAL)
          }

          wl %<>%
            dplyr::filter(!MARKERS %in% one.snp$MARKERS) %>%
            dplyr::left_join(mac.data, by = "MARKERS") %>%
            dplyr::group_by(LOCUS) %>%
            dplyr::filter(MAC_GLOBAL == max(MAC_GLOBAL)) %>%
            dplyr::ungroup(.) %>%
            dplyr::select(-MAC_GLOBAL) %>%
            dplyr::distinct(LOCUS, .keep_all = TRUE) %>%
            dplyr::bind_rows(one.snp)
          one.snp <- mac.data <- NULL


        } else {
          n.markers <- nrow(wl)
          if (!rlang::has_name(wl, "MAC_GLOBAL")) {
            wl  %<>%
              dplyr::bind_cols(
                SeqArray::seqAlleleCount(
                  gdsfile = data,
                  ref.allele = NULL,
                  parallel = parallel.core) %>%
                  unlist(.) %>%
                  matrix(
                    data = .,
                    nrow = n.markers, ncol = 2, byrow = TRUE,
                    dimnames = list(rownames = wl$MARKERS,
                                    colnames = c("REF_COUNT", "ALT_COUNT"))) %>%
                  tibble::as_tibble(.)) %>%
              dplyr::mutate(
                # MAC or MAF here it's the same
                MAC_GLOBAL = dplyr::if_else(ALT_COUNT < REF_COUNT, ALT_COUNT, REF_COUNT),
                ALT_COUNT = NULL, REF_COUNT = NULL
              )
          }

          wl  %<>%
            dplyr::group_by(LOCUS) %>%
            dplyr::filter(MAC_GLOBAL == max(MAC_GLOBAL)) %>%
            dplyr::ungroup(.) %>%
            dplyr::distinct(LOCUS, .keep_all = TRUE)
        }
      }#End snp mac

      # Whitelist and Blacklist of markers
      readr::write_tsv(
        x = wl, file = file.path(path.folder, "whitelist.short.ld.tsv"),
        append = FALSE, col_names = TRUE)

      bl %<>%
        # dplyr::setdiff(wl) %>%
        dplyr::filter(!VARIANT_ID %in% wl$VARIANT_ID) %>%
        dplyr::mutate(FILTER = "filter.short.ld")

      readr::write_tsv(
        x = bl,
        file = file.path(path.folder, "blacklist.short.ld.tsv"),
        append = FALSE, col_names = TRUE)
      if (verbose) message("File written: whitelist.short.ld.tsv")
      if (verbose) message("File written: blacklist.short.ld.tsv")

      # filtering data to minimize LD -----------------------------------------
      if (data.type == "tbl_df") {
        data <- dplyr::filter(data, MARKERS %in% wl$MARKERS)
      } else {
        markers.meta <- extract_markers_metadata(gds = data) %>%
          dplyr::mutate(
            FILTERS = dplyr::if_else(VARIANT_ID %in% bl$VARIANT_ID, "filter.short.ld", FILTERS
            )
          )

        # updating the GDS object
        update_radiator_gds(
          gds = data,
          node.name = "markers.meta",
          value = markers.meta,
          sync = TRUE
        )
      }
    } else {
      if (verbose) message("\nThere is no variation in the number of SNP/locus across the data\n")
    }

    # Note to myself:
    # locus.stats: this stats could be written in directory or output in res
    locus.stats <- mac.data <- bl <- wl <- short.ld.fig <- NULL


    # radiator_parameters --------------------------------------------------------
    filters.parameters <- radiator_parameters(
      generate = FALSE,
      initiate = FALSE,
      update = TRUE,
      parameter.obj = filters.parameters,
      data = data,
      filter.name = "Filter short ld",
      param.name = "filter.short.ld",
      values = filter.short.ld,
      path.folder = path.folder,
      file.date = file.date,
      internal = internal,
      verbose = verbose)

    # results short LD -------------------------------------------------------
    radiator_results_message(
      rad.message = stringi::stri_join("\nFilter short ld threshold: ", filter.short.ld),
      filters.parameters,
      internal,
      verbose
    )

  } #End short ld

  # Long distance LD pruning ---------------------------------------------------
  long.ld <- "y"
  if (interactive.filter) {
    long.ld <- radiator_question(
      x = "\nDo you want to continue filtering using long distance ld  ? (y/n):",
      answer.opt = c("y", "n"))
    if (long.ld == "n") {
      return(data)
    }
  }
  if (interactive.filter || !is.null(filter.long.ld)) {
    ref.genome <- detect_ref_genome(data = data, verbose = FALSE)

    if (interactive.filter) {
      message("\nStep 3. Long distance LD pruning selection")
      if (!ref.genome) {
        message("Using missingness to select SNPs in LD is still under construction with de novo data")
        message("Basic pruning using SNPRelate is used")
        # message("\nNote: with de novo data, using LD with missing stats can take longer to compute")
        # message(    "e.g. with 20 000 SNPs, generating the pairwise matrix of LD can take several hours")
        long.ld.missing <- FALSE
      } else {
        message("With a reference genome, pruning is done by chromosome/scaffolds")
        message("Pruning method can randomly choose to keep 1 SNP or")
        message("select the SNP based on missing data statistics")
        long.ld.missing <- radiator_question(
          x = "\nDo you want to use missing data statistics ? (y/n):",
          answer.opt = c("y", "n"))
        if (long.ld.missing == "y") {
          long.ld.missing <- TRUE
        } else {
          long.ld.missing <- FALSE
        }
      }
    }


    if (data.type == "tbl_df") {# for long distance LD pruning
      # Check if data is biallelic
      biallelic <- radiator::detect_biallelic_markers(data = data)
      if (!biallelic) rlang::abort("Long distance LD: biallelic genotypes required")

      # Generating SNPRelate data
      if (verbose) message("\nPreparing the data long LD filtering...")
      # res.ld$data.gds <- radiator::write_gds(
      data.gds <- radiator::write_gds(
        data = data,
        filename = filename,
        verbose = FALSE)

      markers$VARIANT_ID <- markers$MARKERS
      data$VARIANT_ID <- data$MARKERS

      if (verbose) message("SNPRelate GDS file generated: ", filename.gds)
      if (verbose) message("To close the connection use SNPRelate::snpgdsClose(filename)")
    } #End tidy data

    wl <- extract_markers_metadata(data, whitelist = TRUE)
    n.chrom <- length(unique(wl$CHROM))

    if (ref.genome) {
      denovo <- FALSE
      chrom.tick <- dplyr::distinct(wl, CHROM) %>%
        dplyr::mutate(
          CHROM_TICK = stringi::stri_join(seq(from = 1, to = n(), by = 1), n(), sep = "/")
        )

      ld.sample <- dplyr::sample_frac(tbl = chrom.tick, size = 0.2) %>%
        dplyr::select(CHROM) %>%
        purrr::flatten_chr(.)

      wl %<>%
        dplyr::left_join(chrom.tick, by = "CHROM") %>%
        dplyr::mutate(
          LD_SUBSAMPLE = dplyr::if_else(
            CHROM %in% ld.sample, TRUE, FALSE)
        )

      chrom.tick <- chrom.tick$CHROM
      n.markers <- nrow(wl)
      wl %<>%
        dplyr::arrange(CHROM, VARIANT_ID) %>%
        dplyr::mutate(CHROM = factor(x = CHROM, levels = chrom.tick, ordered = TRUE)) %>%
        dplyr::arrange(CHROM)
    } #End ref.genome approach

    # reset bl
    bl <- wl
    ld.sample <- NULL

    # LONG LD with MISSING -----------------------------------------------------
    if (long.ld.missing) {
      if (verbose) message("\nLong distance LD pruning with missing data")

      wl.bl.ld <- ld_missing(
        wl = wl,
        data = data,
        ld.threshold = if (interactive.filter) {
          seq(0.1, 0.9, by = 0.1)
        } else {
          filter.long.ld
        },
        denovo = denovo,
        ld.method = ld.method,
        ld.figures = ld.figures,
        parallel.core = parallel.core,
        verbose = verbose,
        path.folder = path.folder
      )
      # if (verbose) message("Several whitelists and blacklists were generated")

      if (interactive.filter) {
        if (verbose) message("\nStep 4. Threshold selection")
        if (verbose) message("Look at the boxplot, a threshold of 0.2 will blacklist more markers than a threshold of 0.8")
        filter.long.ld <- radiator_question(
          x = "\nEnter the long LD threshold (filter.long.ld threshold, double/proportion):", minmax = c(0,1))

      }

      if (interactive.filter) message("\nStep 5. Filtering markers based on long distance LD")
      wl.bl.ld <- magrittr::extract2(wl.bl.ld, as.name(filter.long.ld))
      # test <- magrittr::extract2(wl.bl.ld, as.name(filter.long.ld))
      wl <- wl.bl.ld %$% wl
      bl <- wl.bl.ld %$% bl

      # updating the GDS object
      if (data.type == "tbl_df") {
        data <- dplyr::filter(data, MARKERS %in% wl$MARKERS)
      } else {
        markers.meta <- extract_markers_metadata(gds = data) %>%
          dplyr::mutate(
            FILTERS = dplyr::if_else(VARIANT_ID %in% bl$VARIANT_ID, "filter.long.ld", FILTERS
            )
          )

        # test <- markers.meta %>%
        #   dplyr::filter(VARIANT_ID %in% bl$VARIANT_ID)

        # updating the GDS object
        update_radiator_gds(
          gds = data,
          node.name = "markers.meta",
          value = markers.meta,
          sync = TRUE
        )
      }

    }# End long.ld.missing

    # Pruning with SNPRelate::snpgdsLDpruning --------------------------------------
    if (!long.ld.missing) {
      # SNPs are randomly selected...
      if (verbose) message("\nLong distance LD pruning WITHOUT missing data stats")

      # if (verbose) message("")
      if (is.null(subsample.markers.stats)) {
        subsample.markers.stats <- 0.2
      } else {
        if (!is.double(subsample.markers.stats)) subsample.markers.stats <- 0.2
      }

      # Boxplot
      bp <- ld_boxplot(
        gds = data,
        subsample.markers.stats = subsample.markers.stats,
        ld.method = ld.method,
        path.folder = path.folder,
        parallel.core = parallel.core,
        verbose = verbose
      )
      bp <- NULL

      if (interactive.filter) {
        if (verbose) message("\nStep 4. Threshold selection")
        if (verbose) message("Look at the boxplot, a threshold of 0.2 blacklist more markers than 0.8")
        filter.long.ld <- radiator_question(
          x = "\nEnter the long LD threshold (filter.long.ld threshold, double/proportion):", minmax = c(0,1))
      }

      if (ld.method == "r2") {
        ld.m <- "r"
        # filter.long.ld <- sqrt(filter.long.ld)
      } else {
        ld.m <- ld.method
      }

      id <- extract_individuals_metadata(
        gds = data,
        ind.field.select = "INDIVIDUALS",
        whitelist = TRUE) %$%
        INDIVIDUALS

      if (verbose) message("Pruning with SNPRelate...")
      timing <- proc.time()# for timing
      wl.variant.id <- SNPRelate::snpgdsLDpruning(
        gdsobj = data,
        snp.id = wl$VARIANT_ID,
        sample.id = id,
        autosome.only = FALSE,
        remove.monosnp = TRUE,
        maf = NaN,
        missing.rate = NaN,
        method = ld.m,
        ld.threshold = filter.long.ld,
        num.thread = 1L,# more not supported
        verbose = TRUE) %>%
        unlist(.)
      timing <- proc.time() - timing
      if (verbose) message("LD pruning computation time: ", round(timing[[3]]), " sec")

      wl.n <- length(wl.variant.id)
      message("Number of markers whitelised: ", wl.n)

      wl %<>% dplyr::filter(VARIANT_ID %in% wl.variant.id)
      bl %<>% dplyr::setdiff(wl) %>% dplyr::mutate(FILTERS = "filter.long.ld")
      write_rad(
        data = wl,
        path = path.folder,
        filename = "whitelist.long.ld.tsv",
        tsv = TRUE,
        internal = FALSE,
        verbose = verbose
      )
      write_rad(
        data = bl,
        path = path.folder,
        filename = "blacklist.long.ld.tsv",
        tsv = TRUE,
        internal = FALSE,
        verbose = verbose
      )
      wl.variant.id <- NULL

      # updating the GDS object
      if (data.type == "tbl_df") {
        data <- dplyr::filter(data, MARKERS %in% wl$MARKERS)
      } else {
        markers.meta <- extract_markers_metadata(gds = data) %>%
          dplyr::mutate(
            FILTERS = dplyr::if_else(
              VARIANT_ID %in% bl$VARIANT_ID, "filter.long.ld", FILTERS
            )
          )

        # updating the GDS object
        update_radiator_gds(
          gds = data,
          node.name = "markers.meta",
          value = markers.meta,
          sync = TRUE
        )
      }
    }# End ld pruning with SNPRelate

    # radiator_parameters --------------------------------------------------------
    filters.parameters <- radiator_parameters(
      generate = FALSE,
      initiate = FALSE,
      update = TRUE,
      parameter.obj = filters.parameters,
      data = data,
      filter.name = "Filter long ld",
      param.name = paste0("filter.long.ld / long.ld.missing"),
      values = stringi::stri_join(filter.long.ld, long.ld.missing,
                                  sep = " / ", ignore_null = FALSE),
      path.folder = path.folder,
      file.date = file.date,
      internal = internal,
      verbose = verbose)

    # results long LD --------------------------------------------------------
    radiator_results_message(
      rad.message = stringi::stri_join("\nFilter long ld threshold: ", filter.long.ld),
      filters.parameters,
      internal,
      verbose
    )
  }#End long distance LD pruning
  return(data)
}#End filter_ld


# Internal nested functions: ---------------------------------------------------
# boxplot of LD using subsampled markers ----------------------------
#' @name ld_boxplot
#' @title ld_boxplot
#' @description Generate a boxplot of LD using subsampled markers.
#' Return the matrix of LD.
#' Write to directory the stats and boxplot.
# @rdname ld_boxplot
#' @export
#' @keywords internal
ld_boxplot <- function(
  gds,
  subsample.markers.stats = 0.2,
  ld.method = "r2",
  path.folder = NULL,
  parallel.core = parallel::detectCores() - 2,
  verbose = TRUE
) {

  if (is.null(path.folder)) path.folder <- getwd()

  variant.id.bk <- extract_markers_metadata(
    gds = gds,
    markers.meta.select = "VARIANT_ID",
    whitelist = TRUE
  ) %$%
    VARIANT_ID

  n.snp <- length(variant.id.bk)
  if (subsample.markers.stats != 1) {
    subsample.markers.n <- floor(subsample.markers.stats * n.snp)
    message("Using a subset of the markers: ", subsample.markers.n)
    if (verbose) message("to change the default proportion: subsample.markers.stats argument")
    variant.sub <- sample(
      x = variant.id.bk,
      size = min(n.snp, subsample.markers.n))
    bp.filename <- "snp.long.ld.boxplot.subsample.pdf"
    ld.summary.filename <- "ld.summary.stats.subsample.tsv"
  } else {
    variant.sub <- variant.id.bk
    bp.filename <- "snp.long.ld.boxplot.pdf"
    ld.summary.filename <- "ld.summary.stats.tsv"
  }
  n.snp.sub <- length(variant.sub)

  id <- extract_individuals_metadata(
    gds = gds,
    ind.field.select = "INDIVIDUALS",
    whitelist = TRUE
  ) %$%
    INDIVIDUALS

  # R2 is handled in radiator, not SNPRelate...
  if (ld.method == "r2") {
    ld.m <- "r"
  } else {
    ld.m <- ld.method
  }

  # SNPRelate doesnt like when lower than number of markers used...
  parallel.core.temp <- max(1L, n.snp.sub)
  if (parallel.core <= parallel.core.temp) {
    parallel.core.temp <- parallel.core
  }

  if (verbose) message("\nLinkage Disequilibrium (LD) estimation on genotypes")
  timing <- proc.time()# for timing
  ld.res <- SNPRelate::snpgdsLDMat(
    gdsobj = gds,
    snp.id = variant.sub,
    sample.id = id,
    slide = -1,
    mat.trim = FALSE,
    method = ld.m,
    num.thread = parallel.core.temp,
    with.id = TRUE,
    verbose = FALSE) %$%
    LD %>%
    magrittr::set_colnames(x = ., variant.sub) %>%
    magrittr::set_rownames(x = ., variant.sub)
  timing <- proc.time() - timing
  if (verbose) message("LD computation time: ", round(timing[[3]]), " sec")

  # this is necessary...
  SeqArray::seqSetFilter(
    object = gds,
    variant.id = variant.id.bk,
    action = "set",
    verbose = FALSE)

  if (verbose) message("Generating statistics for boxplot")
  # Fill with NA the diagonal and the lower triangle...
  ld.res[lower.tri(ld.res, diag = TRUE)] <- rlang::na_dbl

  # check if all missing (yep seen it...)
  all.missing <- all(is.na(ld.res))
  if (all.missing) {
    message("\n\nProblem: LD matrix missing all LD")
    return(ld.res)
  }

  # absolute values
  ld.res <- abs(ld.res)

  if (ld.method == "r2") {
    ld.res <- ld.res^2 #r^2 used here...
  }

  # Blox plot prep
  ld.res.bk <- ld.res # keep a bk and this is the data returned for now
  # ld.res <- ld.res.bk

  # generate a vector
  ld.res <- as.vector(ld.res) %>% magrittr::extract(!is.na(.))
  ld.res <- 1 - ld.res
  # stats
  ld.summary <- tibble_stats(x = ld.res, group = "Long LD")
  outlier.ld <- round(ld.summary$OUTLIERS_HIGH, 2)

  # stats without outlier high
  ld.summary %<>%
    dplyr::bind_rows(
      tibble_stats(
        x = ld.res[ld.res < outlier.ld],
        group = stringi::stri_join("minus outliers high: ", round(ld.summary$OUTLIERS_HIGH, 2))
      )
    )
  ld.res <- NULL

  # write the dots file
  write_rad(
    data = ld.summary,
    path = path.folder,
    filename = ld.summary.filename,
    tsv = TRUE,
    internal = FALSE,
    verbose = verbose
  )

  if (ld.method == "r2") {
    ld.title <- expression(paste("Long distance linkage disequilibrium (", r^2, ")"))
  }

  if (ld.method == "r") {
    ld.title <- "Long distance linkage disequilibrium (r)"
  }

  if (ld.method == "dprime") {
    ld.title <- "Long distance linkage disequilibrium (D')"
  }

  if (ld.method == "corr") {
    ld.title <- "Long distance linkage disequilibrium (corr)"
  }

  if (ld.method == "composite") {
    ld.title <- "Long distance linkage disequilibrium (composite)"
  }

  if (verbose) message("Generating boxplot")

  ld.boxplot <- boxplot_stats(
    data = ld.summary,
    title = "Markers long distance linkage disequilibrium (LD)",
    subtitle = paste0("Outlier high: ", round(ld.summary$OUTLIERS_HIGH, 2)),
    x.axis.title = NULL,
    y.axis.title = ld.title,
    facet.columns = TRUE,
    path.folder = path.folder,
    bp.filename = bp.filename)
  return(ld.res.bk)
}#End ld_boxplot

# melt the LD matrice into a data frame --------------------------------------
#' @name ld2df
#' @title ld2df
#' @description melt the LD matrice into a data frame
# @rdname ld2df
#' @export
#' @keywords internal
ld2df <- function(x) {
  x <- as.matrix(x)
  x <- dplyr::bind_cols(tibble::tibble(MARKERS_A = rownames(x)),
                        tibble::as_tibble(x)) %>%
    radiator::rad_long(
      x = .,
      cols = "MARKERS_A",
      names_to = "MARKERS_B",
      values_to = "LD",
      variable_factor = FALSE
      ) %>%
    dplyr::filter(!is.na(LD)) %>%
    dplyr::arrange(dplyr::desc(LD))
  return(x)
}#End ld2df

# ld_pruning  ------------------------------------------------------------------

#' @name ld_pruning
#' @title Prune dataset based on LD.
#' @description Used internally in \href{https://github.com/thierrygosselin/radiator}{radiator}
#' Prune dataset based on LD.

#' @param ld.tibble (path) The markers LD pairwise data.
#' Default: \code{ld.tibble = NULL}.
#' @param stats (path) The markers missingness info statistics.
#' Default: \code{stats = NULL}.
#' @param ld.threshold (double) The threshold to prune SNPs in LD.
#' Default: \code{ld.threshold = 0.8}.
#' @param verbose (logical, optional) Default: \code{verbose = TRUE}.
#' @return A list with blacklisted SNPs.  Write the blacklist in the working
#' directory.
#' @export
#' @keywords internal
#' @rdname ld_pruning
#' @author Thierry Gosselin \email{thierrygosselin@@icloud.com}

ld_pruning <- function(
  ld.tibble = NULL,
  stats = NULL,
  ld.threshold = 0.8,
  verbose = TRUE
) {
  ld.tibble <- dplyr::filter(ld.tibble, LD > ld.threshold)

  if (nrow(ld.tibble) > 0) {
    markers.ld.list <- tibble::tibble(
      MARKERS = c(ld.tibble$MARKERS_A, ld.tibble$MARKERS_B)) %>%
      dplyr::count(MARKERS) %>%
      # dplyr::group_by(MARKERS) %>%
      # dplyr::tally(.) %>%
      # dplyr::ungroup(.) %>%
      dplyr::arrange(dplyr::desc(n)) %>%
      dplyr::distinct(MARKERS) %>%
      purrr::flatten_chr(.)

    # we want to have them ordered from highest to lowest hence the approach above...
    geno.stats <- dplyr::filter(stats, MARKERS %in% markers.ld.list)
    whitelist.markers <- blacklist.markers <- tibble::tibble(MARKERS = character(0))

    for (i in markers.ld.list) {
      # i <- markers.ld.list[1]
      dups <- dplyr::filter(ld.tibble, MARKERS_A %in% i | MARKERS_B %in% i)
      dups <- sort(unique(c(dups$MARKERS_A, dups$MARKERS_B)))

      # find all duplicates associated with the network
      new.dups <- 0L
      while(length(new.dups) > 0) {
        new.dups <- dplyr::filter(ld.tibble, MARKERS_A %in% dups | MARKERS_B %in% dups)
        new.dups <- sort(unique(c(new.dups$MARKERS_A, new.dups$MARKERS_A)))
        new.dups <- purrr::keep(.x = new.dups, .p = !new.dups %in% dups)
        if (length(new.dups) > 0) {
          dups <- c(dups, new.dups)
        }
      }
      dups <- tibble::tibble(MARKERS = dups)

      if (nrow(blacklist.markers) > 0) {
        dups <- dplyr::filter(dups, !MARKERS %in% blacklist.markers$MARKERS)
      }

      if (nrow(dups) > 0) {
        wm <- dups %>%
          dplyr::left_join(geno.stats, by = "MARKERS") %>%
          dplyr::filter(GENOTYPED_PROP == max(GENOTYPED_PROP)) %>%
          dplyr::sample_n(tbl = ., size = 1) %>% # make sure only 1 is selected
          dplyr::select(MARKERS)

        if (nrow(wm) > 0) whitelist.markers %<>% dplyr::bind_rows(wm)

        bm <- dplyr::filter(dups, !MARKERS %in% wm$MARKERS) %>%
          dplyr::select(MARKERS)

        if (nrow(bm) > 0) blacklist.markers %<>% dplyr::bind_rows(bm)
      }
    }
    dups <- bm <- wm <- i <- new.dups <- NULL
    blacklist.markers <- dplyr::distinct(blacklist.markers, MARKERS)
    whitelist.markers <- NULL
    if (verbose) message("    LD threshold: ", ld.threshold, ", SNPs blacklisted: ", nrow(blacklist.markers))
  } else {
    if (verbose) message("    LD threshold: ", ld.threshold, ", SNPs blacklisted: 0")
    blacklist.markers <- tibble::tibble(MARKERS = character(0))
  }

  blacklist.markers %<>% dplyr::mutate(LD_THRESHOLD = ld.threshold)
  return(blacklist.markers)
} # End ld_pruning


# ld_missing--------------------------------------------------------------------

#' @name ld_missing
#' @title Prune dataset based on LD and missingness.
#' @description Used internally in \href{https://github.com/thierrygosselin/radiator}{radiator}
#' Prune dataset based on LD, uses missingness to keep 1 SNP. This is the function
#' that does it for 1 chrom. It's then used with purrr::map to run seriall on all chrom.
#' @param wl The split dataset by chromosome.
#' @param data The GDS object.
# @param data.type The type of dataset.
#' @param ld.threshold The LD threshold.
#' @param denovo de novo data or not ? (logical).
#' @param ld.method The method to compute LD.
#' @param ld.figures (logical, optional) Default: \code{ld.figures = TRUE}.
#' @param path.folder (character) The path to the folder.
#' @param parallel.core The number of CPU.
#' @param verbose (logical, optional) Default: \code{verbose = TRUE}.
#' @return A list with whitelists and blacklists of markers in LD.
#' @export
#' @keywords internal
#' @author Thierry Gosselin \email{thierrygosselin@@icloud.com}
ld_missing <- function(
  wl,
  data,
  ld.threshold = seq(0.1, 0.9, by = 0.1),
  denovo = NULL,
  ld.method = "r2",
  ld.figures = TRUE,
  path.folder = NULL,
  parallel.core = parallel::detectCores() - 1,
  verbose = TRUE
) {

  if (is.null(denovo)) denovo <- TRUE
  if (is.null(path.folder)) path.folder <- getwd()
  n.chrom <- length(unique(wl$CHROM))

  #Function required------------------------------------------------------------
  ld_m <- function(
    x,
    data,
    denovo = NULL,
    ld.threshold = seq(0.1, 0.9, by = 0.1),
    ld.method = "r2",
    ld.figures = TRUE,
    parallel.core = parallel::detectCores() - 1,
    verbose = TRUE
  ) {
    # x <- chrom.ld[[1]]
    ld.markers <- list()# to store results
    chrom.name <- unique(x$CHROM)
    chrom.snp.n <- length(x$VARIANT_ID)
    chrom.tick <- unique(x$CHROM_TICK)
    if (is.null(denovo)) denovo <- TRUE

    if (verbose && length(chrom.name) == 1) {
      if (denovo) {
        message("Chrom: ", chrom.name, " SNPs number: ", chrom.snp.n, "    (", chrom.tick, ") ", "fake data...")
      } else {
        message("Chrom: ", chrom.name, " SNPs number: ", chrom.snp.n, "    (", chrom.tick, ")")
      }
    }


    # save the filters------------------------------------------------------
    w.m <- SeqArray::seqGetData(gdsfile = data, var.name = "variant.id")
    w.s <- SeqArray::seqGetData(gdsfile = data, var.name = "sample.id")

    all.missing <- FALSE # by default
    if (chrom.snp.n > 1) {
      # Adjusting the parallel.core argument ---------------------------------
      # SNPRelate doesnt like when lower than number of markers used...
      parallel.core.temp <- max(1L, length(x$MARKERS))
      # parallel.core.temp <- max(1L, n.markers)

      if (parallel.core <= parallel.core.temp) {
        parallel.core.temp <- parallel.core
      }

      # R2 is handled in radiator, not SNPRelate...
      if (ld.method == "r2") {
        ld.m <- "r"
      } else {
        ld.m <- ld.method
      }
      if (verbose && denovo) message("Linkage Disequilibrium (LD) estimation on genotypes")
      res.chrom <- SNPRelate::snpgdsLDMat(
        gdsobj = data,
        snp.id = x$VARIANT_ID,
        sample.id = w.s,
        slide = -1,
        mat.trim = FALSE,
        method = ld.m,
        num.thread = parallel.core.temp,
        with.id = TRUE,
        verbose = FALSE) %$%
        LD %>%
        magrittr::set_colnames(x = ., x$MARKERS) %>%
        magrittr::set_rownames(x = ., x$MARKERS)

      # skip <- anyNA(res.chrom)
      # test <- res.chrom[anyNA(res.chrom)]

      # work on the output -------------------------------------------------------
      # Fill with NA the diagonal and the lower triangle...
      res.chrom[lower.tri(res.chrom, diag = TRUE)] <- rlang::na_dbl

      all.missing <- all(is.na(res.chrom))
      if (!all.missing) {

        if (ld.method == "r2") {
          res.chrom <- res.chrom^2 #r^2 used here...
        }
        # Figures ------------------------------------------------------------------
        if (ld.figures && unique(x$LD_SUBSAMPLE)) {
          ld.markers$fig.data <- as.vector(res.chrom) %>% magrittr::extract(!is.na(.))
        }

        if (!denovo) {
          # Generate the missingness stats -----------------------------------------
          markers.missing <- tibble::tibble(
            MARKERS = x$MARKERS,
            MISSING_PROP = SeqArray::seqMissing(
              gdsfile = data,
              per.variant = TRUE,
              parallel = parallel.core.temp
            )
          ) %>%
            dplyr::mutate(GENOTYPED_PROP = 1 - MISSING_PROP, MISSING_PROP = NULL)

          # Pruning the SNPs -------------------------------------------------------
          # These LD values are not used anyway during the pruning
          if (!is.null(ld.threshold)) {
            if (length(ld.threshold) == 1) {
              res.chrom[res.chrom <= ld.threshold] <- rlang::na_dbl
            }
          }
          # remove rows and cols with all missing values
          res.chrom <- res.chrom[rowSums(res.chrom, na.rm = TRUE) > 0, colSums(res.chrom, na.rm = TRUE) > 0, drop = FALSE]

          if (length(res.chrom) >= 1) {
            res.chrom <- ld2df(x = res.chrom)

            if (nrow(res.chrom) > 0) {
              ld.markers$blacklist.markers <- purrr::map_dfr(
                .x = ld.threshold,
                .f = ld_pruning,
                ld.tibble = res.chrom,
                stats = markers.missing
              )

              # test <- ld_pruning(ld.tibble = res.chrom, stats = markers.missing, ld.threshold = 0.9, verbose = TRUE)

              snp.blacklist <- length(ld.markers$blacklist.markers$MARKERS)
            } else {
              if (verbose) message("    SNPs blacklisted: 0")
              ld.markers$blacklist.markers <- NULL
              snp.blacklist <- 0L
            }
          } else {
            if (verbose) message("    SNPs blacklisted: 0")
            ld.markers$blacklist.markers <- NULL
            snp.blacklist <- 0L
          }


          # chrom stats ----------------------------------------------------------------
          n.threshold <- length(ld.threshold)
          if (n.threshold == 1) {
            ld.markers$chrom.ld.stats <- tibble::tibble(
              CHROM = chrom.name,
              NUMBER_SNP = chrom.snp.n,
              NUMBER_SNP_BLACKLISTED = snp.blacklist,
              LD_THRESHOLD = ld.threshold)
          } else if (!is.null(ld.markers$blacklist.markers)) {
            ld.markers$chrom.ld.stats <- ld.markers$blacklist.markers %>%
              dplyr::group_by(LD_THRESHOLD) %>%
              dplyr::tally(.) %>%
              dplyr::mutate(
                CHROM = chrom.name,
                NUMBER_SNP = chrom.snp.n,
                NUMBER_SNP_BLACKLISTED = n,
                n = NULL
              )
          } else {
            ld.markers$chrom.ld.stats <- tibble::tibble(
              CHROM = rep(chrom.name, n.threshold),
              NUMBER_SNP = rep(chrom.snp.n, n.threshold),
              NUMBER_SNP_BLACKLISTED = rep(0L, n.threshold),
              LD_THRESHOLD = ld.threshold)
          }
        }
      }#skip
    }

    if (!denovo) {
      # When number of snp per chrom is 1 or with prob SNPRelate that
      if (all.missing || chrom.snp.n == 1) {
        if (verbose) message("    SNPs blacklisted: 0")
        ld.markers$blacklist.markers <- NULL
        snp.blacklist <- 0L
        # want here
        n.threshold <- length(ld.threshold)
        ld.markers$blacklist.markers <- NULL
        if (n.threshold == 1) {
          ld.markers$chrom.ld.stats <- tibble::tibble(
            CHROM = chrom.name,
            NUMBER_SNP = chrom.snp.n,
            NUMBER_SNP_BLACKLISTED = snp.blacklist,
            LD_THRESHOLD = ld.threshold)
        } else {
          ld.markers$chrom.ld.stats <- tibble::tibble(
            CHROM = rep(chrom.name, n.threshold),
            NUMBER_SNP = rep(1L, n.threshold),
            NUMBER_SNP_BLACKLISTED = rep(0L, n.threshold),
            LD_THRESHOLD = ld.threshold)
        }
      }
    }

    # Reset the filter of the GDS
    SeqArray::seqSetFilter(object = data,
                           variant.id = w.m,
                           sample.id = w.s,
                           verbose = FALSE)
    return(ld.markers)
  }#End ld_m

  if (length(ld.threshold) > 1) {
    if (verbose) message("Computing LD by CHROM/scaffold (n = ", length(unique(wl$CHROM)), "), with several LD thresholds")
  } else {
    if (verbose) message("Computing LD by CHROM/scaffold (n = ", length(unique(wl$CHROM)), "), with LD threshold: ", ld.threshold)
  }

  chrom.ld <- wl %>%
    dplyr::select(CHROM, CHROM_TICK, VARIANT_ID, MARKERS, LD_SUBSAMPLE) %>%
    split(x = ., f = .$CHROM) %>%
    purrr::map(
      .x = .,
      .f = ld_m,
      data = data,
      denovo = denovo,
      ld.threshold = ld.threshold,
      ld.method = ld.method,
      ld.figures = ld.figures,
      parallel.core = parallel.core,
      verbose = verbose)

  # stats
  if (!denovo) {
    chrom.stats <- purrr::map_dfr(.x = chrom.ld, .f = "chrom.ld.stats") %>%
      dplyr::mutate(PROP_BLACKLISTED = round(NUMBER_SNP_BLACKLISTED / NUMBER_SNP, 2))


    readr::write_tsv(
      x = chrom.stats,
      file = file.path(path.folder, "snp.ld.chrom.stats.tsv"))
  }

  # figure: distribution number of SNPs per scaffolds/chrom
  if (!denovo) {
    d.plot <- ggplot2::ggplot(
      data = chrom.stats,
      ggplot2::aes(x = NUMBER_SNP)) +
      ggplot2::geom_histogram(binwidth = round(stats::sd(x = chrom.stats$NUMBER_SNP, na.rm = TRUE))) +
      ggplot2::labs(
        title = "The distribution in the number of SNPs per chromosome/scaffold",
        subtitle = paste0("Number of chromosome/scaffold = ", nrow(chrom.stats), "\nMean number of SNPs/chrom/scaffold = ", round(mean(x = chrom.stats$NUMBER_SNP, na.rm = TRUE))),
        x = "Number of SNPs per chromosome/scaffold",
        y = "Distribution (number of chromosome/scaffold)") +
      ggplot2::theme_bw()+
      ggplot2::theme(
        plot.title = ggplot2::element_text(size = 12, family = "Helvetica", face = "bold", hjust = 0.5),
        plot.subtitle = ggplot2::element_text(size = 10, family = "Helvetica"),
        legend.position = "none",
        axis.title.x = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
        axis.title.y = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
        axis.text.x = ggplot2::element_text(size = 8, family = "Helvetica", angle = 90, hjust = 1, vjust = 0.5),
        strip.text.x = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold")
      ) +
      ggplot2::facet_grid(~ LD_THRESHOLD)
    print(d.plot)
    # save
    ggplot2::ggsave(
      filename = file.path(path.folder, "snp.number.per.chromosome.pdf"),
      plot = d.plot,
      width = 30,
      height = 10,
      dpi = 300,
      units = "cm",
      limitsize = FALSE,
      useDingbats = FALSE
    )

    # Whitelists and blacklists
    ld_wl_bl <- function(x, wl, path.folder, verbose) {

      bl.ld <- purrr::map_dfr(.x = chrom.ld, .f = "blacklist.markers")
      generate_whitelist_ld <- function(bl.ld, wl, path.folder) {
        bl <- wl
        wl %<>% dplyr::filter(!MARKERS %in% bl.ld$MARKERS)
        ld.t <- unique(bl.ld$LD_THRESHOLD)
        filename <- stringi::stri_join("whitelist.long.ld_", ld.t, ".tsv")
        readr::write_tsv(
          x = wl, file = file.path(path.folder, filename),
          append = FALSE, col_names = TRUE)

        bl <- dplyr::setdiff(bl, wl)
        filename <- stringi::stri_join("blacklist.long.ld_", ld.t, ".tsv")
        readr::write_tsv(
          x = bl,
          file = file.path(path.folder, filename),
          append = FALSE, col_names = TRUE)
        res = list(wl = wl, bl = bl)
        return(res)
      }#End generate_whitelist_ld

      wl.ld <- split(x = bl.ld, f = bl.ld$LD_THRESHOLD) %>%
        purrr::map(.x = ., .f = generate_whitelist_ld, wl = wl, path.folder = path.folder)
      if (verbose) message("File written: whitelist(s) and blacklist(s)")
      return(wl.ld)
    }#End ld_wl_bl
    wl.bl.ld <- ld_wl_bl(x = chrom.ld, wl = wl, path.folder = path.folder, verbose = verbose)
  }

  # stats and figures
  if (ld.figures) {
    fig.data <- purrr::map(.x = chrom.ld, .f = "fig.data") %>%
      purrr::flatten_dbl(.)

    ld.summary <- tibble_stats(x = fig.data[!is.na(fig.data)],
                               group = "Long LD")

    if (verbose) message("Generating figures...")
    if (denovo) message("Remember it's fake data...")

    if (ld.method == "r2") {
      ld.title <- expression(paste("Long distance linkage disequilibrium (", r^2, ")"))
    }

    if (ld.method == "r") {
      ld.title <- "Long distance linkage disequilibrium (r)"
    }

    if (ld.method == "dprime") {
      ld.title <- "Long distance linkage disequilibrium (D')"
    }

    if (ld.method == "corr") {
      ld.title <- "Long distance linkage disequilibrium (corr)"
    }

    if (ld.method == "composite") {
      ld.title <- "Long distance linkage disequilibrium (composite)"
    }

    ld.boxplot <- boxplot_stats(
      data = ld.summary,
      title = "Markers long distance linkage disequilibrium (LD)",
      subtitle = paste0("Outlier high: ", round(ld.summary$OUTLIERS_HIGH, 2)),
      x.axis.title = NULL,
      y.axis.title = ld.title,
      path.folder = path.folder,
      bp.filename = "snp.long.ld.boxplot.pdf")
  }#End ld.figures
  if (!denovo) {
    return(wl.bl.ld)
  } else {
    return(NULL)
  }
}#End ld_missing
thierrygosselin/radiator documentation built on May 5, 2024, 5:12 a.m.