R/summary_haplotypes.R

Defines functions summary_haplotypes

Documented in summary_haplotypes

## Summary and tables

#' @title Haplotypes file summary
#' @description STACKS batch_x.haplotypes.tsv file summary.
#' The output of the function is a summary table for populations with:
#' \enumerate{
#' \item assembly artifacts and/or sequencing errors: loci with > 2 alleles.
#' Genotypes with more than 2 alleles are erased before estimating the subsequent statistics.
#' \item consensus reads (reads with no variation/polymorphism throughout the dataset).
#' \item monomorphic loci (at the population and overall level)
#' \item polymorphic loci (at the population and overall level)
#' \item haplotypes statistics for the observed and expected homozygosity and
#' heterozygosity (HOM_O, HOM_E, HET_O, HET_E).
#' \item Gene Diversity within populations (Hs) and averaged over all populations.
#' Not to confuse with Expected Heterozygosity(HET_E).
#' Here, Hs statistic includes a correction for sampling bias stemming from
#' sampling a limited number of individuals per population (Nei, 1987).
#' \item Wright’s inbreeding coefficient (Fis)
#' \item Nei's inbreeding coefficient (Gis) that include a correction for sampling bias.
#' \item \strong{IBDG}: a proxy measure of the realized proportion of the genome
#' that is identical by descent
#' \item \strong{FH measure}: an individual-base estimate that is
#' based on the excess in the observed number of homozygous
#' genotypes within an individual relative to the mean number of homozygous
#' genotypes expected under random mating (Keller et al., 2011; Kardos et al., 2015).
#' \item \strong{Pi}: the nucleotide diversity (Nei & Li, 1979) measured here consider the
#' consensus reads in the catalog (no variation between population sequences).
#' The \code{read.length} argument is used directly in the calculations.
#' To be correctly estimated, the reads obviously need to be of identical size...
#' \item \strong{PRIVATE_HAPLOTYPES}: the number of private haplotypes.
#' }

#' @param data The 'batch_x.haplotypes.tsv' created by STACKS.

#' @param strata A tab delimited file with 2 columns with header:
#' \code{INDIVIDUALS} and \code{STRATA}.
#' The \code{STRATA} column can be any hierarchical grouping.
#' To create a strata file see \href{https://thierrygosselin.github.io/radiator/reference/individuals2strata.html}{individuals2strata}.
#' If you have already run
#' \href{http://catchenlab.life.illinois.edu/stacks/}{stacks} on your data,
#' the strata file is similar to a stacks `population map file`, make sure you
#' have the required column names (\code{INDIVIDUALS} and \code{STRATA}).
#' The strata also works as a whitelist of ids,
#' individuals not in the strata file will be removed from the dataset.

#' @param read.length (number) The length in nucleotide of your reads
#' (e.g. \code{read.length = 100}).


#' @param whitelist.markers (optional) A whitelist of loci with a column name
#' 'LOCUS'. The whitelist is located in the global environment or in the
#' directory (e.g. "whitelist.txt").
#' If a whitelist is used, the files written to the directory will have
#' '.filtered' in the filename.
#' Default: \code{whitelist.markers = NULL}.

#' @param blacklist.id (optional) A blacklist with individual ID and
#' a column header 'INDIVIDUALS'. The blacklist is in the global environment
#' or in the directory (with "blacklist.txt").
#' Default: \code{blacklist.id = NULL}.

#' @param keep.consensus (optional) Using whitelist of markers can automatically
#' remove consensus markers from the nucleotide diversity (Pi) calculations.
#' If you know what you are doing, play with this argument
#' to keep consensus markers.
#' Default: \code{keep.consensus = TRUE}.

#' @param pop.levels (optional, string) This refers to the levels in a factor. In this
#' case, the id of the pop.
#' Use this argument to have the pop ordered your way instead of the default
#' alphabetical or numerical order. e.g. \code{pop.levels = c("QUE", "ONT", "ALB")}
#' instead of the default \code{pop.levels = c("ALB", "ONT", "QUE")}.
#' White spaces in population names are replaced by underscore.
#' Default: \code{pop.levels = NULL}.


#' @param pop.labels (optional, string) Use this argument to rename/relabel
#' your pop or combine your pop. e.g. To combine \code{"QUE"} and \code{"ONT"}
#' into a new pop called \code{"NEW"}:
#' (1) First, define the levels for your pop with \code{pop.levels} argument:
#' \code{pop.levels = c("QUE", "ONT", "ALB")}.
#' (2) then, use \code{pop.labels} argument:
#' \code{pop.levels = c("NEW", "NEW", "ALB")}.
#' To rename \code{"QUE"} to \code{"TAS"}:
#' \code{pop.labels = c("TAS", "ONT", "ALB")}.
#' Default: \code{pop.labels = NULL}. If you find this too complicated,
#' there is also the \code{strata} argument that can do the same thing,
#' see below.
#' White spaces in population names are replaced by underscore.

#' @param artifact.only (optional, logical) With default \code{artifact.only = TRUE},
#' the function will stop after generating blacklists of artifacts and consensus loci
#' (no pi, no FH measures). This is faster and useful to run at the beginning of the pipeline.
#' Run again at the end, with blacklists and whitelists to get reliable estimates of
#' gene diverisity, IBDG and FH.

#' @param parallel.core (optional) The number of core for parallel
#' programming during Pi calculations.
#' Default: \code{parallel.core = parallel::detectCores() - 1}.
#' @return The function returns a list with:
#' \enumerate{
#' \item $consensus.pop # if consensus reads are found
#' \item $consensus.loci # if consensus reads are found
#' \item $artifacts.pop # if artifacts are found
#' \item $artifacts.loci # if artifacts are found
#' \item $individual.summary: the individual's info for FH and Pi
#' \item $summary: the summary statistics per populations and averaged over all pops.
#' \item $private.haplotypes: a tibble with LOCUS, POP_ID and private haplotypes
#' \item $private.haplotypes.summary: a summary of the number of private haplotypes per populations
#' }
#' Also in the list 3 plots (also written in the folder):
#' \enumerate{
#' \item $scatter.plot.Pi.Fh.ind: showing Pi and FH per individuals
#' \item $scatter.plot.Pi.Fh.pop: showing Pi and FH per pop
#' \item $boxplot.pi: showing the boxplot of Pi per pop
#' \item $boxplot.fh: showing the boxplot of FH per pop
#' }
#' use $ to access each #' objects in the list.
#' The function potentially write 4 files in the working directory:
#' blacklist of unique loci with assembly artifacts and/or sequencing errors
#' (per individuals and globally), a blacklist of unique consensus loci
#' and a summary of the haplotype file by population.

#' @examples
#' \dontrun{
#' # The simplest way to run the function:
#' sum <- summary_haplotypes(
#' data = "batch_1.haplotypes.tsv",
#' strata = "strata_brook_charr.tsv",
#' read.length = 90)
#' # if you want pi and fh calculated (ideally on filtered data):
#' sum <- summary_haplotypes(
#' data = "batch_1.haplotypes.tsv",
#' strata = "strata_brook_charr.tsv",
#' read.length = 90,
#' artifact.only = FALSE)
#' }


#' @rdname summary_haplotypes
#' @export

#' @references Keller MC, Visscher PM, Goddard ME (2011)
#' Quantification of inbreeding due to distant ancestors and its detection
#'  using dense single nucleotide polymorphism data. Genetics, 189, 237–249.
#' @references Kardos M, Luikart G, Allendorf FW (2015)
#' Measuring individual inbreeding in the age of genomics: marker-based
#' measures are better than pedigrees. Heredity, 115, 63–72.
#' @references Nei M, Li WH (1979)
#' Mathematical model for studying genetic variation in terms of
#' restriction endonucleases.
#' Proceedings of the National Academy of Sciences of
#' the United States of America, 76, 5269–5273.

#' @author Thierry Gosselin \email{thierrygosselin@@icloud.com} and
#' Anne-Laure Ferchaud \email{annelaureferchaud@@gmail.com}


summary_haplotypes <- function(
  data,
  strata = NULL,
  read.length,
  whitelist.markers = NULL,
  blacklist.id = NULL,
  keep.consensus = TRUE,
  pop.levels = NULL,
  pop.labels = NULL,
  artifact.only = TRUE,
  parallel.core = parallel::detectCores() - 1
) {
  cat("#######################################################################\n")
  cat("#################### stackr::summary_haplotypes #######################\n")
  cat("#######################################################################\n")
  timing <- proc.time()
  opt.change <- getOption("width")
  options(width = 70)
  res <- list() # to store results

  if (artifact.only) {
    message("artifact.only: TRUE
    the function will only output artifacts and consensus blacklists.\n")
  } else {
    message("artifact.only: FALSE
    the function will run and compute all summary statistics, time for coffee...\n")
  }

  if (missing(data)) stop("data argument is missing")
  if (missing(read.length)) stop("read.length argument is required")

  if (!is.null(pop.levels) & is.null(pop.labels)) {
    pop.levels <- stringi::stri_replace_all_fixed(pop.levels, pattern = " ", replacement = "_", vectorize_all = FALSE)
    pop.labels <- pop.levels
  }

  if (!is.null(pop.labels) & is.null(pop.levels)) stop("pop.levels is required if you use pop.labels")

  if (!is.null(pop.labels)) {
    if (length(pop.labels) != length(pop.levels)) stop("pop.levels and pop.labels with different length: check arguments")
    pop.labels <- stringi::stri_replace_all_fixed(pop.labels, pattern = " ", replacement = "_", vectorize_all = FALSE)
  }

  # Date and time --------------------------------------------------------------
  file.date <- format(Sys.time(), "%Y%m%d@%H%M")
  folder.extension <- stringi::stri_join("summary_haplotypes_", file.date, sep = "")
  path.folder <- file.path(getwd(), folder.extension)
  dir.create(path.folder)

  message("Folder created: \n", folder.extension)
  file.date <- NULL #unused object

  # Import and filter blacklist id ---------------------------------------------
  if (!is.null(blacklist.id)) {
    blacklist.id <- suppressMessages(readr::read_tsv(blacklist.id, col_names = TRUE))
    message("Filtering with blacklist of individuals: ", nrow(blacklist.id), " individual(s) blacklisted")
    blacklist.id$INDIVIDUALS <- clean_ind_names(blacklist.id$INDIVIDUALS)
  }

  # STRATA ---------------------------------------------------------------------
  strata.df <- strata_haplo(strata = strata, data = data, blacklist.id = blacklist.id)

  # Import haplotype file ------------------------------------------------------
  message("Importing and tidying STACKS haplotype file: ", data)

  # import header row
  want <- tibble::data_frame(
    INFO = "CATALOG",
    COL_TYPE = "c") %>%
    dplyr::bind_rows(
      dplyr::select(strata.df, INFO = INDIVIDUALS) %>%
        dplyr::mutate(
          COL_TYPE = rep("c", n()),
      INFO = clean_ind_names(INFO)
    ))

  haplo.col.type <- readr::read_tsv(
    file = data,
    n_max = 1,
    na = "-",
    col_names = FALSE,
    col_types = readr::cols(.default = readr::col_character())) %>%
    tidyr::pivot_longer(
      data = .,
      cols = tidyselect::everything(),
      names_to = "DELETE",
      values_to = "INFO"
    ) %>%
    dplyr::mutate(INFO = clean_ind_names(INFO)) %>%
    dplyr::select(-DELETE) %>%
    dplyr::mutate(INFO = clean_ind_names(INFO)) %>%
    dplyr::left_join(want, by = "INFO") %>%
    dplyr::mutate(COL_TYPE = stringi::stri_replace_na(str = COL_TYPE, replacement = "_")) %>%
    dplyr::select(COL_TYPE)

  haplo.col.type[1,1] <- "c"

  haplo.col.type <- purrr::flatten_chr(haplo.col.type) %>% stringi::stri_join(collapse = "")

  # readr now faster/easier than fread...
  haplotype <- readr::read_tsv(
    file = data, col_names = TRUE, na = "-",
    col_types = haplo.col.type)

  colnames(haplotype) <- stringi::stri_replace_all_fixed(
    str = colnames(haplotype),
    pattern = c("# Catalog ID", "Catalog ID", "# Catalog Locus ID"),
    replacement = c("LOCUS", "LOCUS", "LOCUS"), vectorize_all = FALSE)

  if (tibble::has_name(haplotype, "Seg Dist")) {
    haplotype <- dplyr::select(.data = haplotype, -`Seg Dist`)
  }
  haplotype <- tidyr::pivot_longer(
    data = haplotype,
    cols = -LOCUS,
    names_to = "INDIVIDUALS",
    values_to = "HAPLOTYPES"
  )

  haplotype$INDIVIDUALS <- clean_ind_names(haplotype$INDIVIDUALS)

  want <- haplo.col.type <- NULL

  # Import whitelist of markers-------------------------------------------------
  if (is.null(whitelist.markers)) {
    haplotype <- dplyr::mutate(haplotype, WHITELIST = rep("whitelist", n()))
  } else {
    whitelist.markers <- suppressMessages(readr::read_tsv(whitelist.markers, col_names = TRUE))

    # Check that whitelist has "LOCUS" column
    if (!tibble::has_name(whitelist.markers, "LOCUS")) {
      stop("Whitelist requires a LOCUS column, check argument documentation...")
    }
    whitelist.markers <- dplyr::select(.data = whitelist.markers, LOCUS) %>%
      dplyr::mutate(LOCUS = as.character(LOCUS)) %>%
      dplyr::arrange(as.integer(LOCUS))

    haplotype <- haplotype %>%
      dplyr::mutate(
        WHITELIST = dplyr::if_else(LOCUS %in% whitelist.markers$LOCUS,
                                   "whitelist", "blacklist")
      )
    whitelist.markers <- "whitelist"
  }



  # Population levels and strata -----------------------------------------------
  message("Making the haplotype file population-wise...")
  haplotype <- dplyr::left_join(x = haplotype, y = strata.df, by = "INDIVIDUALS")
  strata.df <- NULL

  # using pop.levels and pop.labels info if present
  haplotype <- change_pop_names(data = haplotype, pop.levels = pop.levels, pop.labels = pop.labels)

  pop <- unique(haplotype$POP_ID)

  # Locus with consensus alleles -----------------------------------------------
  message("Scanning for consensus markers...")
  consensus.pop <- haplotype %>%
    dplyr::filter(HAPLOTYPES == "consensus") %>%
    dplyr::distinct(LOCUS, POP_ID) %>%
    dplyr::mutate(CONSENSUS = rep("consensus", times = n())) %>%
    dplyr::arrange(as.numeric(LOCUS), POP_ID)

  if (length(consensus.pop$CONSENSUS) > 0) {
    # Create a list of consensus loci
    blacklist.loci.consensus <- consensus.pop %>%
      dplyr::ungroup(.) %>%
      dplyr::distinct(LOCUS) %>%
      dplyr::arrange(as.numeric(LOCUS))
    message("    Generated a file with ", nrow(blacklist.loci.consensus), " consensus loci")
    message("    Details in: blacklist.loci.consensus.txt")
    readr::write_tsv(
      x = blacklist.loci.consensus,
      path = stringi::stri_join(path.folder, "/blacklist.loci.consensus.txt"),
      col_names = TRUE
    )

    haplotype <- dplyr::mutate(
      haplotype,
      CONSENSUS = dplyr::if_else(
        LOCUS %in% blacklist.loci.consensus$LOCUS,
        "consensus", "not_consensus"))

    blacklist.loci.consensus.sum <- blacklist.loci.consensus %>%
      dplyr::ungroup(.) %>%
      dplyr::summarise(CONSENSUS = n()) %>%
      dplyr::select(CONSENSUS)

    consensus <- consensus.pop %>%
      dplyr::group_by(POP_ID) %>%
      dplyr::summarise(CONSENSUS = dplyr::n_distinct(LOCUS)) %>%
      dplyr::ungroup(.)

    res$consensus.pop <- consensus.pop
    res$consensus.loci <- blacklist.loci.consensus
    consensus.pop <- blacklist.loci.consensus <- NULL
  } else {
    res$consensus.pop <- NULL
    res$consensus.loci <- NULL
    blacklist.loci.consensus.sum <- tibble::data_frame(CONSENSUS = as.integer(0))
    consensus <- tibble::data_frame(
      POP_ID = pop,
      CONSENSUS = as.integer(rep(0, length(pop)))
    )
    haplotype <- dplyr::mutate(haplotype, CONSENSUS = rep("not_consensus", n()))
  }

  # samples per pop
  sample.number <- dplyr::distinct(haplotype, INDIVIDUALS, POP_ID) %>%
    dplyr::group_by(POP_ID) %>%
    dplyr::tally(.) %>%
    dplyr::rename(NUM = n) %>%
    dplyr::ungroup(.) %>%
    tibble::add_row(.data = ., POP_ID = "OVERALL", NUM = sum(.$NUM))

  # Sequencing errors and assembly artifact ------------------------------------
  # Locus with > 2 alleles by individuals
  message("Scanning for assembly artifacts...")
  # haplotype <- haplotype %>%
  #   dplyr::mutate(
  #     ARTIFACTS = stringi::stri_count_fixed(HAPLOTYPES, "/"),
  #     ARTIFACTS = dplyr::if_else(ARTIFACTS > 1, "artifact", "not_artifact"))

  haplotype <- haplotype %>%
    dplyr::mutate(
      POLYMORPHISM = stringi::stri_count_fixed(HAPLOTYPES, "/"),
      POLYMORPHISM = dplyr::if_else(
        POLYMORPHISM > 1, "artifact",
        dplyr::if_else(
          POLYMORPHISM == 1, "het", "hom", missing = "missing")),
      POLYMORPHISM = dplyr::if_else(CONSENSUS == "consensus", "consensus", POLYMORPHISM),
      POLYMORPHISM = dplyr::if_else(stringi::stri_detect_fixed(HAPLOTYPES, "N"),
                                    "artifact", POLYMORPHISM)
    ) %>%
    dplyr::select(POP_ID, INDIVIDUALS, LOCUS, HAPLOTYPES, WHITELIST, POLYMORPHISM)

  # artifacts.ind <- dplyr::filter(
  #   haplotype,
  #   WHITELIST == "whitelist",
  #   CONSENSUS == "not_consensus")

  artifacts.ind <- dplyr::filter(
    haplotype,
    WHITELIST == "whitelist",
    POLYMORPHISM != "consensus")

  total.genotype.number.haplo <- length(stats::na.omit(artifacts.ind$HAPLOTYPES))

  artifacts.ind <- artifacts.ind %>%
    dplyr::filter(POLYMORPHISM == "artifact") %>%
    dplyr::arrange(LOCUS, POP_ID, INDIVIDUALS) %>%
    dplyr::select(LOCUS, POP_ID, INDIVIDUALS, HAPLOTYPES, ARTIFACTS = POLYMORPHISM)


  if (nrow(artifacts.ind) > 0) {
    erased.genotype.number <- length(artifacts.ind$HAPLOTYPES)
    percent.haplo <- paste(round(((erased.genotype.number/total.genotype.number.haplo)*100), 2), "%", sep = " ")

    # Write the list of locus, individuals with artifacts
    if (is.null(whitelist.markers)) {
      filename.artifacts.ind <- "blacklist.loci.artifacts.ind.txt"
    } else {
      filename.artifacts.ind <- "blacklist.loci.artifacts.ind.filtered.txt"
    }
    readr::write_tsv(
      x = artifacts.ind,
      path = stringi::stri_join(path.folder, "/", filename.artifacts.ind),
      col_names = TRUE)
    message("    Generated a file with ",
            nrow(artifacts.ind),
            " artifact(s) genotype(s) by individuals")
    message("    Details in: ", filename.artifacts.ind)

    # res$artifacts.pop <- dplyr::ungroup(artifacts.ind) %>%
    #   dplyr::distinct(LOCUS, POP_ID, ARTIFACTS) %>%
    #   dplyr::arrange(as.numeric(LOCUS), POP_ID)

    res$artifacts.pop <- dplyr::ungroup(artifacts.ind) %>%
      dplyr::distinct(LOCUS, POP_ID, ARTIFACTS) %>%
      dplyr::arrange(as.numeric(LOCUS), POP_ID)

    artifacts <- res$artifacts.pop %>%
      dplyr::group_by(POP_ID) %>%
      dplyr::tally(.) %>%
      dplyr::rename(ARTIFACTS = n)

    res$artifacts.loci <- dplyr::ungroup(res$artifacts.pop) %>%
      dplyr::distinct(LOCUS) %>%
      dplyr::arrange(as.numeric(LOCUS))

    blacklist.loci.artifacts.sum <- res$artifacts.loci %>%
      dplyr::ungroup(.) %>%
      dplyr::summarise(ARTIFACTS = n()) %>%
      dplyr::select(ARTIFACTS)

    # Write the unique list of paralogs blacklisted to a file
    if (is.null(whitelist.markers)) {
      filename.paralogs <- "blacklist.loci.artifacts.txt"
    } else {
      filename.paralogs <- "blacklist.loci.artifacts.filtered.txt"
    }
    readr::write_tsv(
      x = res$artifacts.loci,
      path = stringi::stri_join(path.folder, "/", filename.paralogs),
      col_names = TRUE
    )

    message("    Generated a file with ", nrow(res$artifacts.loci),
            " artifact loci")
    message("    Details in: ", filename.paralogs)

    message("    Out of a total of ", total.genotype.number.haplo, " genotypes")
    message("      ", percent.haplo, " (", erased.genotype.number, ")"," outlier genotypes will be erased")
    haplotype <- suppressWarnings(
      haplotype %>%
        dplyr::mutate(
          HAPLOTYPES = dplyr::if_else(POLYMORPHISM == "artifact", NA_character_, HAPLOTYPES)
        ))
  } else {
    res$artifacts.pop <- res$artifacts.loci <- NULL
    artifacts <- tibble::data_frame(POP_ID = pop, ARTIFACTS = as.integer(rep(0, length(pop))))
    blacklist.loci.artifacts.sum <- tibble::data_frame(ARTIFACTS = as.integer(0))
    # haplotype <- dplyr::mutate(haplotype, ARTIFACTS = rep("not_artifact", n()))
  }
  artifacts.ind <- NULL


  # bind info
  consensus.artifacts <- dplyr::full_join(artifacts, consensus, by = "POP_ID")
  artifacts <- consensus <- NULL

  haplotype.filtered <- haplotype %>%
    dplyr::filter(WHITELIST == "whitelist", POLYMORPHISM != "not_consensus")

  n.individuals <- dplyr::n_distinct(haplotype.filtered$INDIVIDUALS)
  n.markers <- dplyr::n_distinct(haplotype.filtered$LOCUS)

  if (!artifact.only) {
    # FH ------------------------------------------------------------------------
    message("Genome-Wide Identity-By-Descent calculations (FH)...")


    # summary.ind <- dplyr::mutate(
    #   .data = haplotype.filtered,
    #   IND_LEVEL_POLYMORPHISM = dplyr::if_else(
    #     stringi::stri_count_fixed(HAPLOTYPES, "/") == 1,
    #     "het", "hom", missing = "missing"))

    # summary <- dplyr::mutate(
    #   .data = haplotype.filtered,
    #   IND_LEVEL_POLYMORPHISM = dplyr::if_else(
    #     stringi::stri_count_fixed(HAPLOTYPES, "/") == 1,
    #     "het", "hom", missing = "missing"))

    # Some could start averaging by individuals accross markers and then by populations...
    # GenoDive, Hierfstat and Nei computes by locus...

    if (n.markers < 5000) {
      summary.loc.pop <- haplotype.filtered %>%
        dplyr::group_by(LOCUS, POP_ID) %>%
        dplyr::summarise(
          HOM = length(POLYMORPHISM[POLYMORPHISM == "hom"]),
          HET = length(POLYMORPHISM[POLYMORPHISM == "het"]),
          N_GENOT = HOM + HET,
          HOM_O = HOM / N_GENOT,
          HET_O = HET / N_GENOT
        ) %>%
        dplyr::mutate(
          NN = 2 * N_GENOT,
          N_INV = N_GENOT / (N_GENOT - 1)
        ) %>%
        dplyr::ungroup(.)
    } else {

      split.vec <- dplyr::distinct(haplotype.filtered, LOCUS) %>%
        dplyr::mutate(SPLIT_VEC = as.integer(floor((parallel.core * 3 * (1:n.markers - 1) / n.markers) + 1)))

      # split_vec_row

      ## Need to upgrade to split_tibble_rows
      # summary.loc.pop <- split_tibble_rows()


      summary.loc.pop <- haplotype.filtered %>%
        dplyr::left_join(split.vec, by = "LOCUS") %>%
        split(x = ., f = .$SPLIT_VEC) %>%
        .stackr_parallel(
          X = .,
          FUN = loc_pop_stats,
          mc.cores = parallel.core
        ) %>%
        dplyr::bind_rows(.)

      split.vec <- NULL
    }

    # HS Gene Diversity within populations ---------------------------------------
    # not the same as Expected Heterozygosity

    # haplotype.filtered.genotyped <- haplotype.filtered %>% dplyr::filter(!is.na(HAPLOTYPES))

    n.row <- nrow(haplotype.filtered)
    split.vec <- as.integer(floor((parallel.core * 20 * (1:n.row - 1) / n.row) + 1))
    n.row <- NULL

    # split.vec.locus <- dplyr::distinct(haplotype.filtered, LOCUS) %>%
    #   dplyr::mutate(
    #     SPLIT_VEC = factor(as.integer(floor((parallel.core * 20 * (1:n() - 1) / n()) + 1)))
    #   )

    haplotype.filtered.split <- split(x = haplotype.filtered, f = split.vec) %>%
      .stackr_parallel_mc(
        X = ., FUN = separate_haplo_long, mc.cores = parallel.core) %>%
      dplyr::bind_rows(.)

    hs <- haplotype.filtered.split %>%
      dplyr::group_by(LOCUS, POP_ID) %>%
      dplyr::count(ALLELES) %>%
      dplyr::summarise(HOM_E = sum((n / sum(n))^2)) %>%
      dplyr::ungroup(.) %>%
      dplyr::full_join(dplyr::select(summary.loc.pop, LOCUS, POP_ID, HET_O, NN, N_INV), by = c("LOCUS", "POP_ID")) %>%
      dplyr::group_by(LOCUS, POP_ID) %>%
      dplyr::mutate(HS = N_INV * (1 - HOM_E - HET_O / NN)) %>%
      dplyr::ungroup(.)

    summary.pop <- summary.loc.pop %>%
      dplyr::group_by(POP_ID) %>%
      dplyr::summarise(
        HOM_O = mean(HOM_O, na.rm = TRUE),
        HET_O = mean(HET_O, na.rm = TRUE)
      ) %>%
      dplyr::ungroup(.) %>%
      dplyr::full_join(
        dplyr::group_by(.data = hs, POP_ID) %>%
          dplyr::summarise(
            HOM_E = mean(HOM_E, na.rm = TRUE),
            HET_E = (1 - HOM_E),
            HS = mean(HS, na.rm = TRUE)
            ) %>%
          dplyr::ungroup(.)
        , by = "POP_ID")

    summary.locus <- summary.loc.pop %>%
      dplyr::group_by(LOCUS) %>%
      dplyr::summarise(
        HOM_O = mean(HOM_O, na.rm = TRUE),
        HET_O = mean(HET_O, na.rm = TRUE)
      ) %>%
      dplyr::ungroup(.) %>%
      dplyr::full_join(
        dplyr::group_by(.data = hs, LOCUS) %>%
          dplyr::summarise(
            HOM_E = mean(HOM_E, na.rm = TRUE),
            HET_E = (1 - HOM_E),
            HS = mean(HS, na.rm = TRUE)
            ) %>%
          dplyr::ungroup(.)
        , by = "LOCUS")

    hs <- summary.loc.pop <- NULL

    summary.ind <- haplotype.filtered %>%
      dplyr::group_by(INDIVIDUALS, POP_ID) %>%
      dplyr::summarise(
        HOM = length(POLYMORPHISM[POLYMORPHISM == "hom"]),
        HET = length(POLYMORPHISM[POLYMORPHISM == "het"]),
        N_GENOT = HOM + HET,
        HOM_O = HOM / N_GENOT,
        HET_O = HET / N_GENOT
      ) %>%
      dplyr::ungroup(.)

    haplotype.filtered <- NULL

    # required functions
    # freq_hom <- function(x) {
    #   res <- dplyr::select(x, -SPLIT_VEC) %>%
    #     dplyr::group_by(LOCUS, POP_ID, ALLELES) %>%
    #     dplyr::summarise(
    #       FREQ_ALLELES = length(ALLELES)/mean(DIPLO),
    #       HOM_E = FREQ_ALLELES * FREQ_ALLELES
    #     ) %>%
    #     dplyr::select(-FREQ_ALLELES) %>%
    #     dplyr::ungroup(.)
    #   return(res)
    # }#End freq_hom

    # freq.hom.pop <- haplotype.filtered %>%
    #   dplyr::filter(!is.na(HAPLOTYPES))

    # diplo <- freq.hom.pop %>%
    #   dplyr::distinct(LOCUS, POP_ID, INDIVIDUALS) %>%
    #   dplyr::group_by(LOCUS, POP_ID) %>%
    #   dplyr::tally(.) %>%
    #   dplyr::ungroup(.) %>%
    #   dplyr::mutate(n = 2 * n) %>%
    #   dplyr::rename(DIPLO = n)

    # n.row <- nrow(haplotype.filtered.split)
    # split.vec <- as.integer(floor((parallel.core * 20 * (1:n.row - 1) / n.row) + 1))
    # n.row <- NULL

    # split.vec.locus <- dplyr::distinct(haplotype.filtered.split, LOCUS) %>%
    #   dplyr::mutate(
    #     SPLIT_VEC = factor(as.integer(floor((parallel.core * 20 * (1:n() - 1) / n()) + 1)))
    #   )

    # # freq.hom.pop <- split(x = freq.hom.pop, f = split.vec) %>%
    # #   .stackr_parallel_mc(
    # #     X = ., FUN = separate_haplo, mc.cores = parallel.core) %>%
    # #   dplyr::bind_rows(.) %>%
    # #   dplyr::left_join(diplo, by = c("LOCUS", "POP_ID")) %>%
    # #   dplyr::left_join(split.vec.locus, by = "LOCUS") %>%
    # #   split(x = ., f = .$SPLIT_VEC) %>%
    # #   .stackr_parallel_mc(
    # #     X = ., FUN = freq_hom, mc.cores = parallel.core) %>%
    # #   dplyr::bind_rows(.) %>%
    # #   dplyr::group_by(LOCUS, POP_ID) %>%
    # #   dplyr::summarise(HOM_E = sum(HOM_E)) %>%
    # #   dplyr::group_by(POP_ID) %>%
    # #   dplyr::summarise(HOM_E = mean(HOM_E))
    #
    #
    # freq.hom.pop <- hs %>%
    #   dplyr::group_by(POP_ID) %>%
    #   dplyr::summarise(HOM_E = mean(HOM_E))

    split.vec <- NULL
    # diplo <- split.vec.locus <- split.vec <- NULL

    # IBDg with FH ---------------------------------------------------------------
    # ok so here we need the info by id

    # fh.i <- suppressWarnings(
    #   summary.ind %>%
    #     dplyr::full_join(freq.hom.pop, by = "POP_ID") %>%
    #     dplyr::mutate(FH = ((HOM_O - HOM_E)/(N_GENOT - HOM_E))) %>%
    #     dplyr::select(INDIVIDUALS, POP_ID, FH)
    # )

    res$individual.summary <- suppressWarnings(
      summary.ind %>%
        dplyr::full_join(dplyr::select(summary.pop, -HOM_O, -HET_O), by = "POP_ID") %>%
        dplyr::mutate(FH = ((HOM_O - HOM_E)/(N_GENOT - HOM_E))) %>%
        dplyr::select(INDIVIDUALS, POP_ID, FH)
    )
    summary.ind <- NULL

    summary.pop <- summary.pop %>%
      dplyr::full_join(
        dplyr::group_by(.data = res$individual.summary, POP_ID) %>% dplyr::summarise(FH = mean(FH, na.rm = TRUE))
        , by = "POP_ID") %>%
      dplyr::mutate(
        FIS = dplyr::if_else(
          HET_O == 0, 0, round(((HET_E - HET_O) / HET_E), 6)),
        GIS = dplyr::if_else(
          HET_O == 0, 0, round(1 - HET_O / HS, 6))
      ) %>%
      dplyr::select(POP_ID, HOM_O, HOM_E, HET_O, HET_E, HS, FIS, GIS, FH)

    summary.locus <- summary.locus %>%
      dplyr::mutate(
        FIS = dplyr::if_else(
          HET_O == 0, 0, round(((HET_E - HET_O) / HET_E), 6)),
        GIS = dplyr::if_else(
          HET_O == 0, 0, round(1 - HET_O / HS, 6))
      )

    summary.locus <- NULL

    summary.overall <- suppressWarnings(dplyr::full_join(
      sample.number,
      dplyr::bind_rows(
        summary.pop,
        dplyr::summarise_if(
          .tbl = summary.pop, .predicate = is.numeric, .funs = mean, na.rm = TRUE) %>%
          tibble::add_column(.data = ., POP_ID = "OVERALL", .before = 1)
      ), by = "POP_ID"))

    sample.number <- summary.pop <- NULL
    # fh.tot <- fh.i %>%
    #   dplyr::summarise(
    #     HOM_O = round(mean(HOM_O), 6),
    #     HOM_E = round(mean(HOM_E), 6),
    #     HET_O = round(mean(1 - HOM_O), 6),
    #     HET_E = round(mean(1 - HOM_E), 6),
    #     FIS = ifelse(HET_O == 0, 0, round(((HET_E - HET_O) / HET_E), 6)),
    #     FH = mean(FH)
    #   )

    # fh.tot <- tibble::data_frame(POP_ID = "OVERALL") %>%
    #   dplyr::bind_cols(fh.tot)

    # fh.res <- suppressWarnings(dplyr::bind_rows(fh.pop, fh.tot) %>% dplyr::select(-POP_ID))
    res$summary <- suppressWarnings(
      dplyr::bind_rows(
        haplotype.filtered.split %>%
          dplyr::distinct(LOCUS, POP_ID, ALLELES) %>%
          dplyr::group_by(LOCUS, POP_ID) %>%
          dplyr::summarise(ALLELES_COUNT = length(ALLELES)) %>%
          dplyr::group_by(POP_ID) %>%
          dplyr::summarise(
            MONOMORPHIC = length(ALLELES_COUNT[ALLELES_COUNT == 1]),
            POLYMORPHIC = length(ALLELES_COUNT[ALLELES_COUNT >= 2])
          ) %>%
          dplyr::full_join(consensus.artifacts, by = "POP_ID"),
        haplotype.filtered.split %>%
          dplyr::distinct(LOCUS, ALLELES) %>%
          dplyr::group_by(LOCUS) %>%
          dplyr::summarise(ALLELES_COUNT_OVERALL = length(ALLELES)) %>%
          dplyr::ungroup(.) %>%
          dplyr::summarise(
            MONOMORPHIC = length(ALLELES_COUNT_OVERALL[ALLELES_COUNT_OVERALL == 1]),
            POLYMORPHIC = length(ALLELES_COUNT_OVERALL[ALLELES_COUNT_OVERALL >= 2])
          ) %>%
          dplyr::bind_cols(blacklist.loci.consensus.sum) %>%
          dplyr::bind_cols(blacklist.loci.artifacts.sum) %>%
          tibble::add_column(.data = ., POP_ID = "OVERALL", .before = 1)))


    if (keep.consensus) {
      res$summary <- res$summary %>%
        dplyr::group_by(POP_ID) %>%
        dplyr::mutate(
          TOTAL = MONOMORPHIC + POLYMORPHIC + CONSENSUS + ARTIFACTS,
          MONOMORPHIC_PROP = round(MONOMORPHIC / TOTAL, 4),
          POLYMORPHIC_PROP = round(POLYMORPHIC / TOTAL, 4),
          CONSENSUS_PROP = round(CONSENSUS / TOTAL, 4),
          ARTIFACTS_PROP = round(ARTIFACTS / TOTAL, 4)
        )
    } else {
      res$summary <- res$summary %>%
        dplyr::select(-CONSENSUS) %>%
        dplyr::group_by(POP_ID) %>%
        dplyr::mutate(
          TOTAL = MONOMORPHIC + POLYMORPHIC + ARTIFACTS,
          MONOMORPHIC_PROP = round(MONOMORPHIC / TOTAL, 4),
          POLYMORPHIC_PROP = round(POLYMORPHIC / TOTAL, 4),
          ARTIFACTS_PROP = round(ARTIFACTS / TOTAL, 4)
        )
    }
    res$summary <- suppressWarnings(
      dplyr::full_join(res$summary, summary.overall, by = "POP_ID"))

    consensus.artifacts <- blacklist.loci.consensus.sum <- blacklist.loci.artifacts.sum <- summary.overall <- NULL

    # Private haplotypes -------------------------------------------------------
    haplotype.filtered.split <- haplotype.filtered.split %>%
      dplyr::select(LOCUS, POP_ID, HAPLOTYPES = ALLELES) %>%
      dplyr::filter(!is.na(HAPLOTYPES)) %>%
      dplyr::distinct(LOCUS, POP_ID, HAPLOTYPES)

    res$private.haplotypes <- haplotype.filtered.split %>%
      dplyr::group_by(LOCUS, HAPLOTYPES) %>%
      dplyr::tally(.) %>%
      dplyr::filter(n == 1) %>%
      dplyr::select(-n) %>%
      dplyr::ungroup(.) %>%
      dplyr::left_join(haplotype.filtered.split, by = c("LOCUS", "HAPLOTYPES")) %>%
      dplyr::select(LOCUS, POP_ID, HAPLOTYPES) %>%
      readr::write_tsv(x = ., path = file.path(path.folder, "private.haplotypes.tsv"))

    res$private.haplotypes.summary <- res$private.haplotypes %>%
      dplyr::group_by(POP_ID) %>%
      dplyr::tally(.) %>%
      dplyr::ungroup(.) %>%
      tibble::add_row(.data = ., POP_ID = "OVERALL", n = sum(.$n)) %>%
      dplyr::mutate(PRIVATE_HAPLOTYPES = stringi::stri_join(POP_ID, " = ", n)) %>%
      readr::write_tsv(x = ., path = file.path(path.folder, "private.haplotypes.summary.tsv"))

    message("Number of private haplotypes = ", nrow(res$private.haplotypes))
    message("Strata with the highest number of private haplotypes = ", res$private.haplotypes.summary$POP_ID[res$private.haplotypes.summary$n == max(res$private.haplotypes.summary$n)])
    message("Number of private haplotype(s) per strata:\n", stringi::stri_join(res$private.haplotypes.summary$PRIVATE_HAPLOTYPES, collapse = "\n"))

    haplotype.filtered.split <- NULL
    # Nei & Li 1979 Nucleotide Diversity ---------------------------------------
    message("Nucleotide diversity (Pi):")
    message("    Read length used: ", read.length)

    # Pi: by individuals--------------------------------------------------------
    message("    Pi calculations: individuals...")
    if (keep.consensus) {
      pi.data <- dplyr::filter(
        haplotype,
        !is.na(HAPLOTYPES),
        WHITELIST == "whitelist" | !POLYMORPHISM %in% "artifact")
    } else {
      pi.data <- dplyr::filter(
        haplotype,
        !is.na(HAPLOTYPES),
        WHITELIST == "whitelist",
        !POLYMORPHISM %in% c("artifact", "consensus"))
    }

    n.row <- nrow(pi.data)
    split.vec <- as.integer(floor((parallel.core * 20 * (1:n.row - 1) / n.row) + 1))
    n.row <- NULL

    pi.data <- split(x = pi.data, f = split.vec) %>%
      .stackr_parallel_mc(
        X = ., FUN = separate_haplo, mc.cores = parallel.core) %>%
      dplyr::bind_rows(.)

    split.vec <- NULL

    res$individual.summary <- dplyr::full_join(
      res$individual.summary,
      pi.data %>%
        dplyr::mutate(
          PI = (stringdist::stringdist(a = ALLELE1, b = ALLELE2, method = "hamming")) / read.length
        ) %>%
        dplyr::group_by(INDIVIDUALS) %>%
        dplyr::summarise(PI = mean(PI, na.rm = TRUE))
      , by = "INDIVIDUALS")


    # Pi: by pop------------------------------------------------------------------
    message("    Pi calculations: populations...")
    pi.data <- tidyr::pivot_longer(
        data = pi.data,
        cols = -c("LOCUS", "INDIVIDUALS", "POP_ID"),
        names_to = "ALLELE_GROUP",
        values_to = "ALLELES"
      )

    pi.pop <- pi.data %>%
      split(x = ., f = .$POP_ID) %>%
      purrr::map_df(
        .x = ., .f = pi_pop,
        read.length = read.length, parallel.core = parallel.core
      )

    # Pi: overall  ---------------------------------------------------------------
    message("    Pi calculations: overall")
    pi.overall <- split(x = pi.data, f = pi.data$LOCUS)
    pi.data <- NULL
    pi.overall <- .stackr_parallel(
      X = pi.overall,
      FUN = pi,
      mc.cores = parallel.core,
      read.length = read.length
    ) %>%
      dplyr::bind_rows(.) %>%
      dplyr::summarise(PI_NEI = mean(PI, na.rm = TRUE)) %>%
      tibble::add_column(.data = ., POP_ID = "OVERALL", .before = "PI_NEI")

    # Combine the pop and overall data -------------------------------------------
    pi.overall <- suppressWarnings(
      dplyr::bind_rows(pi.pop, pi.overall))
    pi.pop <- NULL
    res$summary <- suppressWarnings(dplyr::full_join(res$summary, pi.overall, by = "POP_ID")) %>%
      dplyr::ungroup(.)
    pi.overall <- NULL

    res$summary <- tibble::add_column(
      .data = res$summary,
      PRIVATE_HAPLOTYPES = res$private.haplotypes.summary$n)

    if (is.null(whitelist.markers)) {
      filename.sum <- "haplotype.catalog.loci.summary.pop.tsv"
    } else {
      filename.sum <- "haplotype.catalog.loci.summary.pop.filtered.tsv"
    }
    readr::write_tsv(x = res$summary, path = stringi::stri_join(path.folder, "/",filename.sum))

    # Figures --------------------------------------------------------------------
    res$scatter.plot.Pi.Fh.ind <- dplyr::filter(res$individual.summary, POP_ID != "OVERALL") %>%
      ggplot2::ggplot(data = ., ggplot2::aes(x = FH, y = PI)) +
      ggplot2::geom_point(ggplot2::aes(colour = POP_ID)) +
      ggplot2::stat_smooth(method = stats::lm, level = 0.95, fullrange = FALSE, na.rm = TRUE) +
      ggplot2::labs(x = "Individual IBDg (FH)") +
      ggplot2::labs(y = "Individual nucleotide diversity (Pi)") +
      ggplot2::theme(
        axis.title.x = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
        axis.title.y = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
        legend.title = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
        legend.text = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
        strip.text.y = ggplot2::element_text(angle = 0, size = 10, family = "Helvetica", face = "bold"),
        strip.text.x = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold")
      )
    ggplot2::ggsave(
      filename = stringi::stri_join(path.folder, "/scatter.plot.Pi.Fh.ind.pdf"),
      plot = res$scatter.plot.Pi.Fh.ind,
      width = 20, height = 15,
      dpi = 600, units = "cm", useDingbats = FALSE)

    res$scatter.plot.Pi.Fh.pop <- dplyr::filter(res$summary, POP_ID != "OVERALL") %>%
      ggplot2::ggplot(data = ., ggplot2::aes(x = FH, y = PI_NEI)) +
      ggplot2::geom_point(ggplot2::aes(colour = POP_ID)) +
      ggplot2::stat_smooth(method = stats::lm, level = 0.95, fullrange = FALSE, na.rm = TRUE) +
      ggplot2::labs(x = "Individual IBDg (FH)") +
      ggplot2::labs(y = "Individual nucleotide diversity (Pi)") +
      ggplot2::theme(
        axis.title.x = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
        axis.title.y = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
        legend.title = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
        legend.text = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
        strip.text.y = ggplot2::element_text(angle = 0, size = 10, family = "Helvetica", face = "bold"),
        strip.text.x = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold")
      )
    ggplot2::ggsave(
      filename = stringi::stri_join(path.folder, "/scatter.plot.Pi.Fh.pop.pdf"),
      plot = res$scatter.plot.Pi.Fh.pop,
      width = 20, height = 15,
      dpi = 600, units = "cm", useDingbats = FALSE)


    res$boxplot.pi <- dplyr::filter(res$individual.summary, POP_ID != "OVERALL") %>%
      ggplot2::ggplot(data = ., ggplot2::aes(x = factor(POP_ID), y = PI, na.rm = TRUE)) +
      ggplot2::geom_violin(trim = F) +
      ggplot2::geom_boxplot(width = 0.1, fill = "black", outlier.colour = NA) +
      ggplot2::stat_summary(fun = "mean", geom = "point", shape = 21, size = 2.5, fill = "white") +
      ggplot2::labs(x = "Sampling sites") +
      ggplot2::labs(y = "Individual nucleotide diversity (Pi)") +
      ggplot2::theme(
        legend.position = "none",
        axis.title.x = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
        axis.title.y = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
        legend.title = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
        legend.text = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
        strip.text.x = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold")
      )
    ggplot2::ggsave(
      filename = stringi::stri_join(path.folder, "/boxplot.pi.pdf"),
      plot = res$boxplot.pi,
      width = 20, height = 15,
      dpi = 600, units = "cm", useDingbats = FALSE)

    res$boxplot.fh <- dplyr::filter(res$individual.summary, POP_ID != "OVERALL") %>%
      ggplot2::ggplot(data = ., ggplot2::aes(x = factor(POP_ID), y = FH, na.rm = T)) +
      ggplot2::geom_violin(trim = F) +
      ggplot2::geom_boxplot(width = 0.1, fill = "black", outlier.colour = NA) +
      ggplot2::stat_summary(fun = "mean", geom = "point", shape = 21, size = 2.5, fill = "white") +
      ggplot2::labs(x = "Sampling sites") +
      ggplot2::labs(y = "Individual IBDg (FH)") +
      ggplot2::theme(
        legend.position = "none",
        axis.title.x = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
        axis.title.y = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
        legend.title = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
        legend.text = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
        strip.text.x = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold")
      )
    ggplot2::ggsave(
      filename = stringi::stri_join(path.folder, "/boxplot.fh.pdf"),
      plot = res$boxplot.fh,
      width = 20, height = 15,
      dpi = 600, units = "cm", useDingbats = FALSE)
  }
  # results --------------------------------------------------------------------
  cat("############################### RESULTS ###############################\n")
  message("Number of populations: ", length(pop))
  message("Number of individuals: ", n.individuals)
  message("Number of loci in the catalog: ", n.markers)
  if (!is.null(res$artifacts.pop)) {
    message("    impacted by assembly artifacts and/or sequencing errors (loci > 2 alleles) = ", dplyr::n_distinct(res$artifacts.pop$LOCUS))
  }
  message("    with consensus alleles = ", dplyr::n_distinct(res$consensus.pop$LOCUS))
  if (!is.null(res$artifacts.pop)) {
    message("Number of artefactual genotypes/total genotypes (percent): ", erased.genotype.number, "/", total.genotype.number.haplo, " (", percent.haplo, ")")
  }
  message("\nFiles written: ")
  if (!is.null(res$artifacts.pop)) message(filename.artifacts.ind)
  if (!is.null(res$artifacts.pop)) message(filename.paralogs)
  if (!artifact.only) message(filename.sum)
  if (!is.null(res$consensus.loci)) message("blacklist.loci.consensus.txt")
  if (!artifact.only) message("private.haplotypes.tsv")
  if (!artifact.only) message("private.haplotypes.summary.tsv")
  timing <- proc.time() - timing
  message("\nComputation time: ", round(timing[[3]]), " sec")
  cat("############################## completed ##############################\n")
  options(width = opt.change)
  return(res)
}


# Internal nested Function -----------------------------------------------------

#' @title strata_haplo
#' @description Manage strata
#' @rdname strata_haplo
#' @keywords internal
#' @export
strata_haplo <- function(strata = NULL, data = NULL, blacklist.id = NULL) {

  if (is.null(strata)) {
    message("No strata file provided")
    message("    generating a strata with 1 grouping")
    if (is.null(data)) stop("data required to generate strata")
    strata.df <- readr::read_tsv(
      file = data,
      n_max = 1,
      na = "-",
      col_names = FALSE,
      col_types = readr::cols(.default = readr::col_character())) %>%
      tidyr::pivot_longer(
        data = .,
        cols = tidyselect::everything(),
        names_to = "DELETE",
        values_to = "INDIVIDUALS"
      ) %>%
      dplyr::mutate(INDIVIDUALS = clean_ind_names(INDIVIDUALS)) %>%
      dplyr::select(-DELETE) %>%
      dplyr::filter(!INDIVIDUALS %in% c("Catalog ID", "Cnt")) %>%
      dplyr::distinct(INDIVIDUALS) %>%
      dplyr::mutate(STRATA = rep("pop1", n()))
  } else {
    if (is.vector(strata)) {
      suppressMessages(
        strata.df <- readr::read_tsv(
          file = strata, col_names = TRUE,
          # col_types = col.types
          col_types = readr::cols(.default = readr::col_character())
        ))
    } else {
      strata.df <- strata
    }
  }

  colnames(strata.df) <- stringi::stri_replace_all_fixed(
    str = colnames(strata.df),
    pattern = "STRATA",
    replacement = "POP_ID",
    vectorize_all = FALSE
  )
  # Remove potential whitespace in pop_id
  strata.df$POP_ID <- clean_pop_names(strata.df$POP_ID)
  colnames.strata <- colnames(strata.df)

  # clean ids
  strata.df$INDIVIDUALS <- clean_ind_names(strata.df$INDIVIDUALS)

  # filtering the strata if blacklist id available
  if (!is.null(blacklist.id)) {
    strata.df <- dplyr::anti_join(x = strata.df, y = blacklist.id, by = "INDIVIDUALS")
  }

  strata.df <- dplyr::distinct(strata.df, POP_ID, INDIVIDUALS)
  return(strata.df)
}#End strata_haplo

#' @title Clean marker's name
#' @description Function to clean marker's name
#' of weird separators
#' that interfere with some packages
#' or codes. \code{/}, \code{:}, \code{-} and \code{.} are changed to an underscore
#' \code{_}.
#' @param x (character string) Markers character string.
#' @rdname clean_markers_names
#' @export
#' @keywords internal

clean_markers_names <- function(x) {
  x <- stringi::stri_replace_all_fixed(
    str = as.character(x),
    pattern = c("/", ":", "-", "."),
    replacement = "_",
    vectorize_all = FALSE)
}#End clean_markers_names

#' @title clean individual's name
#' @description function to clean individual's name
#' that interfere with some packages
#' or codes. \code{_} and \code{:} are changed to a dash \code{-}.
#' @param x (character string) Individuals character string.
#' @rdname clean_ind_names
#' @export
#' @keywords internal
clean_ind_names <- function(x) {
  x <- stringi::stri_replace_all_fixed(
    str = as.character(x),
    pattern = c("_", ":"),
    replacement = c("-", "-"),
    vectorize_all = FALSE)
}#End clean_ind_names

#' @title Clean population's name
#' @description Function to clean pop's name
#' that interfere with some packages
#' or codes. Space is changed to an underscore \code{_}.
#' @param x (character string) Population character string.
#' @rdname clean_pop_names
#' @export
#' @keywords internal
clean_pop_names <- function(x) {
  x <- stringi::stri_replace_all_fixed(
    str = as.character(x),
    pattern = " ",
    replacement = "_",
    vectorize_all = FALSE)
}#End clean_pop_names



#' @title loc_pop_stats
#' @rdname loc_pop_stats
#' @description locus stats per individual
#' @export
#' @keywords internal
loc_pop_stats <- function(x) {
  res <- dplyr::group_by(x, LOCUS, POP_ID) %>%
    dplyr::summarise(
      HOM = length(POLYMORPHISM[POLYMORPHISM == "hom"]),
      HET = length(POLYMORPHISM[POLYMORPHISM == "het"]),
      N_GENOT = HOM + HET,
      HOM_O = HOM / N_GENOT,
      HET_O = HET / N_GENOT
    ) %>%
    dplyr::mutate(
      NN = 2 * N_GENOT,
      N_INV = N_GENOT / (N_GENOT - 1)
    ) %>%
    dplyr::ungroup(.)
  return(res)
}#End individual_stats

#' @title separate_haplo_long
#' @rdname separate_haplo_long
#' @description separate haplotype and gather
#' @export
#' @keywords internal

separate_haplo_long <- function(x) {
  res <- dplyr::select(x, LOCUS, POP_ID, INDIVIDUALS, HAPLOTYPES)  %>%
    tidyr::separate(
      col = HAPLOTYPES, into = c("ALLELE1", "ALLELE2"),
      sep = "/", extra = "drop", remove = TRUE
    ) %>%
    dplyr::mutate(ALLELE2 = ifelse(is.na(ALLELE2), ALLELE1, ALLELE2)) %>%
    dplyr::select(-INDIVIDUALS) %>%
    tidyr::pivot_longer(
      data = .,
      cols = -c("LOCUS", "POP_ID"),
      names_to = "ALLELE_GROUP",
      values_to = "ALLELES"
    ) %>%
    dplyr::select(-ALLELE_GROUP)
  return(res)
}#End separate_haplo_long

#' @title separate_haplo
#' @rdname separate_haplo
#' @description separate haplotype. NO gather
#' @export
#' @keywords internal
separate_haplo <- function(x) {
  res <- dplyr::select(x, LOCUS, POP_ID, INDIVIDUALS, HAPLOTYPES)  %>%
    tidyr::separate(
      col = HAPLOTYPES, into = c("ALLELE1", "ALLELE2"),
      sep = "/", extra = "drop", remove = TRUE
    ) %>%
    dplyr::mutate(ALLELE2 = ifelse(is.na(ALLELE2), ALLELE1, ALLELE2))
  return(res)
}#End separate_haplo

# Pi function ----------------------------------------------------------------
#' @title pi
#' @rdname pi
#' @description Calculates pi
#' @export
#' @keywords internal
pi <- function(data, read.length) {
  # data <- dplyr::filter(data, LOCUS == "2")#test
  y <- dplyr::select(data, ALLELES) %>% purrr::flatten_chr(.)

  if (length(unique(y)) <= 1) {
    pi <- tibble::data_frame(PI = as.numeric(0))
  } else {

    #1 Get all pairwise comparison
    allele.pairwise <- utils::combn(unique(y), 2)

    #2 Calculate pairwise nucleotide mismatches
    pairwise.mismatches <- apply(allele.pairwise, 2, function(z) {
      stringdist::stringdist(a = z[1], b = z[2], method = "hamming")
    })

    #3 Calculate allele frequency
    allele.freq <- table(y) / length(y)

    #4 Calculate nucleotide diversity from pairwise mismatches and allele frequency
    pi <- apply(allele.pairwise, 2, function(y) allele.freq[y[1]] * allele.freq[y[2]])
    pi <- tibble::data_frame(PI = sum(pi * pairwise.mismatches) / read.length)
  }
  return(pi)
}#End pi


#' @title pi_pop
#' @rdname pi_pop
#' @description Calculates pi per pop
#' @export
#' @keywords internal
pi_pop <- function(data, read.length, parallel.core) {
  pop <- unique(data$POP_ID)
  message("    Pi calculations for pop: ", pop)
  # data <- df.split.pop[["DD"]]
  # data <- df.split.pop[["SKY"]]

  pi.pop <- split(x = data, f = data$LOCUS)
  pi.pop <- .stackr_parallel(
    X = pi.pop,
    FUN = pi,
    mc.cores = parallel.core,
    read.length = read.length
  ) %>%
    dplyr::bind_rows(.) %>%
    dplyr::summarise(PI_NEI = mean(PI, na.rm = TRUE)) %>%
    tibble::add_column(.data = ., POP_ID = pop, .before = "PI_NEI")

  return(pi.pop)
}#End pi_pop


# change POP_ID names

#' @name change_pop_names
#' @title Transform into a factor the POP_ID column, change names and reorder the levels
#' @description Transform into a factor the POP_ID column, change names and reorder the levels
#' @rdname change_pop_names
#' @export
#' @keywords internal

change_pop_names <- function(data, pop.levels = NULL, pop.labels = NULL) {

  # checks ---------------------------------------------------------------------
  if (missing(data)) stop("Input file missing")

  # POP_ID in gsi_sim does not like spaces, we need to remove space in everything touching POP_ID...

  # removing spaces in data$POP_ID, pop.levels and pop.labels
  if (!is.null(pop.levels) & is.null(pop.labels)) {
    pop.levels <- stringi::stri_replace_all_fixed(pop.levels, pattern = " ", replacement = "_", vectorize_all = FALSE)
    pop.labels <- pop.levels
  }

  if (!is.null(pop.labels) & is.null(pop.levels)) stop("pop.levels is required if you use pop.labels")

  if (!is.null(pop.labels)) {
    if (length(pop.labels) != length(pop.levels)) stop("pop.labels and pop.levels must have the same length (number of groups)")
    pop.labels <- stringi::stri_replace_all_fixed(pop.labels, pattern = " ", replacement = "_", vectorize_all = FALSE)
  }

  # in the data
  data$POP_ID <- stringi::stri_replace_all_fixed(data$POP_ID, pattern = " ", replacement = "_", vectorize_all = FALSE)


  # the number of pop in data == pop.levels
  if (!is.null(pop.levels)) {
    if (dplyr::n_distinct(data$POP_ID) != length(pop.levels)) {
      stop("The number of groups in your input file doesn't match the number of groups in pop.levels")
    }
  }
  # convert POP_ID to factor and change names-----------------------------------

  if (is.null(pop.levels)) { # no pop.levels
    data$POP_ID <- factor(data$POP_ID)
  } else {# with pop.levels
    data$POP_ID <- factor(x = data$POP_ID, levels = pop.levels, ordered = TRUE)
    levels(data$POP_ID) <- pop.labels
  }
  return(data)
}# end function change_pop_names
thierrygosselin/stackr documentation built on Nov. 11, 2020, 11 a.m.