## Summary and tables
#' @title Haplotypes file summary
#' @description STACKS batch_x.haplotypes.tsv file summary.
#' The output of the function is a summary table for populations with:
#' \enumerate{
#' \item assembly artifacts and/or sequencing errors: loci with > 2 alleles.
#' Genotypes with more than 2 alleles are erased before estimating the subsequent statistics.
#' \item consensus reads (reads with no variation/polymorphism throughout the dataset).
#' \item monomorphic loci (at the population and overall level)
#' \item polymorphic loci (at the population and overall level)
#' \item haplotypes statistics for the observed and expected homozygosity and
#' heterozygosity (HOM_O, HOM_E, HET_O, HET_E).
#' \item Gene Diversity within populations (Hs) and averaged over all populations.
#' Not to confuse with Expected Heterozygosity(HET_E).
#' Here, Hs statistic includes a correction for sampling bias stemming from
#' sampling a limited number of individuals per population (Nei, 1987).
#' \item Wright’s inbreeding coefficient (Fis)
#' \item Nei's inbreeding coefficient (Gis) that include a correction for sampling bias.
#' \item \strong{IBDG}: a proxy measure of the realized proportion of the genome
#' that is identical by descent
#' \item \strong{FH measure}: an individual-base estimate that is
#' based on the excess in the observed number of homozygous
#' genotypes within an individual relative to the mean number of homozygous
#' genotypes expected under random mating (Keller et al., 2011; Kardos et al., 2015).
#' \item \strong{Pi}: the nucleotide diversity (Nei & Li, 1979) measured here consider the
#' consensus reads in the catalog (no variation between population sequences).
#' The \code{read.length} argument is used directly in the calculations.
#' To be correctly estimated, the reads obviously need to be of identical size...
#' \item \strong{PRIVATE_HAPLOTYPES}: the number of private haplotypes.
#' }
#' @param data The 'batch_x.haplotypes.tsv' created by STACKS.
#' @param strata A tab delimited file with 2 columns with header:
#' \code{INDIVIDUALS} and \code{STRATA}.
#' The \code{STRATA} column can be any hierarchical grouping.
#' To create a strata file see \href{https://thierrygosselin.github.io/radiator/reference/individuals2strata.html}{individuals2strata}.
#' If you have already run
#' \href{http://catchenlab.life.illinois.edu/stacks/}{stacks} on your data,
#' the strata file is similar to a stacks `population map file`, make sure you
#' have the required column names (\code{INDIVIDUALS} and \code{STRATA}).
#' The strata also works as a whitelist of ids,
#' individuals not in the strata file will be removed from the dataset.
#' @param read.length (number) The length in nucleotide of your reads
#' (e.g. \code{read.length = 100}).
#' @param whitelist.markers (optional) A whitelist of loci with a column name
#' 'LOCUS'. The whitelist is located in the global environment or in the
#' directory (e.g. "whitelist.txt").
#' If a whitelist is used, the files written to the directory will have
#' '.filtered' in the filename.
#' Default: \code{whitelist.markers = NULL}.
#' @param blacklist.id (optional) A blacklist with individual ID and
#' a column header 'INDIVIDUALS'. The blacklist is in the global environment
#' or in the directory (with "blacklist.txt").
#' Default: \code{blacklist.id = NULL}.
#' @param keep.consensus (optional) Using whitelist of markers can automatically
#' remove consensus markers from the nucleotide diversity (Pi) calculations.
#' If you know what you are doing, play with this argument
#' to keep consensus markers.
#' Default: \code{keep.consensus = TRUE}.
#' @param pop.levels (optional, string) This refers to the levels in a factor. In this
#' case, the id of the pop.
#' Use this argument to have the pop ordered your way instead of the default
#' alphabetical or numerical order. e.g. \code{pop.levels = c("QUE", "ONT", "ALB")}
#' instead of the default \code{pop.levels = c("ALB", "ONT", "QUE")}.
#' White spaces in population names are replaced by underscore.
#' Default: \code{pop.levels = NULL}.
#' @param pop.labels (optional, string) Use this argument to rename/relabel
#' your pop or combine your pop. e.g. To combine \code{"QUE"} and \code{"ONT"}
#' into a new pop called \code{"NEW"}:
#' (1) First, define the levels for your pop with \code{pop.levels} argument:
#' \code{pop.levels = c("QUE", "ONT", "ALB")}.
#' (2) then, use \code{pop.labels} argument:
#' \code{pop.levels = c("NEW", "NEW", "ALB")}.
#' To rename \code{"QUE"} to \code{"TAS"}:
#' \code{pop.labels = c("TAS", "ONT", "ALB")}.
#' Default: \code{pop.labels = NULL}. If you find this too complicated,
#' there is also the \code{strata} argument that can do the same thing,
#' see below.
#' White spaces in population names are replaced by underscore.
#' @param artifact.only (optional, logical) With default \code{artifact.only = TRUE},
#' the function will stop after generating blacklists of artifacts and consensus loci
#' (no pi, no FH measures). This is faster and useful to run at the beginning of the pipeline.
#' Run again at the end, with blacklists and whitelists to get reliable estimates of
#' gene diverisity, IBDG and FH.
#' @param parallel.core (optional) The number of core for parallel
#' programming during Pi calculations.
#' Default: \code{parallel.core = parallel::detectCores() - 1}.
#' @return The function returns a list with:
#' \enumerate{
#' \item $consensus.pop # if consensus reads are found
#' \item $consensus.loci # if consensus reads are found
#' \item $artifacts.pop # if artifacts are found
#' \item $artifacts.loci # if artifacts are found
#' \item $individual.summary: the individual's info for FH and Pi
#' \item $summary: the summary statistics per populations and averaged over all pops.
#' \item $private.haplotypes: a tibble with LOCUS, POP_ID and private haplotypes
#' \item $private.haplotypes.summary: a summary of the number of private haplotypes per populations
#' }
#' Also in the list 3 plots (also written in the folder):
#' \enumerate{
#' \item $scatter.plot.Pi.Fh.ind: showing Pi and FH per individuals
#' \item $scatter.plot.Pi.Fh.pop: showing Pi and FH per pop
#' \item $boxplot.pi: showing the boxplot of Pi per pop
#' \item $boxplot.fh: showing the boxplot of FH per pop
#' }
#' use $ to access each #' objects in the list.
#' The function potentially write 4 files in the working directory:
#' blacklist of unique loci with assembly artifacts and/or sequencing errors
#' (per individuals and globally), a blacklist of unique consensus loci
#' and a summary of the haplotype file by population.
#' @examples
#' \dontrun{
#' # The simplest way to run the function:
#' sum <- summary_haplotypes(
#' data = "batch_1.haplotypes.tsv",
#' strata = "strata_brook_charr.tsv",
#' read.length = 90)
#' # if you want pi and fh calculated (ideally on filtered data):
#' sum <- summary_haplotypes(
#' data = "batch_1.haplotypes.tsv",
#' strata = "strata_brook_charr.tsv",
#' read.length = 90,
#' artifact.only = FALSE)
#' }
#' @rdname summary_haplotypes
#' @export
#' @references Keller MC, Visscher PM, Goddard ME (2011)
#' Quantification of inbreeding due to distant ancestors and its detection
#' using dense single nucleotide polymorphism data. Genetics, 189, 237–249.
#' @references Kardos M, Luikart G, Allendorf FW (2015)
#' Measuring individual inbreeding in the age of genomics: marker-based
#' measures are better than pedigrees. Heredity, 115, 63–72.
#' @references Nei M, Li WH (1979)
#' Mathematical model for studying genetic variation in terms of
#' restriction endonucleases.
#' Proceedings of the National Academy of Sciences of
#' the United States of America, 76, 5269–5273.
#' @author Thierry Gosselin \email{thierrygosselin@@icloud.com} and
#' Anne-Laure Ferchaud \email{annelaureferchaud@@gmail.com}
summary_haplotypes <- function(
data,
strata = NULL,
read.length,
whitelist.markers = NULL,
blacklist.id = NULL,
keep.consensus = TRUE,
pop.levels = NULL,
pop.labels = NULL,
artifact.only = TRUE,
parallel.core = parallel::detectCores() - 1
) {
cat("#######################################################################\n")
cat("#################### stackr::summary_haplotypes #######################\n")
cat("#######################################################################\n")
timing <- proc.time()
opt.change <- getOption("width")
options(width = 70)
res <- list() # to store results
if (artifact.only) {
message("artifact.only: TRUE
the function will only output artifacts and consensus blacklists.\n")
} else {
message("artifact.only: FALSE
the function will run and compute all summary statistics, time for coffee...\n")
}
if (missing(data)) stop("data argument is missing")
if (missing(read.length)) stop("read.length argument is required")
if (!is.null(pop.levels) & is.null(pop.labels)) {
pop.levels <- stringi::stri_replace_all_fixed(pop.levels, pattern = " ", replacement = "_", vectorize_all = FALSE)
pop.labels <- pop.levels
}
if (!is.null(pop.labels) & is.null(pop.levels)) stop("pop.levels is required if you use pop.labels")
if (!is.null(pop.labels)) {
if (length(pop.labels) != length(pop.levels)) stop("pop.levels and pop.labels with different length: check arguments")
pop.labels <- stringi::stri_replace_all_fixed(pop.labels, pattern = " ", replacement = "_", vectorize_all = FALSE)
}
# Date and time --------------------------------------------------------------
file.date <- format(Sys.time(), "%Y%m%d@%H%M")
folder.extension <- stringi::stri_join("summary_haplotypes_", file.date, sep = "")
path.folder <- file.path(getwd(), folder.extension)
dir.create(path.folder)
message("Folder created: \n", folder.extension)
file.date <- NULL #unused object
# Import and filter blacklist id ---------------------------------------------
if (!is.null(blacklist.id)) {
blacklist.id <- suppressMessages(readr::read_tsv(blacklist.id, col_names = TRUE))
message("Filtering with blacklist of individuals: ", nrow(blacklist.id), " individual(s) blacklisted")
blacklist.id$INDIVIDUALS <- clean_ind_names(blacklist.id$INDIVIDUALS)
}
# STRATA ---------------------------------------------------------------------
strata.df <- strata_haplo(strata = strata, data = data, blacklist.id = blacklist.id)
# Import haplotype file ------------------------------------------------------
message("Importing and tidying STACKS haplotype file: ", data)
# import header row
want <- tibble::data_frame(
INFO = "CATALOG",
COL_TYPE = "c") %>%
dplyr::bind_rows(
dplyr::select(strata.df, INFO = INDIVIDUALS) %>%
dplyr::mutate(
COL_TYPE = rep("c", n()),
INFO = clean_ind_names(INFO)
))
haplo.col.type <- readr::read_tsv(
file = data,
n_max = 1,
na = "-",
col_names = FALSE,
col_types = readr::cols(.default = readr::col_character())) %>%
tidyr::pivot_longer(
data = .,
cols = tidyselect::everything(),
names_to = "DELETE",
values_to = "INFO"
) %>%
dplyr::mutate(INFO = clean_ind_names(INFO)) %>%
dplyr::select(-DELETE) %>%
dplyr::mutate(INFO = clean_ind_names(INFO)) %>%
dplyr::left_join(want, by = "INFO") %>%
dplyr::mutate(COL_TYPE = stringi::stri_replace_na(str = COL_TYPE, replacement = "_")) %>%
dplyr::select(COL_TYPE)
haplo.col.type[1,1] <- "c"
haplo.col.type <- purrr::flatten_chr(haplo.col.type) %>% stringi::stri_join(collapse = "")
# readr now faster/easier than fread...
haplotype <- readr::read_tsv(
file = data, col_names = TRUE, na = "-",
col_types = haplo.col.type)
colnames(haplotype) <- stringi::stri_replace_all_fixed(
str = colnames(haplotype),
pattern = c("# Catalog ID", "Catalog ID", "# Catalog Locus ID"),
replacement = c("LOCUS", "LOCUS", "LOCUS"), vectorize_all = FALSE)
if (tibble::has_name(haplotype, "Seg Dist")) {
haplotype <- dplyr::select(.data = haplotype, -`Seg Dist`)
}
haplotype <- tidyr::pivot_longer(
data = haplotype,
cols = -LOCUS,
names_to = "INDIVIDUALS",
values_to = "HAPLOTYPES"
)
haplotype$INDIVIDUALS <- clean_ind_names(haplotype$INDIVIDUALS)
want <- haplo.col.type <- NULL
# Import whitelist of markers-------------------------------------------------
if (is.null(whitelist.markers)) {
haplotype <- dplyr::mutate(haplotype, WHITELIST = rep("whitelist", n()))
} else {
whitelist.markers <- suppressMessages(readr::read_tsv(whitelist.markers, col_names = TRUE))
# Check that whitelist has "LOCUS" column
if (!tibble::has_name(whitelist.markers, "LOCUS")) {
stop("Whitelist requires a LOCUS column, check argument documentation...")
}
whitelist.markers <- dplyr::select(.data = whitelist.markers, LOCUS) %>%
dplyr::mutate(LOCUS = as.character(LOCUS)) %>%
dplyr::arrange(as.integer(LOCUS))
haplotype <- haplotype %>%
dplyr::mutate(
WHITELIST = dplyr::if_else(LOCUS %in% whitelist.markers$LOCUS,
"whitelist", "blacklist")
)
whitelist.markers <- "whitelist"
}
# Population levels and strata -----------------------------------------------
message("Making the haplotype file population-wise...")
haplotype <- dplyr::left_join(x = haplotype, y = strata.df, by = "INDIVIDUALS")
strata.df <- NULL
# using pop.levels and pop.labels info if present
haplotype <- change_pop_names(data = haplotype, pop.levels = pop.levels, pop.labels = pop.labels)
pop <- unique(haplotype$POP_ID)
# Locus with consensus alleles -----------------------------------------------
message("Scanning for consensus markers...")
consensus.pop <- haplotype %>%
dplyr::filter(HAPLOTYPES == "consensus") %>%
dplyr::distinct(LOCUS, POP_ID) %>%
dplyr::mutate(CONSENSUS = rep("consensus", times = n())) %>%
dplyr::arrange(as.numeric(LOCUS), POP_ID)
if (length(consensus.pop$CONSENSUS) > 0) {
# Create a list of consensus loci
blacklist.loci.consensus <- consensus.pop %>%
dplyr::ungroup(.) %>%
dplyr::distinct(LOCUS) %>%
dplyr::arrange(as.numeric(LOCUS))
message(" Generated a file with ", nrow(blacklist.loci.consensus), " consensus loci")
message(" Details in: blacklist.loci.consensus.txt")
readr::write_tsv(
x = blacklist.loci.consensus,
path = stringi::stri_join(path.folder, "/blacklist.loci.consensus.txt"),
col_names = TRUE
)
haplotype <- dplyr::mutate(
haplotype,
CONSENSUS = dplyr::if_else(
LOCUS %in% blacklist.loci.consensus$LOCUS,
"consensus", "not_consensus"))
blacklist.loci.consensus.sum <- blacklist.loci.consensus %>%
dplyr::ungroup(.) %>%
dplyr::summarise(CONSENSUS = n()) %>%
dplyr::select(CONSENSUS)
consensus <- consensus.pop %>%
dplyr::group_by(POP_ID) %>%
dplyr::summarise(CONSENSUS = dplyr::n_distinct(LOCUS)) %>%
dplyr::ungroup(.)
res$consensus.pop <- consensus.pop
res$consensus.loci <- blacklist.loci.consensus
consensus.pop <- blacklist.loci.consensus <- NULL
} else {
res$consensus.pop <- NULL
res$consensus.loci <- NULL
blacklist.loci.consensus.sum <- tibble::data_frame(CONSENSUS = as.integer(0))
consensus <- tibble::data_frame(
POP_ID = pop,
CONSENSUS = as.integer(rep(0, length(pop)))
)
haplotype <- dplyr::mutate(haplotype, CONSENSUS = rep("not_consensus", n()))
}
# samples per pop
sample.number <- dplyr::distinct(haplotype, INDIVIDUALS, POP_ID) %>%
dplyr::group_by(POP_ID) %>%
dplyr::tally(.) %>%
dplyr::rename(NUM = n) %>%
dplyr::ungroup(.) %>%
tibble::add_row(.data = ., POP_ID = "OVERALL", NUM = sum(.$NUM))
# Sequencing errors and assembly artifact ------------------------------------
# Locus with > 2 alleles by individuals
message("Scanning for assembly artifacts...")
# haplotype <- haplotype %>%
# dplyr::mutate(
# ARTIFACTS = stringi::stri_count_fixed(HAPLOTYPES, "/"),
# ARTIFACTS = dplyr::if_else(ARTIFACTS > 1, "artifact", "not_artifact"))
haplotype <- haplotype %>%
dplyr::mutate(
POLYMORPHISM = stringi::stri_count_fixed(HAPLOTYPES, "/"),
POLYMORPHISM = dplyr::if_else(
POLYMORPHISM > 1, "artifact",
dplyr::if_else(
POLYMORPHISM == 1, "het", "hom", missing = "missing")),
POLYMORPHISM = dplyr::if_else(CONSENSUS == "consensus", "consensus", POLYMORPHISM),
POLYMORPHISM = dplyr::if_else(stringi::stri_detect_fixed(HAPLOTYPES, "N"),
"artifact", POLYMORPHISM)
) %>%
dplyr::select(POP_ID, INDIVIDUALS, LOCUS, HAPLOTYPES, WHITELIST, POLYMORPHISM)
# artifacts.ind <- dplyr::filter(
# haplotype,
# WHITELIST == "whitelist",
# CONSENSUS == "not_consensus")
artifacts.ind <- dplyr::filter(
haplotype,
WHITELIST == "whitelist",
POLYMORPHISM != "consensus")
total.genotype.number.haplo <- length(stats::na.omit(artifacts.ind$HAPLOTYPES))
artifacts.ind <- artifacts.ind %>%
dplyr::filter(POLYMORPHISM == "artifact") %>%
dplyr::arrange(LOCUS, POP_ID, INDIVIDUALS) %>%
dplyr::select(LOCUS, POP_ID, INDIVIDUALS, HAPLOTYPES, ARTIFACTS = POLYMORPHISM)
if (nrow(artifacts.ind) > 0) {
erased.genotype.number <- length(artifacts.ind$HAPLOTYPES)
percent.haplo <- paste(round(((erased.genotype.number/total.genotype.number.haplo)*100), 2), "%", sep = " ")
# Write the list of locus, individuals with artifacts
if (is.null(whitelist.markers)) {
filename.artifacts.ind <- "blacklist.loci.artifacts.ind.txt"
} else {
filename.artifacts.ind <- "blacklist.loci.artifacts.ind.filtered.txt"
}
readr::write_tsv(
x = artifacts.ind,
path = stringi::stri_join(path.folder, "/", filename.artifacts.ind),
col_names = TRUE)
message(" Generated a file with ",
nrow(artifacts.ind),
" artifact(s) genotype(s) by individuals")
message(" Details in: ", filename.artifacts.ind)
# res$artifacts.pop <- dplyr::ungroup(artifacts.ind) %>%
# dplyr::distinct(LOCUS, POP_ID, ARTIFACTS) %>%
# dplyr::arrange(as.numeric(LOCUS), POP_ID)
res$artifacts.pop <- dplyr::ungroup(artifacts.ind) %>%
dplyr::distinct(LOCUS, POP_ID, ARTIFACTS) %>%
dplyr::arrange(as.numeric(LOCUS), POP_ID)
artifacts <- res$artifacts.pop %>%
dplyr::group_by(POP_ID) %>%
dplyr::tally(.) %>%
dplyr::rename(ARTIFACTS = n)
res$artifacts.loci <- dplyr::ungroup(res$artifacts.pop) %>%
dplyr::distinct(LOCUS) %>%
dplyr::arrange(as.numeric(LOCUS))
blacklist.loci.artifacts.sum <- res$artifacts.loci %>%
dplyr::ungroup(.) %>%
dplyr::summarise(ARTIFACTS = n()) %>%
dplyr::select(ARTIFACTS)
# Write the unique list of paralogs blacklisted to a file
if (is.null(whitelist.markers)) {
filename.paralogs <- "blacklist.loci.artifacts.txt"
} else {
filename.paralogs <- "blacklist.loci.artifacts.filtered.txt"
}
readr::write_tsv(
x = res$artifacts.loci,
path = stringi::stri_join(path.folder, "/", filename.paralogs),
col_names = TRUE
)
message(" Generated a file with ", nrow(res$artifacts.loci),
" artifact loci")
message(" Details in: ", filename.paralogs)
message(" Out of a total of ", total.genotype.number.haplo, " genotypes")
message(" ", percent.haplo, " (", erased.genotype.number, ")"," outlier genotypes will be erased")
haplotype <- suppressWarnings(
haplotype %>%
dplyr::mutate(
HAPLOTYPES = dplyr::if_else(POLYMORPHISM == "artifact", NA_character_, HAPLOTYPES)
))
} else {
res$artifacts.pop <- res$artifacts.loci <- NULL
artifacts <- tibble::data_frame(POP_ID = pop, ARTIFACTS = as.integer(rep(0, length(pop))))
blacklist.loci.artifacts.sum <- tibble::data_frame(ARTIFACTS = as.integer(0))
# haplotype <- dplyr::mutate(haplotype, ARTIFACTS = rep("not_artifact", n()))
}
artifacts.ind <- NULL
# bind info
consensus.artifacts <- dplyr::full_join(artifacts, consensus, by = "POP_ID")
artifacts <- consensus <- NULL
haplotype.filtered <- haplotype %>%
dplyr::filter(WHITELIST == "whitelist", POLYMORPHISM != "not_consensus")
n.individuals <- dplyr::n_distinct(haplotype.filtered$INDIVIDUALS)
n.markers <- dplyr::n_distinct(haplotype.filtered$LOCUS)
if (!artifact.only) {
# FH ------------------------------------------------------------------------
message("Genome-Wide Identity-By-Descent calculations (FH)...")
# summary.ind <- dplyr::mutate(
# .data = haplotype.filtered,
# IND_LEVEL_POLYMORPHISM = dplyr::if_else(
# stringi::stri_count_fixed(HAPLOTYPES, "/") == 1,
# "het", "hom", missing = "missing"))
# summary <- dplyr::mutate(
# .data = haplotype.filtered,
# IND_LEVEL_POLYMORPHISM = dplyr::if_else(
# stringi::stri_count_fixed(HAPLOTYPES, "/") == 1,
# "het", "hom", missing = "missing"))
# Some could start averaging by individuals accross markers and then by populations...
# GenoDive, Hierfstat and Nei computes by locus...
if (n.markers < 5000) {
summary.loc.pop <- haplotype.filtered %>%
dplyr::group_by(LOCUS, POP_ID) %>%
dplyr::summarise(
HOM = length(POLYMORPHISM[POLYMORPHISM == "hom"]),
HET = length(POLYMORPHISM[POLYMORPHISM == "het"]),
N_GENOT = HOM + HET,
HOM_O = HOM / N_GENOT,
HET_O = HET / N_GENOT
) %>%
dplyr::mutate(
NN = 2 * N_GENOT,
N_INV = N_GENOT / (N_GENOT - 1)
) %>%
dplyr::ungroup(.)
} else {
split.vec <- dplyr::distinct(haplotype.filtered, LOCUS) %>%
dplyr::mutate(SPLIT_VEC = as.integer(floor((parallel.core * 3 * (1:n.markers - 1) / n.markers) + 1)))
# split_vec_row
## Need to upgrade to split_tibble_rows
# summary.loc.pop <- split_tibble_rows()
summary.loc.pop <- haplotype.filtered %>%
dplyr::left_join(split.vec, by = "LOCUS") %>%
split(x = ., f = .$SPLIT_VEC) %>%
.stackr_parallel(
X = .,
FUN = loc_pop_stats,
mc.cores = parallel.core
) %>%
dplyr::bind_rows(.)
split.vec <- NULL
}
# HS Gene Diversity within populations ---------------------------------------
# not the same as Expected Heterozygosity
# haplotype.filtered.genotyped <- haplotype.filtered %>% dplyr::filter(!is.na(HAPLOTYPES))
n.row <- nrow(haplotype.filtered)
split.vec <- as.integer(floor((parallel.core * 20 * (1:n.row - 1) / n.row) + 1))
n.row <- NULL
# split.vec.locus <- dplyr::distinct(haplotype.filtered, LOCUS) %>%
# dplyr::mutate(
# SPLIT_VEC = factor(as.integer(floor((parallel.core * 20 * (1:n() - 1) / n()) + 1)))
# )
haplotype.filtered.split <- split(x = haplotype.filtered, f = split.vec) %>%
.stackr_parallel_mc(
X = ., FUN = separate_haplo_long, mc.cores = parallel.core) %>%
dplyr::bind_rows(.)
hs <- haplotype.filtered.split %>%
dplyr::group_by(LOCUS, POP_ID) %>%
dplyr::count(ALLELES) %>%
dplyr::summarise(HOM_E = sum((n / sum(n))^2)) %>%
dplyr::ungroup(.) %>%
dplyr::full_join(dplyr::select(summary.loc.pop, LOCUS, POP_ID, HET_O, NN, N_INV), by = c("LOCUS", "POP_ID")) %>%
dplyr::group_by(LOCUS, POP_ID) %>%
dplyr::mutate(HS = N_INV * (1 - HOM_E - HET_O / NN)) %>%
dplyr::ungroup(.)
summary.pop <- summary.loc.pop %>%
dplyr::group_by(POP_ID) %>%
dplyr::summarise(
HOM_O = mean(HOM_O, na.rm = TRUE),
HET_O = mean(HET_O, na.rm = TRUE)
) %>%
dplyr::ungroup(.) %>%
dplyr::full_join(
dplyr::group_by(.data = hs, POP_ID) %>%
dplyr::summarise(
HOM_E = mean(HOM_E, na.rm = TRUE),
HET_E = (1 - HOM_E),
HS = mean(HS, na.rm = TRUE)
) %>%
dplyr::ungroup(.)
, by = "POP_ID")
summary.locus <- summary.loc.pop %>%
dplyr::group_by(LOCUS) %>%
dplyr::summarise(
HOM_O = mean(HOM_O, na.rm = TRUE),
HET_O = mean(HET_O, na.rm = TRUE)
) %>%
dplyr::ungroup(.) %>%
dplyr::full_join(
dplyr::group_by(.data = hs, LOCUS) %>%
dplyr::summarise(
HOM_E = mean(HOM_E, na.rm = TRUE),
HET_E = (1 - HOM_E),
HS = mean(HS, na.rm = TRUE)
) %>%
dplyr::ungroup(.)
, by = "LOCUS")
hs <- summary.loc.pop <- NULL
summary.ind <- haplotype.filtered %>%
dplyr::group_by(INDIVIDUALS, POP_ID) %>%
dplyr::summarise(
HOM = length(POLYMORPHISM[POLYMORPHISM == "hom"]),
HET = length(POLYMORPHISM[POLYMORPHISM == "het"]),
N_GENOT = HOM + HET,
HOM_O = HOM / N_GENOT,
HET_O = HET / N_GENOT
) %>%
dplyr::ungroup(.)
haplotype.filtered <- NULL
# required functions
# freq_hom <- function(x) {
# res <- dplyr::select(x, -SPLIT_VEC) %>%
# dplyr::group_by(LOCUS, POP_ID, ALLELES) %>%
# dplyr::summarise(
# FREQ_ALLELES = length(ALLELES)/mean(DIPLO),
# HOM_E = FREQ_ALLELES * FREQ_ALLELES
# ) %>%
# dplyr::select(-FREQ_ALLELES) %>%
# dplyr::ungroup(.)
# return(res)
# }#End freq_hom
# freq.hom.pop <- haplotype.filtered %>%
# dplyr::filter(!is.na(HAPLOTYPES))
# diplo <- freq.hom.pop %>%
# dplyr::distinct(LOCUS, POP_ID, INDIVIDUALS) %>%
# dplyr::group_by(LOCUS, POP_ID) %>%
# dplyr::tally(.) %>%
# dplyr::ungroup(.) %>%
# dplyr::mutate(n = 2 * n) %>%
# dplyr::rename(DIPLO = n)
# n.row <- nrow(haplotype.filtered.split)
# split.vec <- as.integer(floor((parallel.core * 20 * (1:n.row - 1) / n.row) + 1))
# n.row <- NULL
# split.vec.locus <- dplyr::distinct(haplotype.filtered.split, LOCUS) %>%
# dplyr::mutate(
# SPLIT_VEC = factor(as.integer(floor((parallel.core * 20 * (1:n() - 1) / n()) + 1)))
# )
# # freq.hom.pop <- split(x = freq.hom.pop, f = split.vec) %>%
# # .stackr_parallel_mc(
# # X = ., FUN = separate_haplo, mc.cores = parallel.core) %>%
# # dplyr::bind_rows(.) %>%
# # dplyr::left_join(diplo, by = c("LOCUS", "POP_ID")) %>%
# # dplyr::left_join(split.vec.locus, by = "LOCUS") %>%
# # split(x = ., f = .$SPLIT_VEC) %>%
# # .stackr_parallel_mc(
# # X = ., FUN = freq_hom, mc.cores = parallel.core) %>%
# # dplyr::bind_rows(.) %>%
# # dplyr::group_by(LOCUS, POP_ID) %>%
# # dplyr::summarise(HOM_E = sum(HOM_E)) %>%
# # dplyr::group_by(POP_ID) %>%
# # dplyr::summarise(HOM_E = mean(HOM_E))
#
#
# freq.hom.pop <- hs %>%
# dplyr::group_by(POP_ID) %>%
# dplyr::summarise(HOM_E = mean(HOM_E))
split.vec <- NULL
# diplo <- split.vec.locus <- split.vec <- NULL
# IBDg with FH ---------------------------------------------------------------
# ok so here we need the info by id
# fh.i <- suppressWarnings(
# summary.ind %>%
# dplyr::full_join(freq.hom.pop, by = "POP_ID") %>%
# dplyr::mutate(FH = ((HOM_O - HOM_E)/(N_GENOT - HOM_E))) %>%
# dplyr::select(INDIVIDUALS, POP_ID, FH)
# )
res$individual.summary <- suppressWarnings(
summary.ind %>%
dplyr::full_join(dplyr::select(summary.pop, -HOM_O, -HET_O), by = "POP_ID") %>%
dplyr::mutate(FH = ((HOM_O - HOM_E)/(N_GENOT - HOM_E))) %>%
dplyr::select(INDIVIDUALS, POP_ID, FH)
)
summary.ind <- NULL
summary.pop <- summary.pop %>%
dplyr::full_join(
dplyr::group_by(.data = res$individual.summary, POP_ID) %>% dplyr::summarise(FH = mean(FH, na.rm = TRUE))
, by = "POP_ID") %>%
dplyr::mutate(
FIS = dplyr::if_else(
HET_O == 0, 0, round(((HET_E - HET_O) / HET_E), 6)),
GIS = dplyr::if_else(
HET_O == 0, 0, round(1 - HET_O / HS, 6))
) %>%
dplyr::select(POP_ID, HOM_O, HOM_E, HET_O, HET_E, HS, FIS, GIS, FH)
summary.locus <- summary.locus %>%
dplyr::mutate(
FIS = dplyr::if_else(
HET_O == 0, 0, round(((HET_E - HET_O) / HET_E), 6)),
GIS = dplyr::if_else(
HET_O == 0, 0, round(1 - HET_O / HS, 6))
)
summary.locus <- NULL
summary.overall <- suppressWarnings(dplyr::full_join(
sample.number,
dplyr::bind_rows(
summary.pop,
dplyr::summarise_if(
.tbl = summary.pop, .predicate = is.numeric, .funs = mean, na.rm = TRUE) %>%
tibble::add_column(.data = ., POP_ID = "OVERALL", .before = 1)
), by = "POP_ID"))
sample.number <- summary.pop <- NULL
# fh.tot <- fh.i %>%
# dplyr::summarise(
# HOM_O = round(mean(HOM_O), 6),
# HOM_E = round(mean(HOM_E), 6),
# HET_O = round(mean(1 - HOM_O), 6),
# HET_E = round(mean(1 - HOM_E), 6),
# FIS = ifelse(HET_O == 0, 0, round(((HET_E - HET_O) / HET_E), 6)),
# FH = mean(FH)
# )
# fh.tot <- tibble::data_frame(POP_ID = "OVERALL") %>%
# dplyr::bind_cols(fh.tot)
# fh.res <- suppressWarnings(dplyr::bind_rows(fh.pop, fh.tot) %>% dplyr::select(-POP_ID))
res$summary <- suppressWarnings(
dplyr::bind_rows(
haplotype.filtered.split %>%
dplyr::distinct(LOCUS, POP_ID, ALLELES) %>%
dplyr::group_by(LOCUS, POP_ID) %>%
dplyr::summarise(ALLELES_COUNT = length(ALLELES)) %>%
dplyr::group_by(POP_ID) %>%
dplyr::summarise(
MONOMORPHIC = length(ALLELES_COUNT[ALLELES_COUNT == 1]),
POLYMORPHIC = length(ALLELES_COUNT[ALLELES_COUNT >= 2])
) %>%
dplyr::full_join(consensus.artifacts, by = "POP_ID"),
haplotype.filtered.split %>%
dplyr::distinct(LOCUS, ALLELES) %>%
dplyr::group_by(LOCUS) %>%
dplyr::summarise(ALLELES_COUNT_OVERALL = length(ALLELES)) %>%
dplyr::ungroup(.) %>%
dplyr::summarise(
MONOMORPHIC = length(ALLELES_COUNT_OVERALL[ALLELES_COUNT_OVERALL == 1]),
POLYMORPHIC = length(ALLELES_COUNT_OVERALL[ALLELES_COUNT_OVERALL >= 2])
) %>%
dplyr::bind_cols(blacklist.loci.consensus.sum) %>%
dplyr::bind_cols(blacklist.loci.artifacts.sum) %>%
tibble::add_column(.data = ., POP_ID = "OVERALL", .before = 1)))
if (keep.consensus) {
res$summary <- res$summary %>%
dplyr::group_by(POP_ID) %>%
dplyr::mutate(
TOTAL = MONOMORPHIC + POLYMORPHIC + CONSENSUS + ARTIFACTS,
MONOMORPHIC_PROP = round(MONOMORPHIC / TOTAL, 4),
POLYMORPHIC_PROP = round(POLYMORPHIC / TOTAL, 4),
CONSENSUS_PROP = round(CONSENSUS / TOTAL, 4),
ARTIFACTS_PROP = round(ARTIFACTS / TOTAL, 4)
)
} else {
res$summary <- res$summary %>%
dplyr::select(-CONSENSUS) %>%
dplyr::group_by(POP_ID) %>%
dplyr::mutate(
TOTAL = MONOMORPHIC + POLYMORPHIC + ARTIFACTS,
MONOMORPHIC_PROP = round(MONOMORPHIC / TOTAL, 4),
POLYMORPHIC_PROP = round(POLYMORPHIC / TOTAL, 4),
ARTIFACTS_PROP = round(ARTIFACTS / TOTAL, 4)
)
}
res$summary <- suppressWarnings(
dplyr::full_join(res$summary, summary.overall, by = "POP_ID"))
consensus.artifacts <- blacklist.loci.consensus.sum <- blacklist.loci.artifacts.sum <- summary.overall <- NULL
# Private haplotypes -------------------------------------------------------
haplotype.filtered.split <- haplotype.filtered.split %>%
dplyr::select(LOCUS, POP_ID, HAPLOTYPES = ALLELES) %>%
dplyr::filter(!is.na(HAPLOTYPES)) %>%
dplyr::distinct(LOCUS, POP_ID, HAPLOTYPES)
res$private.haplotypes <- haplotype.filtered.split %>%
dplyr::group_by(LOCUS, HAPLOTYPES) %>%
dplyr::tally(.) %>%
dplyr::filter(n == 1) %>%
dplyr::select(-n) %>%
dplyr::ungroup(.) %>%
dplyr::left_join(haplotype.filtered.split, by = c("LOCUS", "HAPLOTYPES")) %>%
dplyr::select(LOCUS, POP_ID, HAPLOTYPES) %>%
readr::write_tsv(x = ., path = file.path(path.folder, "private.haplotypes.tsv"))
res$private.haplotypes.summary <- res$private.haplotypes %>%
dplyr::group_by(POP_ID) %>%
dplyr::tally(.) %>%
dplyr::ungroup(.) %>%
tibble::add_row(.data = ., POP_ID = "OVERALL", n = sum(.$n)) %>%
dplyr::mutate(PRIVATE_HAPLOTYPES = stringi::stri_join(POP_ID, " = ", n)) %>%
readr::write_tsv(x = ., path = file.path(path.folder, "private.haplotypes.summary.tsv"))
message("Number of private haplotypes = ", nrow(res$private.haplotypes))
message("Strata with the highest number of private haplotypes = ", res$private.haplotypes.summary$POP_ID[res$private.haplotypes.summary$n == max(res$private.haplotypes.summary$n)])
message("Number of private haplotype(s) per strata:\n", stringi::stri_join(res$private.haplotypes.summary$PRIVATE_HAPLOTYPES, collapse = "\n"))
haplotype.filtered.split <- NULL
# Nei & Li 1979 Nucleotide Diversity ---------------------------------------
message("Nucleotide diversity (Pi):")
message(" Read length used: ", read.length)
# Pi: by individuals--------------------------------------------------------
message(" Pi calculations: individuals...")
if (keep.consensus) {
pi.data <- dplyr::filter(
haplotype,
!is.na(HAPLOTYPES),
WHITELIST == "whitelist" | !POLYMORPHISM %in% "artifact")
} else {
pi.data <- dplyr::filter(
haplotype,
!is.na(HAPLOTYPES),
WHITELIST == "whitelist",
!POLYMORPHISM %in% c("artifact", "consensus"))
}
n.row <- nrow(pi.data)
split.vec <- as.integer(floor((parallel.core * 20 * (1:n.row - 1) / n.row) + 1))
n.row <- NULL
pi.data <- split(x = pi.data, f = split.vec) %>%
.stackr_parallel_mc(
X = ., FUN = separate_haplo, mc.cores = parallel.core) %>%
dplyr::bind_rows(.)
split.vec <- NULL
res$individual.summary <- dplyr::full_join(
res$individual.summary,
pi.data %>%
dplyr::mutate(
PI = (stringdist::stringdist(a = ALLELE1, b = ALLELE2, method = "hamming")) / read.length
) %>%
dplyr::group_by(INDIVIDUALS) %>%
dplyr::summarise(PI = mean(PI, na.rm = TRUE))
, by = "INDIVIDUALS")
# Pi: by pop------------------------------------------------------------------
message(" Pi calculations: populations...")
pi.data <- tidyr::pivot_longer(
data = pi.data,
cols = -c("LOCUS", "INDIVIDUALS", "POP_ID"),
names_to = "ALLELE_GROUP",
values_to = "ALLELES"
)
pi.pop <- pi.data %>%
split(x = ., f = .$POP_ID) %>%
purrr::map_df(
.x = ., .f = pi_pop,
read.length = read.length, parallel.core = parallel.core
)
# Pi: overall ---------------------------------------------------------------
message(" Pi calculations: overall")
pi.overall <- split(x = pi.data, f = pi.data$LOCUS)
pi.data <- NULL
pi.overall <- .stackr_parallel(
X = pi.overall,
FUN = pi,
mc.cores = parallel.core,
read.length = read.length
) %>%
dplyr::bind_rows(.) %>%
dplyr::summarise(PI_NEI = mean(PI, na.rm = TRUE)) %>%
tibble::add_column(.data = ., POP_ID = "OVERALL", .before = "PI_NEI")
# Combine the pop and overall data -------------------------------------------
pi.overall <- suppressWarnings(
dplyr::bind_rows(pi.pop, pi.overall))
pi.pop <- NULL
res$summary <- suppressWarnings(dplyr::full_join(res$summary, pi.overall, by = "POP_ID")) %>%
dplyr::ungroup(.)
pi.overall <- NULL
res$summary <- tibble::add_column(
.data = res$summary,
PRIVATE_HAPLOTYPES = res$private.haplotypes.summary$n)
if (is.null(whitelist.markers)) {
filename.sum <- "haplotype.catalog.loci.summary.pop.tsv"
} else {
filename.sum <- "haplotype.catalog.loci.summary.pop.filtered.tsv"
}
readr::write_tsv(x = res$summary, path = stringi::stri_join(path.folder, "/",filename.sum))
# Figures --------------------------------------------------------------------
res$scatter.plot.Pi.Fh.ind <- dplyr::filter(res$individual.summary, POP_ID != "OVERALL") %>%
ggplot2::ggplot(data = ., ggplot2::aes(x = FH, y = PI)) +
ggplot2::geom_point(ggplot2::aes(colour = POP_ID)) +
ggplot2::stat_smooth(method = stats::lm, level = 0.95, fullrange = FALSE, na.rm = TRUE) +
ggplot2::labs(x = "Individual IBDg (FH)") +
ggplot2::labs(y = "Individual nucleotide diversity (Pi)") +
ggplot2::theme(
axis.title.x = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
axis.title.y = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
legend.title = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
legend.text = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
strip.text.y = ggplot2::element_text(angle = 0, size = 10, family = "Helvetica", face = "bold"),
strip.text.x = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold")
)
ggplot2::ggsave(
filename = stringi::stri_join(path.folder, "/scatter.plot.Pi.Fh.ind.pdf"),
plot = res$scatter.plot.Pi.Fh.ind,
width = 20, height = 15,
dpi = 600, units = "cm", useDingbats = FALSE)
res$scatter.plot.Pi.Fh.pop <- dplyr::filter(res$summary, POP_ID != "OVERALL") %>%
ggplot2::ggplot(data = ., ggplot2::aes(x = FH, y = PI_NEI)) +
ggplot2::geom_point(ggplot2::aes(colour = POP_ID)) +
ggplot2::stat_smooth(method = stats::lm, level = 0.95, fullrange = FALSE, na.rm = TRUE) +
ggplot2::labs(x = "Individual IBDg (FH)") +
ggplot2::labs(y = "Individual nucleotide diversity (Pi)") +
ggplot2::theme(
axis.title.x = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
axis.title.y = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
legend.title = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
legend.text = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
strip.text.y = ggplot2::element_text(angle = 0, size = 10, family = "Helvetica", face = "bold"),
strip.text.x = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold")
)
ggplot2::ggsave(
filename = stringi::stri_join(path.folder, "/scatter.plot.Pi.Fh.pop.pdf"),
plot = res$scatter.plot.Pi.Fh.pop,
width = 20, height = 15,
dpi = 600, units = "cm", useDingbats = FALSE)
res$boxplot.pi <- dplyr::filter(res$individual.summary, POP_ID != "OVERALL") %>%
ggplot2::ggplot(data = ., ggplot2::aes(x = factor(POP_ID), y = PI, na.rm = TRUE)) +
ggplot2::geom_violin(trim = F) +
ggplot2::geom_boxplot(width = 0.1, fill = "black", outlier.colour = NA) +
ggplot2::stat_summary(fun = "mean", geom = "point", shape = 21, size = 2.5, fill = "white") +
ggplot2::labs(x = "Sampling sites") +
ggplot2::labs(y = "Individual nucleotide diversity (Pi)") +
ggplot2::theme(
legend.position = "none",
axis.title.x = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
axis.title.y = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
legend.title = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
legend.text = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
strip.text.x = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold")
)
ggplot2::ggsave(
filename = stringi::stri_join(path.folder, "/boxplot.pi.pdf"),
plot = res$boxplot.pi,
width = 20, height = 15,
dpi = 600, units = "cm", useDingbats = FALSE)
res$boxplot.fh <- dplyr::filter(res$individual.summary, POP_ID != "OVERALL") %>%
ggplot2::ggplot(data = ., ggplot2::aes(x = factor(POP_ID), y = FH, na.rm = T)) +
ggplot2::geom_violin(trim = F) +
ggplot2::geom_boxplot(width = 0.1, fill = "black", outlier.colour = NA) +
ggplot2::stat_summary(fun = "mean", geom = "point", shape = 21, size = 2.5, fill = "white") +
ggplot2::labs(x = "Sampling sites") +
ggplot2::labs(y = "Individual IBDg (FH)") +
ggplot2::theme(
legend.position = "none",
axis.title.x = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
axis.title.y = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
legend.title = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
legend.text = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
strip.text.x = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold")
)
ggplot2::ggsave(
filename = stringi::stri_join(path.folder, "/boxplot.fh.pdf"),
plot = res$boxplot.fh,
width = 20, height = 15,
dpi = 600, units = "cm", useDingbats = FALSE)
}
# results --------------------------------------------------------------------
cat("############################### RESULTS ###############################\n")
message("Number of populations: ", length(pop))
message("Number of individuals: ", n.individuals)
message("Number of loci in the catalog: ", n.markers)
if (!is.null(res$artifacts.pop)) {
message(" impacted by assembly artifacts and/or sequencing errors (loci > 2 alleles) = ", dplyr::n_distinct(res$artifacts.pop$LOCUS))
}
message(" with consensus alleles = ", dplyr::n_distinct(res$consensus.pop$LOCUS))
if (!is.null(res$artifacts.pop)) {
message("Number of artefactual genotypes/total genotypes (percent): ", erased.genotype.number, "/", total.genotype.number.haplo, " (", percent.haplo, ")")
}
message("\nFiles written: ")
if (!is.null(res$artifacts.pop)) message(filename.artifacts.ind)
if (!is.null(res$artifacts.pop)) message(filename.paralogs)
if (!artifact.only) message(filename.sum)
if (!is.null(res$consensus.loci)) message("blacklist.loci.consensus.txt")
if (!artifact.only) message("private.haplotypes.tsv")
if (!artifact.only) message("private.haplotypes.summary.tsv")
timing <- proc.time() - timing
message("\nComputation time: ", round(timing[[3]]), " sec")
cat("############################## completed ##############################\n")
options(width = opt.change)
return(res)
}
# Internal nested Function -----------------------------------------------------
#' @title strata_haplo
#' @description Manage strata
#' @rdname strata_haplo
#' @keywords internal
#' @export
strata_haplo <- function(strata = NULL, data = NULL, blacklist.id = NULL) {
if (is.null(strata)) {
message("No strata file provided")
message(" generating a strata with 1 grouping")
if (is.null(data)) stop("data required to generate strata")
strata.df <- readr::read_tsv(
file = data,
n_max = 1,
na = "-",
col_names = FALSE,
col_types = readr::cols(.default = readr::col_character())) %>%
tidyr::pivot_longer(
data = .,
cols = tidyselect::everything(),
names_to = "DELETE",
values_to = "INDIVIDUALS"
) %>%
dplyr::mutate(INDIVIDUALS = clean_ind_names(INDIVIDUALS)) %>%
dplyr::select(-DELETE) %>%
dplyr::filter(!INDIVIDUALS %in% c("Catalog ID", "Cnt")) %>%
dplyr::distinct(INDIVIDUALS) %>%
dplyr::mutate(STRATA = rep("pop1", n()))
} else {
if (is.vector(strata)) {
suppressMessages(
strata.df <- readr::read_tsv(
file = strata, col_names = TRUE,
# col_types = col.types
col_types = readr::cols(.default = readr::col_character())
))
} else {
strata.df <- strata
}
}
colnames(strata.df) <- stringi::stri_replace_all_fixed(
str = colnames(strata.df),
pattern = "STRATA",
replacement = "POP_ID",
vectorize_all = FALSE
)
# Remove potential whitespace in pop_id
strata.df$POP_ID <- clean_pop_names(strata.df$POP_ID)
colnames.strata <- colnames(strata.df)
# clean ids
strata.df$INDIVIDUALS <- clean_ind_names(strata.df$INDIVIDUALS)
# filtering the strata if blacklist id available
if (!is.null(blacklist.id)) {
strata.df <- dplyr::anti_join(x = strata.df, y = blacklist.id, by = "INDIVIDUALS")
}
strata.df <- dplyr::distinct(strata.df, POP_ID, INDIVIDUALS)
return(strata.df)
}#End strata_haplo
#' @title Clean marker's name
#' @description Function to clean marker's name
#' of weird separators
#' that interfere with some packages
#' or codes. \code{/}, \code{:}, \code{-} and \code{.} are changed to an underscore
#' \code{_}.
#' @param x (character string) Markers character string.
#' @rdname clean_markers_names
#' @export
#' @keywords internal
clean_markers_names <- function(x) {
x <- stringi::stri_replace_all_fixed(
str = as.character(x),
pattern = c("/", ":", "-", "."),
replacement = "_",
vectorize_all = FALSE)
}#End clean_markers_names
#' @title clean individual's name
#' @description function to clean individual's name
#' that interfere with some packages
#' or codes. \code{_} and \code{:} are changed to a dash \code{-}.
#' @param x (character string) Individuals character string.
#' @rdname clean_ind_names
#' @export
#' @keywords internal
clean_ind_names <- function(x) {
x <- stringi::stri_replace_all_fixed(
str = as.character(x),
pattern = c("_", ":"),
replacement = c("-", "-"),
vectorize_all = FALSE)
}#End clean_ind_names
#' @title Clean population's name
#' @description Function to clean pop's name
#' that interfere with some packages
#' or codes. Space is changed to an underscore \code{_}.
#' @param x (character string) Population character string.
#' @rdname clean_pop_names
#' @export
#' @keywords internal
clean_pop_names <- function(x) {
x <- stringi::stri_replace_all_fixed(
str = as.character(x),
pattern = " ",
replacement = "_",
vectorize_all = FALSE)
}#End clean_pop_names
#' @title loc_pop_stats
#' @rdname loc_pop_stats
#' @description locus stats per individual
#' @export
#' @keywords internal
loc_pop_stats <- function(x) {
res <- dplyr::group_by(x, LOCUS, POP_ID) %>%
dplyr::summarise(
HOM = length(POLYMORPHISM[POLYMORPHISM == "hom"]),
HET = length(POLYMORPHISM[POLYMORPHISM == "het"]),
N_GENOT = HOM + HET,
HOM_O = HOM / N_GENOT,
HET_O = HET / N_GENOT
) %>%
dplyr::mutate(
NN = 2 * N_GENOT,
N_INV = N_GENOT / (N_GENOT - 1)
) %>%
dplyr::ungroup(.)
return(res)
}#End individual_stats
#' @title separate_haplo_long
#' @rdname separate_haplo_long
#' @description separate haplotype and gather
#' @export
#' @keywords internal
separate_haplo_long <- function(x) {
res <- dplyr::select(x, LOCUS, POP_ID, INDIVIDUALS, HAPLOTYPES) %>%
tidyr::separate(
col = HAPLOTYPES, into = c("ALLELE1", "ALLELE2"),
sep = "/", extra = "drop", remove = TRUE
) %>%
dplyr::mutate(ALLELE2 = ifelse(is.na(ALLELE2), ALLELE1, ALLELE2)) %>%
dplyr::select(-INDIVIDUALS) %>%
tidyr::pivot_longer(
data = .,
cols = -c("LOCUS", "POP_ID"),
names_to = "ALLELE_GROUP",
values_to = "ALLELES"
) %>%
dplyr::select(-ALLELE_GROUP)
return(res)
}#End separate_haplo_long
#' @title separate_haplo
#' @rdname separate_haplo
#' @description separate haplotype. NO gather
#' @export
#' @keywords internal
separate_haplo <- function(x) {
res <- dplyr::select(x, LOCUS, POP_ID, INDIVIDUALS, HAPLOTYPES) %>%
tidyr::separate(
col = HAPLOTYPES, into = c("ALLELE1", "ALLELE2"),
sep = "/", extra = "drop", remove = TRUE
) %>%
dplyr::mutate(ALLELE2 = ifelse(is.na(ALLELE2), ALLELE1, ALLELE2))
return(res)
}#End separate_haplo
# Pi function ----------------------------------------------------------------
#' @title pi
#' @rdname pi
#' @description Calculates pi
#' @export
#' @keywords internal
pi <- function(data, read.length) {
# data <- dplyr::filter(data, LOCUS == "2")#test
y <- dplyr::select(data, ALLELES) %>% purrr::flatten_chr(.)
if (length(unique(y)) <= 1) {
pi <- tibble::data_frame(PI = as.numeric(0))
} else {
#1 Get all pairwise comparison
allele.pairwise <- utils::combn(unique(y), 2)
#2 Calculate pairwise nucleotide mismatches
pairwise.mismatches <- apply(allele.pairwise, 2, function(z) {
stringdist::stringdist(a = z[1], b = z[2], method = "hamming")
})
#3 Calculate allele frequency
allele.freq <- table(y) / length(y)
#4 Calculate nucleotide diversity from pairwise mismatches and allele frequency
pi <- apply(allele.pairwise, 2, function(y) allele.freq[y[1]] * allele.freq[y[2]])
pi <- tibble::data_frame(PI = sum(pi * pairwise.mismatches) / read.length)
}
return(pi)
}#End pi
#' @title pi_pop
#' @rdname pi_pop
#' @description Calculates pi per pop
#' @export
#' @keywords internal
pi_pop <- function(data, read.length, parallel.core) {
pop <- unique(data$POP_ID)
message(" Pi calculations for pop: ", pop)
# data <- df.split.pop[["DD"]]
# data <- df.split.pop[["SKY"]]
pi.pop <- split(x = data, f = data$LOCUS)
pi.pop <- .stackr_parallel(
X = pi.pop,
FUN = pi,
mc.cores = parallel.core,
read.length = read.length
) %>%
dplyr::bind_rows(.) %>%
dplyr::summarise(PI_NEI = mean(PI, na.rm = TRUE)) %>%
tibble::add_column(.data = ., POP_ID = pop, .before = "PI_NEI")
return(pi.pop)
}#End pi_pop
# change POP_ID names
#' @name change_pop_names
#' @title Transform into a factor the POP_ID column, change names and reorder the levels
#' @description Transform into a factor the POP_ID column, change names and reorder the levels
#' @rdname change_pop_names
#' @export
#' @keywords internal
change_pop_names <- function(data, pop.levels = NULL, pop.labels = NULL) {
# checks ---------------------------------------------------------------------
if (missing(data)) stop("Input file missing")
# POP_ID in gsi_sim does not like spaces, we need to remove space in everything touching POP_ID...
# removing spaces in data$POP_ID, pop.levels and pop.labels
if (!is.null(pop.levels) & is.null(pop.labels)) {
pop.levels <- stringi::stri_replace_all_fixed(pop.levels, pattern = " ", replacement = "_", vectorize_all = FALSE)
pop.labels <- pop.levels
}
if (!is.null(pop.labels) & is.null(pop.levels)) stop("pop.levels is required if you use pop.labels")
if (!is.null(pop.labels)) {
if (length(pop.labels) != length(pop.levels)) stop("pop.labels and pop.levels must have the same length (number of groups)")
pop.labels <- stringi::stri_replace_all_fixed(pop.labels, pattern = " ", replacement = "_", vectorize_all = FALSE)
}
# in the data
data$POP_ID <- stringi::stri_replace_all_fixed(data$POP_ID, pattern = " ", replacement = "_", vectorize_all = FALSE)
# the number of pop in data == pop.levels
if (!is.null(pop.levels)) {
if (dplyr::n_distinct(data$POP_ID) != length(pop.levels)) {
stop("The number of groups in your input file doesn't match the number of groups in pop.levels")
}
}
# convert POP_ID to factor and change names-----------------------------------
if (is.null(pop.levels)) { # no pop.levels
data$POP_ID <- factor(data$POP_ID)
} else {# with pop.levels
data$POP_ID <- factor(x = data$POP_ID, levels = pop.levels, ordered = TRUE)
levels(data$POP_ID) <- pop.labels
}
return(data)
}# end function change_pop_names
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.