#' @name filter_hwe
#' @title Filter markers based on Hardy-Weinberg Equilibrium
#' @description Testing markers for Hardy-Weinberg proportions is a valuable tool
#' for the analysis and quality control of RADseq datasets.
#' HWE can highligh genotyping errors, presence of null alleles,
#' sequence duplication, copy number variation and other sequencing problems
#' related to read depth.
#' This function is designed for
#' \strong{bi-allelic markers} and uses the exact test from the package
#' HardyWeinberg. The function is speedy because it uses the
#' C++ code developed by Christopher Chang and also available in PLINK.
#' The p-value generated by the function is the mid p-value. It's computed
#' as half the probability of the current sample + the probabilities of all
#' samples that are more extreme (see references below). Several output are
#' generated to help users filter the data (see details).
#'
#' \strong{sampling sites vs well defined populations:} be careful what strata
#' you're investigating and adjust filtering threshold accordinghly for
#' downstream analysis. If you're still at the populations discovery steps,
#' markers under HWD are totally normal.
#'
#' Prior to HW filtering, I highly recommend removing outlier individuals,
#' filtering coverage and genotype likelihood (see details).
#' @param data A tidy data frame object in the global environment or
#' a tidy data frame in wide or long format in the working directory.
#' \emph{How to get a tidy data frame ?}
#' Look into \pkg{radiator} \code{\link{tidy_genomic_data}}.
#' @param filter.hwe (optional, logical) Used inside radiator pipeline.
#' Default: \code{filter.hwe = TRUE}.
#' @param hw.pop.threshold (integer, optional)
#' Remove markers that have a certain number of pops in Hardy-Weinberg
#' disequilibrium.
#' With default, all populations in dataset need to be in HWD before discarding
#' the marker.
#' Default: \code{hw.pop.threshold = NULL}.
#' @inheritParams radiator_common_arguments
#' @param midp.threshold (integer, optional)
#' By default the function generates blacklists/whitelists of markers and
#' filtered tidy datasets for the 5 mid p-value.
#' However, to get a final filtered object associated with the output of the
#' function, user need to choose one
#' of the 5 mid p-value:
#' \itemize{
#' \item \code{1} = 0.05 (*)
#' \item \code{2} = 0.001 (**)
#' \item \code{3} = 0.0001 (***)
#' \item \code{4} = 0.0001 (****)
#' \item \code{5} = 0.00001 (*****)
#' }.
#' With default, a very conservative mid p-value threshold is selected.
#' Default: \code{midp.threshold = 4}.
#' @param filename (optional) The function uses \code{\link[fst]{write.fst}},
#' to write the tidy data frame in
#' the folder created in the working directory. The file extension appended to
#' the \code{filename} provided is \code{.rad}.
#' Default: \code{filename = NULL}.
#' @inheritParams tidy_genomic_data
#' @inheritParams read_strata
#' @inheritParams radiator_common_arguments
#' @rdname filter_hwe
#' @export
#' @details
#' \strong{Interactive version}
#'
#' The user is asked to look at figures before choosing filter thresholds.
#'
#'
#' \strong{HWE threshold}
#'
#' I recommend starting with a low threshold.
#' Serious genotyping errors will generate extreme p-values (e.g. 1e-50),
#' which are detected by any reasonable configuration of this test,
#' while various life-history caracteristics will deflate/inflate
#' Hardy-Weinberg equilibrium.
#' Consequently, it's dangerous to choose a threshold that filters out too many
#' markers.
#'
#' \strong{strategies:}
#' Disk space is cheap! Consequently, the function will automatically generate
#' several blacklists/whitelists of markers and
#' filtered tidy data (in the directory)
#' based on the \code{hw.pop.threshold} for 5 groups of mid p-values:
#' \itemize{
#' \item 1: MID_P_VALUE <= 0.00001: *****
#' \item 2: MID_P_VALUE <= 0.0001: ****
#' \item 3: MID_P_VALUE <= 0.001: ***
#' \item 4: MID_P_VALUE <= 0.01: **
#' \item 5: MID_P_VALUE <= 0.05: *
#' }
#'
#' Test the sensitivity of downstream analysis and delete unwanted datasets.
#'
#' \strong{missing data}
#'
#' The mid-p adjustment tends to bring the null rejection rate in line with the
#' nominal p-value, and also reduces the filter's tendency to favor retention
#' of SNPs with missing data (Graffelman and Moreno, 2013, Purcell et al., 2007).
#'
#' If pattern of missing data is present in the dataset, or when missing data
#' accross markers vary by more than 0.10, you should not apply a single mid-p-value
#' threshold accross markers.
#'
#' \strong{Read depth, pooling lanes/chips and weird pattern of individual heterozygosity}
#'
#' Because of read depth, heterozygote deficiency is usually observed in RADseq data,
#' but if sequencing lanes/chips were combined to generate individuals with more
#' coverage, the situation will likely be the reverse: heterozygote excess.
#' If lanes/chips pooling was used or if highly variable sequencing coverage is
#' observed between individuals and/or markers, there's a couple qc and filtering
#' steps you should do before conducting HWE filtering.
#'
#' \itemize{
#' \item run \pkg{radiator} \code{\link{detect_mixed_genomes}} and
#' \code{\link{detect_het_outliers}} to highlight
#' outlier individuals with potential heterozygosity problems and get an idea of
#' the genotyping and heterozygote miscall rate
#' \item I recommend normalizing the data before \emph{de novo} assembly, if this
#' is not possible...
#' \item use radiator::filter_rad or a more chirurgical approach to
#' coverage and genotype likelihood with radiator::filter_rad.
#' }
#'
#'
#' \strong{Permutation test}
#' Hardy-Weinberg equilibrium refers to the statistical independence of alleles
#' within individuals. This independence can also be assessed by permutation test
#' inside the HardyWeinberg package of Jan Graffelman. To filter out markers with
#' genotyping problems the approach provided in this function is enough.
#'
#' \strong{Power test for HWE}
#' Based on allele count/frequency and sample size. This function is longer to
#' generate for each markers and is on my todo list to include it in this filter.
#'
#'
#' \strong{Markers under selection and genome scans:}
#'
#' Scared of deleting those precious markers or that the filter might interfere
#' with genome scan analysis/detection ? Don't be. Your markers or analysis is no
#' good if it's done on bad data... Test the sensitivity of your downstream
#' analysis with the datasets generated with the different thresholds.
#' @note
#' \strong{Hardy-Weinberg assumptions (refresh):}
#' \enumerate{
#' \item Diploid organisms
#' \item Only reproduction sexual occurs
#' \item Generations are non-overlapping
#' \item Mating is random
#' \item Population size is infinitely large
#' \item Allele frequencies are equal in the sexes
#' \item Migration, Mutation and Selection are negligible
#' }
#'
#' @return A list in the global environment objects:
#' \enumerate{
#' \item $path.folder: the path to the folder generated.
#' \item $hw.pop.threshold: the number of populations tolerated to be in HWD before
#' blacklisting the markers.
#' \item $plot.hwd.thresholds: useful figure that highlight the number of markers
#' blacklisted based on the number of populations in HWD and mid p-value thresholds.
#' \item $plot.tern: ternary plot of markers (currently unavailable until ggtern is updated).
#' \item $hw.manhattan: manhattan plot of markers in Hardy-Weinberg disequilibrium.
#' \item $hwe.pop.sum: a summary tibble with populations, number of markers in total,
#' number of markers monomorphic for the populations,
#' number of markers in Hardy-Weinberg Equilibrium (HWE),
#' number of markers in Hardy-Weinberg Dquilibrium (HWD) with all the different
#' mid p-values observed on the data.
#' \item $midp.threshold: the mid p-value threshold chosen for the final dataset (next)
#' \item $tidy.hw.filtered: the final filtered dataset (oter datasets \code{.rad}
#' are generated automatically by the function, check the folder)
#' }
#'
#' Written in the folder:
#' \enumerate{
#' \item genotypes.summary.tsv: A tibble with these columns:
#' \code{MARKERS, STRATA, HET, HOM_ALT, HOM_REF, MISSING, N,
#' FREQ_ALT, FREQ_REF, FREQ_HET, FREQ_HOM_REF_O, FREQ_HET_O, FREQ_HOM_ALT_O,
#' FREQ_HOM_REF_E, FREQ_HET_E, FREQ_HOM_ALT_E, N_HOM_REF_EXP,
#' N_HET_EXP, N_HOM_ALT_EXP, HOM_REF_Z_SCORE, HOM_HET_Z_SCORE,
#' HOM_ALT_Z_SCORE, READ_DEPTH}
#' \item hw.pop.sum.tsv: a summary tibble with populations, number of markers in total,
#' number of markers monomorphic for the populations,
#' number of markers in Hardy-Weinberg Equilibrium (HWE),
#' number of markers in Hardy-Weinberg Dquilibrium (HWD) with all the different
#' mid p-values observed on the data.
#' \item hwd.helper.table.tsv: useful tibble that highlight the number of markers
#' blacklisted based on the number of populations in HWD and mid p-value thresholds.
#' \item hwd.plot.blacklist.markers.pdf: useful figure that highlight the number of markers
#' blacklisted based on the number of populations in HWD and mid p-value thresholds.
#' \item hwe.manhattan.plot.pdf: manhattan plot of markers in Hardy-Weinberg disequilibrium.
# \item hwe.ternary.plots.missing.data.pdf: ternary plot of markers.
#' \item tidy.filtered.hwe.xxx.mid.p.value.xxx.hw.pop.threshold.rad: several
#' tidy dataset filtered with different mid p value and populations in HWD thresholds
#' \item whitelist.markers.hwe.xxx.mid.p.value.xxx.hw.pop.threshold.tsv: several
#' whitelist of markers with different mid p value and populations in HWD thresholds
#' \item blacklist.markers.hwd.xxx.mid.p.value.xxx.hw.pop.threshold.tsv: several
#' blacklist of markers with different mid p value and populations in HWD thresholds
#' }
#'
#'
#' @examples
#' \dontrun{
#' require(HardyWeinberg)
#' library(radiator)
#' # for the interactive version (recommended)
#' turtle.pop <- radiator::filter_hwe(
#' data = "turtle.vcf",
#' strata = "turtle.strata.tsv",
#' filename = "hwe.turtle"
#' )
#' }
#' @references Weir, B.S. (1996) Genetic data analysis II. Sinauer Associates,
#' Massachusetts. See Chapter3.
#'
#' @references Wigginton, J.E., Cutler, D.J. and Abecasis, G.R. (2005)
#' A note on exact tests of Hardy-Weinberg equilibrium,
#' American Journal of Human Genetics (76) pp. 887-893.
#'
#' @references Purcell et al. (2007) PLINK: A Toolset for Whole-Genome Association
#' and Population-Based Linkage Analysis.
#' American Journal of Human Genetics 81(3) pp. 559-575.
#'
#' @references Graffelman, J. and Moreno, V. (2013) The mid p-value in exact
#' tests for Hardy-Weinberg equilibrium, Statistical Applications in Genetics
#' and Molecular Biology 12(4) pp. 433-448.
#'
#' @references Graffelman J, Jain D, Weir B (2017)
#' A genome-wide study of Hardy-Weinberg equilibrium with next generation
#' sequence data. Human Genetics, 136, 727-741.
filter_hwe <- function(
data,
interactive.filter = TRUE,
filter.hwe = TRUE,
strata = NULL,
hw.pop.threshold = NULL,
midp.threshold = 4L,
filename = NULL,
parallel.core = parallel::detectCores() - 1,
verbose = TRUE,
...
) {
# obj.keeper <- c(ls(envir = globalenv()), "data")
if (interactive.filter || filter.hwe) {
if (interactive.filter) verbose <- TRUE
##Testing
# interactive.filter = TRUE
# filter.hwe <- TRUE
# # data <- gds
# strata = NULL
# hw.pop.threshold = NULL
# midp.threshold = 4L
# filename = NULL
# parallel.core = parallel::detectCores() - 1
# verbose = TRUE
# path.folder <- getwd()
# parameters = NULL
# internal <- FALSE
# required package
radiator_packages_dep(package = "HardyWeinberg")
# radiator_packages_dep(package = "ggtern")
radiator_function_header(f.name = "filter_hwe", verbose = verbose)
# Cleanup---------------------------------------------------------------------
file.date <- format(Sys.time(), "%Y%m%d@%H%M")
if (verbose) message("Execution date@time: ", file.date)
old.dir <- getwd()
opt.change <- getOption("width")
options(width = 70)
timing <- radiator_tic()
res <- list()
#back to the original directory and options
on.exit(setwd(old.dir), add = TRUE)
on.exit(options(width = opt.change), add = TRUE)
on.exit(radiator_toc(timing), add = TRUE)
on.exit(radiator_function_header(f.name = "filter_hwe", start = FALSE, verbose = verbose), add = TRUE)
# on.exit(rm(list = setdiff(ls(envir = sys.frame(-1L)), obj.keeper), envir = sys.frame(-1L)))
# Function call and dotslist -------------------------------------------------
rad.dots <- radiator_dots(
func.name = as.list(sys.call())[[1]],
fd = rlang::fn_fmls_names(),
args.list = as.list(environment()),
dotslist = rlang::dots_list(..., .homonyms = "error", .check_assign = TRUE),
keepers = c("path.folder", "parameters", "internal"),
verbose = FALSE
)
# Checking for missing and/or default arguments ------------------------------
if (missing(data)) rlang::abort("data is missing")
if (is.null(midp.threshold)) midp.threshold <- 4L
# Message about steps taken during the process ---------------------------------
if (interactive.filter) {
message("Interactive mode: on")
if (!is.null(hw.pop.threshold)) {
message("Disabling hw.pop.threshold value")
message(" you'll have to enter the value manually after visualization")
}
}
# Folders---------------------------------------------------------------------
path.folder <- generate_folder(
f = path.folder,
rad.folder = "filter_hwe",
internal = internal,
file.date = file.date,
verbose = verbose)
# write the dots file
write_rad(
data = rad.dots,
path = path.folder,
filename = stringi::stri_join("radiator_filter_hwe_args_", file.date, ".tsv"),
tsv = TRUE,
internal = internal,
write.message = "Function call and arguments stored in: ",
verbose = verbose
)
# File type detection----------------------------------------------------------
data.type <- radiator::detect_genomic_format(data)
if (data.type %in% c("SeqVarGDSClass", "gds.file")) {
if (data.type == "gds.file") {
data <- radiator::read_rad(data, verbose = verbose)
}
gds.bk <- data
data <- gds2tidy(gds = data, pop.id = FALSE, parallel.core = parallel.core)
data %<>% dplyr::rename(STRATA = tidyselect::any_of("POP_ID"))
data.type <- "tbl_df"
} else {
data.bk <- data
gds.bk <- NULL
}
if (data.type %in% c("tbl_df", "fst.file")) {
message(" using tidy data frame of genotypes as input")
message(" skipping all filters")
if (data.type == "fst.file") {
data <- radiator::read_rad(data)
}
} else {
data <- radiator::tidy_genomic_data(
data = data,
vcf.metadata = TRUE,
blacklist.genotype = NULL,
strata = strata,
filename = NULL,
parallel.core = parallel.core,
verbose = FALSE
)
}
# Filter parameter file: initiate ------------------------------------------
filters.parameters <- radiator_parameters(
generate = TRUE,
initiate = TRUE,
update = FALSE,
parameter.obj = parameters,
data = data,
path.folder = path.folder,
file.date = file.date,
internal = internal,
verbose = verbose)
# create a strata.df
strata <- radiator::generate_strata(data = data, pop.id = FALSE)
if (is.factor(strata$STRATA)) {
pop.id.levels <- levels(strata$STRATA)
} else {
pop.id.levels <- unique(strata$STRATA)
}
# Check that at least 10 ind/pop ---------------------------------------------
pop.removed <- dplyr::group_by(strata, STRATA) %>%
dplyr::tally(.) %>%
dplyr::filter(n < 10) %>%
dplyr::distinct(STRATA) %>%
dplyr::mutate(STRATA = as.character(STRATA)) %>%
purrr::flatten_chr(.)
if (length(pop.removed) > 0) {
message("\n\nStrata removed from analysis because n < 10: ", stringi::stri_join(pop.removed, collapse = ", "))
if (length(pop.removed) == length(pop.id.levels)) {
message("\n\nAll strata were removed. Stopping analysis. Returning original data.\n\n")
run.analysis <- FALSE
if (!is.null(gds.bk)) {
return(gds.bk)
} else {
return(data.bk)
}
} else {
run.analysis <- TRUE
message(" Note: removed strata are included back in datasets at the end\n\n")
data.temp <- dplyr::filter(data, STRATA %in% pop.removed)
data <- dplyr::filter(data, !STRATA %in% pop.removed)
if (is.factor(data$STRATA)) data$STRATA <- droplevels(data$STRATA)
strata <- dplyr::filter(strata, !STRATA %in% pop.removed)
}
} else {
data.temp <- NULL
run.analysis <- TRUE
}
# run.analysis --------------------------------------------------------------
if (run.analysis) {
# Check that data is biallelic
biallelic <- radiator::detect_biallelic_markers(data, parallel.core = parallel.core)
if (!biallelic) rlang::abort("This filter requires bi-allelic data")
# prepare filter, table and figure------------------------------------------
if (verbose) message("Summarizing data")
sample.size <- length(unique(data$INDIVIDUALS))
want <- c("MARKERS", "STRATA", "N", "MISSING", "HOM_REF", "HET", "HOM_ALT", "READ_DEPTH")
data.sum <- summarise_genotypes(data, path.folder = path.folder) %>%
dplyr::rename(STRATA = tidyselect::any_of("POP_ID")) %>%
dplyr::select(tidyselect::any_of(want)) %>%
dplyr::rename(AA = HOM_REF, AB = HET, BB = HOM_ALT)
if (is.factor(data.sum$STRATA)) {
pop.levels <- levels(data.sum$STRATA)
} else {
pop.levels <- unique(data.sum$STRATA)
}
n.pop <- length(pop.levels)
if (verbose) message("File written: genotypes.summary.tsv")
# HWE analysis -------------------------------------------------------------
data.sum <- hwe_analysis(x = data.sum, parallel.core = 1) %>%
dplyr::mutate(
STRATA = factor(STRATA, levels = pop.levels),
GROUPINGS = factor(
x = GROUPINGS,
levels = c("monomorphic", "hwe", "*", "**", "***", "****", "*****"),
labels = c("monomorphic", "hwe",
"midp <= 0.05 (*)",
"midp <= 0.01 (**)",
"midp <= 0.001 (***)",
"midp <= 0.0001 (****)",
"midp <= 0.00001 (*****)")),
MISSING_PROP = MISSING / (MISSING + N)
)
# Step 1. Impact of population threshold on marker discovery------------------
prop_join <- function(x, y) {
stringi::stri_join(x, " (", round(x / y, 3), ")")
}
hwe.pop.sum <- data.sum %>%
dplyr::group_by(STRATA) %>%
dplyr::summarise(
MARKERS_TOTAL = length(MARKERS),
MONOMORPHIC = length(MARKERS[MONO]),
HWE = length(MARKERS[HWE & !is.na(HWE)]),
`HWD*` = length(MARKERS[`*` & !is.na(`*`)]),
`HWD**` = length(MARKERS[`**` & !is.na(`**`)]),
`HWD***` = length(MARKERS[`***` & !is.na(`***`)]),
`HWD****` = length(MARKERS[`****` & !is.na(`****`)]),
`HWD*****` = length(MARKERS[`****` & !is.na(`*****`)])
) %>%
dplyr::ungroup(.) %>%
readr::write_tsv(x = ., file = file.path(path.folder, "hw.pop.sum.tsv"))
if (verbose) message("File written: hw.pop.sum.tsv")
hwd.markers.pop.sum <- data.sum %>%
dplyr::filter(!HWE, STRATA != "OVERALL") %>%
dplyr::select(STRATA, MARKERS, `*`, `**`, `***`, `****`, `*****`) %>%
radiator::rad_long(
x = .,
cols = c("MARKERS", "STRATA"),
names_to = "SIGNIFICANCE",
values_to = "VALUE",
variable_factor = FALSE
) %>%
dplyr::mutate(
SIGNIFICANCE = factor(SIGNIFICANCE,
levels = c("*", "**", "***", "****", "*****"))) %>%
dplyr::filter(VALUE) %>%
dplyr::group_by(MARKERS, SIGNIFICANCE) %>%
dplyr::tally(.) %>%
dplyr::rename(N_POP_HWD = n) #%>%dplyr::mutate(N_POP_HWD = n.pop -1 - N_POP_HWD)# To get the tolerance to...
hwd.levels <- sort(unique(hwd.markers.pop.sum$N_POP_HWD))
n.markers <- dplyr::n_distinct(data$MARKERS)
`Exact test mid p-value` <- NULL
overall <- data.sum %>%
dplyr::filter(!HWE, STRATA == "OVERALL") %>%
dplyr::select(STRATA, MARKERS, `*`, `**`, `***`, `****`, `*****`) %>%
radiator::rad_long(
x = .,
cols = c("MARKERS", "STRATA"),
names_to = "SIGNIFICANCE",
values_to = "VALUE",
variable_factor = FALSE
) %>%
dplyr::mutate(SIGNIFICANCE = factor(
x = SIGNIFICANCE,
levels = c("*", "**", "***", "****", "*****"))) %>%
dplyr::filter(VALUE) %>%
dplyr::group_by(MARKERS, SIGNIFICANCE) %>%
dplyr::tally(.) %>%
dplyr::rename(N_POP_HWD = n) %>%
dplyr::group_by(SIGNIFICANCE, N_POP_HWD) %>%
dplyr::tally(.) %>%
dplyr::ungroup(.) %>%
dplyr::mutate(N_POP_HWD = "OVERALL")
hwd.helper.table.long <- hwd.markers.pop.sum %>%
dplyr::group_by(SIGNIFICANCE, N_POP_HWD) %>%
dplyr::tally(.) %>%
dplyr::ungroup(.) %>%
dplyr::mutate(N_POP_HWD = as.character(N_POP_HWD)) %>%
dplyr::bind_rows(overall) %>%
dplyr::mutate(
N_POP_HWD = factor(
x = N_POP_HWD, levels = c("OVERALL", hwd.levels))
) %>%
tidyr::complete(
data = .,
SIGNIFICANCE, N_POP_HWD,
fill = list(n = 0))
overall <- NULL
hwd.helper.table <- radiator::rad_wide(
x = hwd.helper.table.long,
formula = "N_POP_HWD ~ SIGNIFICANCE",
values_from = "n"
) %>%
dplyr::filter(N_POP_HWD != 0) %>%
readr::write_tsv(x = ., file = file.path(path.folder, "hwd.helper.table.tsv"))
hwd.helper.table.long <- hwd.helper.table.long %>%
dplyr::rename(`Exact test mid p-value` = SIGNIFICANCE)
# dot plot with thresholds ---------------------------------------------------
#Function to replace packageplyr round_any
rounder <- function(x, accuracy, f = round) {
f(x / accuracy) * accuracy
}
max.markers <- max(hwd.helper.table.long$n)
if (max.markers >= 1000) {
y.breaks.by <- rounder(max.markers/10, 100, ceiling)
y.breaks.max <- rounder(max.markers, 1000, ceiling)
y.breaks <- seq(0, y.breaks.max, by = y.breaks.by)
} else {
y.breaks.by <- rounder(max.markers/10, 10, ceiling)
y.breaks.max <- rounder(max.markers, 100, ceiling)
y.breaks <- seq(0, y.breaks.max, by = y.breaks.by)
}
plot.hwd.thresholds <- ggplot2::ggplot(
data = hwd.helper.table.long,
ggplot2::aes(x = N_POP_HWD, y = n, colour = `Exact test mid p-value`, group = `Exact test mid p-value`)) +
ggplot2::geom_point(size = 2, shape = 21, fill = "white") +
ggplot2::geom_line() +
# ggplot2::scale_x_continuous(name = "Number of populations in HWD", breaks = 1:n.pop - 1) +
ggplot2::scale_y_continuous(name = "Number of markers blacklisted", breaks = y.breaks, limits = c(0, y.breaks.max)) +
ggplot2::labs(
x = "Number of populations in HWD",
title = "Number of markers blacklisted based on the number of populations in HWD\nand mid p-value thresholds") +
ggplot2::theme(
plot.title = ggplot2::element_text(size = 12, face = "bold", hjust = 0.5),
axis.title.x = ggplot2::element_text(size = 10, face = "bold"),
axis.title.y = ggplot2::element_text(size = 10, face = "bold"),
axis.text.x = ggplot2::element_text(size = 8),# , angle = 90, hjust = 1, vjust = 0.5),
strip.text.x = ggplot2::element_text(size = 10, face = "bold")
)
# plot.hwd.thresholds
if (interactive.filter) print(plot.hwd.thresholds)
ggplot2::ggsave(
limitsize = FALSE,
plot = plot.hwd.thresholds,
filename = file.path(path.folder, "hwd.plot.blacklist.markers.pdf"),
width = max(20, n.pop * 5), height = 10,
dpi = 300, units = "cm", useDingbats = FALSE)
hwd.helper.table.long <- NULL
if (verbose) message("Plot written: hwd.plot.blacklist.markers.pdf")
# if (interactive.filter) {
# message("Step 1. Ternary plot visualization")
# }
# HardyWeinberg::HWTernaryPlot(
# X = dplyr::filter(data, STRATA == "ATL") %>%
# dplyr::select(AA, AB, BB) %>% as.matrix, n = sample.size,
# region = 1,
# hwcurve = TRUE,
# verbose = TRUE)
# testing with duplicated info removed
# data.dup <- data2 %>%
# dplyr::distinct(MARKERS, AA, AB, BB)
# ternary plot -----------------------------------------------------------------
# library(ggtern)
# num.groups <- dplyr::n_distinct(data.sum$GROUPINGS)
# if (num.groups == 7) group_colors <- c("grey", "green", "yellow", "orange",
# "orangered", "red", "darkred")
# if (num.groups == 6) group_colors <- c("green", "yellow", "orange",
# "orangered", "red", "darkred")
# if (num.groups == 5) group_colors <- c("green", "yellow", "orange",
# "orangered", "red")
# if (num.groups == 4) group_colors <- c("green", "yellow", "orange",
# "orangered")
# if (num.groups == 3) group_colors <- c("green", "yellow", "orange")
# if (num.groups == 2) group_colors <- c("green", "yellow")
#
# # HW Parabola
# parabola <- tibble::tibble(p = seq(0, 1, by = 0.005)) %>%
# dplyr::mutate(AA = p^2, AB = 2 * p * (1 - p), BB = (1 - p)^2, p = NULL)
# sample.size <- data.sum %>% dplyr::group_by(STRATA) %>%
# dplyr::summarise(NN = 2* max(N, na.rm = TRUE))
#
# hw_parabola <- function(x, sample.size, parabola) {
# pop <- unique(x)
# pop.size <- sample.size$NN[sample.size$STRATA == pop]
# parabola <- parabola %>%
# dplyr::mutate(
# STRATA = pop,
# NN = pop.size,
# AA = AA * NN,
# AB = AB * NN,
# BB = BB * NN,
# NN = NULL,
# GROUPINGS = "hwe",
# MISSING_PROP = 0)
# return(parabola)
# }
# hw.parabola <- purrr::map_df(.x = pop.levels, .f = hw_parabola,
# sample.size = sample.size, parabola = parabola) %>%
# dplyr::mutate(STRATA = factor(STRATA, pop.levels))
# parabola <- sample.size <- NULL
# plot.tern <- ggtern::ggtern(
# data = data.sum,
# ggtern::aes(AA, AB, BB, color = GROUPINGS, size = MISSING_PROP)) +
# ggplot2::scale_color_manual(name = "Exact test mid p-value", values = group_colors) +
# ggplot2::scale_size_continuous(name = "Missing genotypes proportion") +
# ggplot2::geom_point(alpha = 0.4) +
# ggplot2::geom_line(data = hw.parabola, ggplot2::aes(x = AA, y = AB),
# linetype = 2, size = 0.6, colour = "black") +
# ggplot2::labs(
# x = "AA", y = "AB", z = "BB",
# title = "Hardy-Weinberg Equilibrium ternary plots",
# subtitle = "genotypes frequencies shown for AA: REF/REF, AB: REF/ALT and BB: ALT/ALT"
# ) +
# ggplot2::theme(
# plot.title = ggplot2::element_text(size = 12, face = "bold", hjust = 0.5),
# plot.subtitle = ggplot2::element_text(size = 10, hjust = 0.5)
# ) +
# ggtern::theme_rgbw() +
# ggtern::theme_nogrid_minor() +
# ggtern::theme_nogrid_major() +
# ggplot2::facet_wrap(~ STRATA)
# plot.tern
# ggtern::ggsave(
# limitsize = FALSE,
# plot = plot.tern,
# # filename = file.path(path.folder, "hwe.ternary.plots.read.depth.pdf"),
# filename = file.path(path.folder, "hwe.ternary.plots.missing.data.pdf"),
# width = max(25, n.pop * 5), height = max(15, n.pop * 4),
# dpi = 300, units = "cm", useDingbats = FALSE)
# hw.parabola <- NULL
# if (verbose) message("Plot written: hwe.ternary.plots.missing.data.pdf")
# plot.tern <- "temporarily out of order"
# Manhattan plot -------------------------------------------------------------
data.sum.man <- dplyr::mutate(data.sum, X = "x") %>%
dplyr::filter(MID_P_VALUE < 0.05)
# rounder <- function(x, accuracy, f = round) {
# f(x / accuracy) * accuracy
# }
# y.breaks.by <- rounder(max(data.sum$MID_P_VALUE, na.rm = TRUE)/10, 0.001, ceiling)
# y.breaks.max <- rounder(max(data.sum$MID_P_VALUE, na.rm = TRUE), 0.001, ceiling) + (y.breaks.by / 2)
# y.breaks.min <- rounder(min(data.sum$MID_P_VALUE, na.rm = TRUE), 0.001, ceiling) - (y.breaks.by / 2)
# y.breaks <- seq(y.breaks.min, y.breaks.max, by = y.breaks.by)
num.groups <- dplyr::n_distinct(data.sum.man$GROUPINGS)
if (num.groups == 5) group_colors <- c("yellow", "orange", "orangered", "red", "darkred")
if (num.groups == 4) group_colors <- c("yellow", "orange", "orangered", "red")
if (num.groups == 3) group_colors <- c("yellow", "orange", "orangered")
if (num.groups == 2) group_colors <- c("yellow", "orange")
hw.manhattan <- ggplot2::ggplot(
data = data.sum.man,
ggplot2::aes(x = X, y = MID_P_VALUE, color = GROUPINGS, size = MISSING_PROP)) +
ggplot2::geom_jitter(alpha = 0.5) +
ggplot2::scale_color_manual(name = "Exact test mid p-value", values = group_colors) +
ggplot2::scale_size_continuous(name = "Missing genotypes proportion") +
ggplot2::labs(
x = "Strata",
y = "Markers mid p-value",
title = "Manhanttan plot of markers in Hardy-Weinberg disequilibrium"
) +
ggplot2::theme_bw() +
ggplot2::theme(
panel.grid.major.x = ggplot2::element_blank(),
plot.title = ggplot2::element_text(size = 12, face = "bold", hjust = 0.5),
plot.subtitle = ggplot2::element_text(size = 10, hjust = 0.5),
axis.line.x = ggplot2::element_blank(),
# axis.title.x = ggplot2::element_blank(),
axis.text.x = ggplot2::element_blank(),
axis.ticks.x = ggplot2::element_blank(),
axis.title.y = ggplot2::element_text(size = 10, face = "bold"),
axis.text.y = ggplot2::element_text(size = 8)
) +
ggplot2::facet_grid(GROUPINGS ~ STRATA, scales = "free")
# if (interactive.filter) print(hw.manhattan)
ggplot2::ggsave(
filename = file.path(path.folder, "hwe.manhattan.plot.pdf"),
plot = hw.manhattan,
width = max(20, n.pop * 5), height = 30,
dpi = 300, units = "cm", useDingbats = FALSE, limitsize = FALSE)
if (verbose) message("Plot written: hwe.manhattan.plot.pdf")
# continue with filters or not ------
if (interactive.filter) {
hw.q <- "\nDo you want to continue with the filtering ? (y/n):"
do.hw <- radiator_question(x = hw.q, answer.opt = c("y", "n"))
} else {
do.hw <- "y"
}
if (do.hw == "y") {
# generate the blacklist/whitelist -------------------------------------------
# Generate blacklist of markers with the 4 significance groups
# threshold (integer) The number of outlier pop you tolerate to deviate from HWE.
# e.g. if \code{threshold = 2}, blacklist of markers out of HWE in more than (>=)
# 2 populations will be generated for all significance groupings.
#leave user with this figure before choosing threshold
if (is.null(hw.pop.threshold)) hw.pop.threshold <- n.pop - 1
if (interactive.filter) {
hw.pop.threshold <- radiator_question(
x = "\nBased on figures and tables enter the hw.pop.threshold (integer): ",
minmax = c(0, 100000000))
}
# Generating blacklists, whitelists and filtered tidy data -------------------
if (verbose) message("\nGenerating blacklists, whitelists and filtered tidy data...")
blacklist_hw(
x = hwd.markers.pop.sum,
unfiltered.data = data,
data.temp = data.temp,
hw.pop.threshold = hw.pop.threshold,
path.folder = path.folder,
pop.id.levels = pop.id.levels)
# Choosing the last dataset --------------------------------------------
no.file <- TRUE
while (no.file) {
if (interactive.filter) {
message("\nChoosing the final filtered dataset")
midp.threshold <- radiator_question(
x = " select the mid p-value threshold (5 options):\n1: 0.05 *\n2: 0.01 **\n3: 0.001 ***\n4: 0.0001 ****\n5: 0.00001 *****",
minmax = c(1, 5)
)
}
midp.threshold <- dplyr::case_when(
midp.threshold == 5 ~ 0.00001,
midp.threshold == 4 ~ 0.0001,
midp.threshold == 3 ~ 0.001,
midp.threshold == 2 ~ 0.01,
midp.threshold == 1 ~ 0.05) %>%
format(., scientific = FALSE)
# path.folder <- yft.hw$path.folder
data.filtered.name <- list.files(
path = path.folder,
pattern = stringi::stri_join("tidy.filtered.hwe.", midp.threshold),
full.names = FALSE)
if (length(data.filtered.name) == 0) {
message("No file associated with this threshold, choose again")
no.file <- TRUE
} else {
no.file <- FALSE
}
}
message("\nFinal filtered tidy dataset: \n", data.filtered.name)
message("\nUsing hw.pop.threshold/midp.threshold: ", hw.pop.threshold, "/", midp.threshold)
data <- list.files(
path = path.folder,
pattern = stringi::stri_join("tidy.filtered.hwe.", midp.threshold),
full.names = TRUE) %>%
radiator::read_rad(data = .)
}
# GDS ----------------------------------------------------------------------
if (!is.null(gds.bk)) {
# convert back or filter the gds....
markers.meta <- extract_markers_metadata(gds = gds.bk)
bl <- markers.meta %>%
dplyr::filter(FILTERS == "whitelist") %>%
dplyr::filter(!MARKERS %in% unique(data$MARKERS))
markers.meta %<>%
dplyr::mutate(
FILTERS = dplyr::if_else(MARKERS %in% bl$MARKERS, "filter.hwe", FILTERS)
)
data <- gds.bk
gds.bk <- bl <- NULL
update_radiator_gds(
gds = data,
node.name = "markers.meta",
value = markers.meta,
sync = TRUE)
}
}#End run.analysis
# Update parameters --------------------------------------------------------
filters.parameters <- radiator_parameters(
generate = FALSE,
initiate = FALSE,
update = TRUE,
parameter.obj = filters.parameters,
data = data,
filter.name = "Filter HWE",
param.name = paste0("hw.pop.threshold / midp.threshold"),
values = stringi::stri_join(hw.pop.threshold, midp.threshold,
collapse = " / ", ignore_null = FALSE),
path.folder = path.folder,
file.date = file.date,
internal = internal,
verbose = verbose)
# results ------------------------------------------------------------------
radiator_results_message(
rad.message = stringi::stri_join("Filter HWE: ", stringi::stri_join(hw.pop.threshold, " / ", midp.threshold,
ignore_null = FALSE)),
filters.parameters,
internal,
verbose
)
}
return(data)
}# End filter_hwe
# Internal nested functions ----------------------------------------------------
# hwe analysis
#' @title hwe_analysis
#' @description main hwe function
#' @rdname hwe_analysis
#' @keywords internal
#' @export
#' @author Thierry Gosselin \email{thierrygosselin@@icloud.com}
# Note to myself: will have to enable an option that generate different datasets
# to build confidence interval
# permuting populations
hwe_analysis <- function(x, parallel.core = parallel::detectCores() - 1) {
hwe_map <- function(x, parallel.core) {
pop <- as.character(unique(x$STRATA))
message("HWE analysis for pop: ", pop)
if (tibble::has_name(x, "STRATA")) x <- dplyr::select(x, -STRATA)
hwe_radiator <- carrier::crate(function(x, pop = NULL) {
`%>%` <- magrittr::`%>%`
`%<>%` <- magrittr::`%<>%`
mono <- function(x) {
mono <- length(x$AA[x$AA == 0]) + length(x$AB[x$AB == 0]) + length(x$BB[x$BB == 0])
if (mono >= 2) {
mono <- TRUE
} else {
mono <- FALSE
}
return(mono)
}
hw.res <- x %>%
dplyr::bind_cols(dplyr::rowwise(.) %>% dplyr::do(MONO = mono(.))) %>%
dplyr::mutate(MONO = unlist(MONO))
hw.poly <- dplyr::filter(hw.res, !MONO) %>%
dplyr::select(-MONO)
markers <- dplyr::select(hw.poly, MARKERS)
hw.poly <- dplyr::select(hw.poly, AA, AB, BB) #%>% as.matrix
hw.res <- dplyr::left_join(
hw.res,
dplyr::bind_cols(
markers,
tibble::tibble(
MID_P_VALUE = HardyWeinberg::HWExactStats(
X = hw.poly, plinkcode = TRUE, midp = TRUE)
)
)
, by = "MARKERS") %>%
dplyr::mutate(
HWE = MID_P_VALUE >= 0.05,
SIGNIFICANCE = dplyr::case_when(
MID_P_VALUE <= 0.00001 ~ "*****",
MID_P_VALUE <= 0.0001 ~ "****",
MID_P_VALUE <= 0.001 ~ "***",
MID_P_VALUE <= 0.01 ~ "**",
MID_P_VALUE <= 0.05 ~ "*"),
`*` = MID_P_VALUE <= 0.05,
`**` = MID_P_VALUE <= 0.01,
`***` = MID_P_VALUE <= 0.001,
`****` = MID_P_VALUE <= 0.0001,
`*****` = MID_P_VALUE <= 0.00001,
GROUPINGS = SIGNIFICANCE,
GROUPINGS = dplyr::if_else(
MONO, "monomorphic",
dplyr::if_else(HWE, "hwe", GROUPINGS))
) %>%
tibble::add_column(.data = ., STRATA = pop, .after = 1)
return(hw.res)
})#hwe_radiator
x <- radiator_future(
.x = x,
.f = hwe_radiator,
pop = pop,
flat.future = "dfr",
split.vec = TRUE,
split.with = NULL,
split.chunks = 10L,
parallel.core = parallel.core,
forking = TRUE
)
return(x)
}#hwe_map
x %<>%
dplyr::group_split(.tbl = ., STRATA, .keep = TRUE) %>%
purrr::map_df(.x = ., .f = hwe_map, parallel.core = parallel.core)
}#hwe_analysis
#' @title blacklist_hw
#' @description blacklist hw
#' @rdname blacklist_hw
#' @keywords internal
#' @export
#' @author Thierry Gosselin \email{thierrygosselin@@icloud.com}
blacklist_hw <- function(
x,
unfiltered.data,
data.temp,
hw.pop.threshold,
path.folder = NULL,
# filters.parameters.path,
pop.id.levels
) {
if (is.null(path.folder)) path.folder <- getwd()
x <- dplyr::ungroup(x) %>%
dplyr::filter(N_POP_HWD == hw.pop.threshold)
bl_map <- function(
x,
unfiltered.data,
hw.pop.threshold,
path.folder#filters.parameters.path
) {
if (nrow(x) > 0) {
significance.group <- unique(x$SIGNIFICANCE)
significance.group <- dplyr::case_when(
significance.group == "*****" ~ 0.00001,
significance.group == "****" ~ 0.0001,
significance.group == "***" ~ 0.001,
significance.group == "**" ~ 0.01,
significance.group == "*" ~ 0.05) %>%
format(., scientific = FALSE)
blacklist.filename <- file.path(
path.folder,
stringi::stri_join(
"blacklist.markers.hwd", significance.group, "mid.p.value",
hw.pop.threshold, "hw.pop.threshold", "tsv", sep = "."))
whitelist.filename <- file.path(
path.folder,
stringi::stri_join(
"whitelist.markers.hwe", significance.group, "mid.p.value",
hw.pop.threshold, "hw.pop.threshold", "tsv", sep = "."))
rad.filename <- file.path(
path.folder,
stringi::stri_join(
"tidy.filtered.hwe", significance.group, "mid.p.value",
hw.pop.threshold, "hw.pop.threshold", "rad", sep = "."))
# Generate the blacklist
want <- c("MARKERS", "CHROM", "LOCUS", "POS")
blacklist <- dplyr::distinct(x, MARKERS) %>%
dplyr::left_join(
dplyr::select(unfiltered.data, tidyselect::any_of(want)) %>%
dplyr::distinct(MARKERS, .keep_all = TRUE)
, by = "MARKERS") %>%
readr::write_tsv(
x = .,
file = blacklist.filename)
# Generate the rad data + the whitelist
if (!is.null(data.temp)) {
whitelist <- suppressWarnings(
unfiltered.data %>%
dplyr::bind_rows(data.temp) %>%
dplyr::mutate(STRATA = factor(STRATA, levels = pop.id.levels)) %>%
dplyr::filter(!MARKERS %in% blacklist$MARKERS))
radiator::write_rad(data = whitelist, path = rad.filename)
whitelist %>%
dplyr::select(tidyselect::any_of(want)) %>%
dplyr::distinct(MARKERS, .keep_all = TRUE) %>%
readr::write_tsv(x = ., file = whitelist.filename)
} else {
whitelist <- suppressWarnings(
unfiltered.data %>%
dplyr::filter(!MARKERS %in% blacklist$MARKERS))
radiator::write_rad(data = whitelist, path = rad.filename)
whitelist %>%
dplyr::select(tidyselect::any_of(want)) %>%
dplyr::distinct(MARKERS, .keep_all = TRUE) %>%
readr::write_tsv(x = ., file = whitelist.filename)
}
}
}#End bl_map
split(x = x, f = x$SIGNIFICANCE) %>%
purrr::walk(
.x = ., .f = bl_map,
unfiltered.data = unfiltered.data,
hw.pop.threshold = hw.pop.threshold,
path.folder = path.folder)
return(message("done!"))
}#End blacklist_hw
#' @title update filter parameter file
#' @description update filter parameter file
#' @rdname update_filter_parameter
#' @keywords internal
#' @export
#' @author Thierry Gosselin \email{thierrygosselin@@icloud.com}
update_filter_parameter <- function(filter, unfiltered,
filtered, parameter,
threshold, param.path) {
# Update filters.parameters SNP ----------------------------------------------
snp.before <- as.integer(dplyr::n_distinct(unfiltered$MARKERS))
snp.after <- as.integer(dplyr::n_distinct(filtered$MARKERS))
snp.blacklist <- snp.before - snp.after
if (tibble::has_name(unfiltered, "LOCUS")) {
locus.before <- as.integer(dplyr::n_distinct(unfiltered$LOCUS))
locus.after <- as.integer(dplyr::n_distinct(filtered$LOCUS))
locus.blacklist <- locus.before - locus.after
} else {
locus.before <- as.character("NA")
locus.after <- as.character("NA")
locus.blacklist <- as.character("NA")
}
markers.before <- stringi::stri_join(snp.before, locus.before, sep = "/")
markers.after <- stringi::stri_join(snp.after, locus.after, sep = "/")
markers.blacklist <- stringi::stri_join(snp.blacklist, locus.blacklist, sep = "/")
filters.parameters <- tibble::tibble(
FILTERS = filter,
PARAMETERS = parameter,
VALUES = threshold,
BEFORE = markers.before,
AFTER = markers.after,
BLACKLIST = markers.blacklist,
UNITS = c("SNP/LOCUS"),
COMMENTS = c("")
) %>%
readr::write_tsv(x = ., file = param.path,
append = TRUE, col_names = FALSE)
return(res = list(filters.parameters, markers.before, markers.blacklist, markers.after))
}#End update_filter_parameter
# testing
# test using nest and purrr map
# test <- data %>% dplyr::group_by(STRATA) %>%
# tidyr::nest(data = ., .key = "DATA") %>%
# dplyr::filter(STRATA %in% c("ATL", "MAL")) %>%
# dplyr::mutate(ANALYSIS = purrr::map(.x = .$DATA, .f = hwe_analysis)) %>%
# dplyr::select(-DATA) %>%
# tidyr::unnest(.)
# test using furrr and future
# library(furrr)
# oplan <- future::plan()
# future::plan(multiprocess(workers = parallel.core))
#
# if (dplyr::n_distinct(data$STRATA) > 2 * parallel.core) {
# future.scheduling <- 2L
# } else {
# future.scheduling <- 1L
# }
#
# system.time(
# test <- data %>%
# dplyr::filter(!STRATA %in% c("OVERALL")) %>%
# dplyr::group_by(STRATA) %>%
# tidyr::nest(data = ., .key = "DATA") %>%
# dplyr::mutate(ANALYSIS = furrr::future_map(
# .x = .$DATA,
# .f = hwe_analysis,
# future.scheduling = future.scheduling)) %>%
# dplyr::select(-DATA) %>%
# tidyr::unnest(.)
# )
# on.exit(plan(oplan), add = TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.