# detect mixed genomes
#' @title Detect mixed genomes
#' @description Highlight outliers individual's observed heterozygosity for a quick
#' diagnostic of mixed samples or poor polymorphism discovery due to DNA quality,
#' sequencing effort, etc.
#' @param data (4 options) A file or object generated by radiator:
#' \itemize{
#' \item tidy data
#' \item Genomic Data Structure (GDS)
#' }
#'
#' \emph{How to get GDS and tidy data ?}
#' Look into \code{\link{tidy_genomic_data}},
#' \code{\link{read_vcf}} or
#' \code{\link{tidy_vcf}}.
#' @param detect.mixed.genomes (optional, logical) For use inside radiator pipelines.
#' Default: \code{detect.mixed.genomes = TRUE}.
#' @param ind.heterozygosity.threshold (string, double, optional)
#' Blacklist individuals based on observed heterozygosity (averaged across markers).
#'
#'
#' The string contains 2 thresholds values (min and max).
#' The values are proportions (0 to 1), where 0 turns off the min threshold and
#' 1 turns off the max threshold.
#' Individuals with mean observed heterozygosity higher (>) or lower (<)
#' than the thresholds will be blacklisted.
#'
#' Default: \code{ind.heterozygosity.threshold = NULL} will turn off completely
#' the filter and the function will only output the plots and table of heterozygosity.
#' @param by.strata (optional, logical)
#' Can only be used when \code{interactive.filter = TRUE}, allows to enter
#' heterozygosity thresholds by strata, instead of overall. This is not recommended
#' to use inside \code{\link{filter_rad}} or when doing population discovery.
#' This is a good use when dealing with different species.
#' Default: \code{by.strata = FALSE}.
#' @inheritParams radiator_common_arguments
#' @details To help discard an individual based on his observed heterozygosity
#' (averaged across markers),
#' use the manhanttan plot to:
#' \enumerate{
#' \item contrast the individual with population and overall samples.
#' \item visualize the impact of missigness information (based on population or
#' overall number of markers) and the individual observed heterozygosity. The
#' larger the point, the more missing genotypes.
#' }
#' \strong{Outlier above average:}
#' \itemize{
#' \item potentially represent two samples mixed together (action: blacklist), or...
#' \item a sample with more sequecing effort (point size small): did you merge your replicates fq files ? (action : keep and monitor)
#' \item a sample with poor sequencing effort (point size large) where the genotyped markers are
#' all heterozygotes, verify this with missingness (action: discard)
#' }
#' In all cases, if there is no bias in the population sequencing effort,
#' the size of the point will usually be "average" based on the population or
#' overall number of markers.
#'
#'
#' You can visualize individual observed heterozygosity, choose thresholds and
#' then visualize, choose thresholds and filter markers based on observed
#' heterozygosity in one run with: \pkg{radiator} \code{\link{filter_het}}.
#'
#' \strong{Outlier below average:}
#' \itemize{
#' \item A point with a size larger than the population or overall average (= lots of missing):
#' the poor polymorphism discovery of the sample is probably the result of bad
#' DNA quality, a bias in sequencing effort, etc. (action: blacklist)
#' \item A point with a size that looks average (not much missing):
#' this sample requires more attention (action: blacklist) and conduct more tests.
#' e.g. for biallelic data, look for coverage imbalance between ALT/REF allele.
#' At this point you need to distinguish between an artifact of poor polymorphism discovery
#' or a biological reason (highly inbred individual, etc.).
#' }
#'
#' \strong{heterozygosity and missing data:}
#' If you see a pattern with the heterozygosity and missing data,
#' try changing the genotyping rate required to keep markers and/or individuals.
#' @return The function returns inside the global environment a list with
#' 5 objects:
#'
#' \enumerate{
#' \item the individual's heterozigosity (\code{$individual.heterozygosity})
#' a dataframe containing for each individual, the population id, the number of
#' genotyped markers, the number of missing genotypes (based on the number of
#' markers of the population and overall), the number of markers genotyped as heterozygote
#' and it's proportion based on the number of genotyped markers.
#' \item the heterozygosity statistics per populations and overall:\code{$heterozygosity.statistics}
#' \item the blacklisted individuals if \code{ind.heterozygosity.threshold} was selected: \code{$blacklist.ind.het}
#' \item the boxplot of individual heterozygosity:\code{$individual.heterozygosity.boxplot}
#' \item the manhattan plot of individual heterozygosity (\code{$individual.heterozygosity.manhattan.plot})
#' contrasted with missingness proportion based on the number of markers (population or overall).
#' The 2 facets will be identical when the dataset as common markers between
#' the populations. The dotted lines are the mean hetegozygosities.
#' }
#' @rdname detect_mixed_genomes
#' @export
#' @examples
#' \dontrun{
#' #Step1: highlight outlier individuals, the simplest way to run:
#'
#' data <- radiator::detect_mixed_genomes(data = "wombat_tidy.rad")
#' # where the .rad file is a tidy data set generated by radiator.
#'
#' # Or if you don't have a tidy df:
#'
#' data <- radiator::tidy_genomic_data(
#' data = "wombat.vcf",
#' strata = "strata.wombat.tsv"
#' ) %>%
#' radiator::detect_mixed_genomes(.)
#'
#' }
#' @note Thanks to Peter Grewe for very useful comments
#' on previous version of this function.
#' @author Thierry Gosselin \email{thierrygosselin@@icloud.com}
detect_mixed_genomes <- function(
data,
interactive.filter = TRUE,
detect.mixed.genomes = TRUE,
ind.heterozygosity.threshold = NULL,
by.strata = FALSE,
verbose = TRUE,
parallel.core = parallel::detectCores() - 1,
...
) {
## Test
# interactive.filter = TRUE
# detect.mixed.genomes = TRUE
# ind.heterozygosity.threshold = NULL
# verbose = TRUE
# path.folder <- NULL
# parameters = NULL
# parallel.core = parallel::detectCores() - 1
# by.strata = FALSE
# internal <- FALSE
if (interactive.filter || detect.mixed.genomes) {
if (interactive.filter) verbose <- TRUE
# Cleanup-------------------------------------------------------------------
radiator_function_header(f.name = "detect_mixed_genomes", verbose = verbose)
file.date <- format(Sys.time(), "%Y%m%d@%H%M")
if (verbose) message("Execution date@time: ", file.date)
old.dir <- getwd()
opt.change <- getOption("width")
options(width = 70)
timing <- radiator_tic()
#back to the original directory and options
on.exit(setwd(old.dir), add = TRUE)
on.exit(options(width = opt.change), add = TRUE)
on.exit(radiator_toc(timing, verbose = verbose), add = TRUE)
on.exit(radiator_function_header(f.name = "detect_mixed_genomes", start = FALSE, verbose = verbose), add = TRUE)
# Function call and dotslist -------------------------------------------------
rad.dots <- radiator_dots(
func.name = as.list(sys.call())[[1]],
fd = rlang::fn_fmls_names(),
args.list = as.list(environment()),
dotslist = rlang::dots_list(..., .homonyms = "error", .check_assign = TRUE),
keepers = c("path.folder", "parameters", "internal"),
verbose = FALSE
)
# Checking for missing and/or default arguments ------------------------------
if (missing(data)) rlang::abort("data is missing")
# Folders---------------------------------------------------------------------
path.folder <- generate_folder(
f = path.folder,
rad.folder = "detect_mixed_genomes",
internal = internal,
file.date = file.date,
verbose = verbose)
# write the dots file
write_rad(
data = rad.dots,
path = path.folder,
filename = stringi::stri_join("radiator_detect_mixed_genomes_args_", file.date, ".tsv"),
tsv = TRUE,
internal = internal,
verbose = verbose
)
# Detect format --------------------------------------------------------------
data.type <- radiator::detect_genomic_format(data)
if (!data.type %in% c("SeqVarGDSClass", "gds.file")) {
# Tidy data
data <- radiator::tidy_wide(data = data, import.metadata = TRUE)
n.pop <- dplyr::n_distinct(data$STRATA)
# Filter parameter file: generate and initiate -------------------------------
filters.parameters <- radiator_parameters(
generate = TRUE,
initiate = TRUE,
update = FALSE,
parameter.obj = parameters,
data = data,
path.folder = path.folder,
file.date = file.date,
internal = internal,
verbose = verbose)
# highlight heterozygote and missing (optimized for speed depending on data)
# you see the difference with > 30K SNP
if (tibble::has_name(data, "GT_BIN")) {
n.markers.pop <- dplyr::filter(data, !is.na(GT_BIN))
n.markers.overall <- dplyr::n_distinct(data$MARKERS[!is.na(data$GT_BIN)])
} else {
n.markers.pop <- dplyr::filter(data, GT != "000000")
n.markers.overall <- dplyr::n_distinct(data$MARKERS[data$GT != "000000"])
}
n.markers.pop <- dplyr::distinct(n.markers.pop, MARKERS, STRATA) %>%
dplyr::count(x = ., STRATA)
# For figure het and missingness:
# 1 or 2 facets that include missingness computed by pop or overall
missing.group <- "1"
if (length(unique(n.markers.pop$n)) == 1 && unique(n.markers.pop$n) == n.markers.overall) {
missing.group <- "1"
} else {
missing.group <- "2"
}
if (tibble::has_name(data, "GT_BIN")) {
het.summary <- dplyr::mutate(
.data = data,
HET = dplyr::if_else(GT_BIN == 1, 1, 0, missing = 0)) %>%
dplyr::group_by(INDIVIDUALS) %>%
dplyr::mutate(GENOTYPED = length(GT_BIN[!is.na(GT_BIN)])) %>%
dplyr::ungroup(.)
} else {
het.summary <- dplyr::mutate(
.data = data,
HET = dplyr::if_else(
stringi::stri_sub(GT, 1, 3) != stringi::stri_sub(GT, 4, 6), 1, 0)) %>%
dplyr::group_by(INDIVIDUALS) %>%
dplyr::mutate(GENOTYPED = length(INDIVIDUALS[GT != "000000"])) %>%
dplyr::ungroup(.)
}
# Step 1. Highlight individual's heterozygosity -----------------------------
# Heterozygosity at the individual level before looking at the markers level per population
# It's a good way to do outlier diagnostic ... mixed individuals
# Create a new df with heterozygote info
if (is.factor(het.summary$STRATA)) {
pop.levels <- levels(het.summary$STRATA)
} else {
pop.levels <- unique(het.summary$STRATA)
}
het.ind <- dplyr::select(.data = het.summary, STRATA, INDIVIDUALS, HET, GENOTYPED) %>%
dplyr::full_join(n.markers.pop, by = "STRATA") %>%
dplyr::group_by(INDIVIDUALS) %>%
dplyr::mutate(
MISSING_PROP_POP = (n - GENOTYPED) / n,
MISSING_PROP_OVERALL = (n.markers.overall - GENOTYPED) / n.markers.overall
) %>%
dplyr::group_by(INDIVIDUALS, STRATA) %>%
dplyr::summarise(
GENOTYPED = unique(GENOTYPED),
MISSING_PROP_POP = unique(MISSING_PROP_POP),
MISSING_PROP_OVERALL = unique(MISSING_PROP_OVERALL),
HET_NUMBER = length(HET[HET == 1]),
HET_PROP = HET_NUMBER / GENOTYPED,
.groups = "keep"
) %>%
dplyr::arrange(STRATA, HET_PROP) %>%
dplyr::ungroup(.) %>%
readr::write_tsv(
x = .,
file = file.path(path.folder, "individual.heterozygosity.tsv"),
col_names = TRUE)
het.ind.overall <- dplyr::mutate(.data = het.ind, STRATA = as.character(STRATA)) %>%
dplyr::bind_rows(dplyr::mutate(.data = het.ind, STRATA = rep("OVERALL", n()))) %>%
dplyr::mutate(STRATA = factor(STRATA, levels = c(pop.levels, "OVERALL"))) %>%
radiator::rad_long(
x = .,
cols = c("STRATA", "INDIVIDUALS", "GENOTYPED", "HET_NUMBER", "HET_PROP"),
names_to = "MISSING_GROUP",
values_to = "MISSING_PROP",
variable_factor = FALSE
) %>%
dplyr::mutate(MISSING_GROUP = factor(MISSING_GROUP, levels = c("MISSING_PROP_POP", "MISSING_PROP_OVERALL"))) %>%
dplyr::arrange(STRATA, MISSING_GROUP)
} else {
if (data.type == "gds.file") {
data <- radiator::read_rad(data, verbose = verbose)
data.type <- "SeqVarGDSClass"
}
radiator_packages_dep(package = "SeqArray", cran = FALSE, bioc = TRUE)
# Filter parameter file: generate and initiate -------------------------------
filters.parameters <- radiator_parameters(
generate = TRUE,
initiate = TRUE,
update = FALSE,
parameter.obj = parameters,
data = data,
path.folder = path.folder,
file.date = file.date,
internal = internal,
verbose = verbose)
# GDS....
missing.group <- "1"
id.stats <- generate_stats(
gds = data,
markers = FALSE,
missing = TRUE,
heterozygosity = TRUE,
coverage = FALSE,
allele.coverage = FALSE,
exhaustive = FALSE,
path.folder = path.folder,
plot = FALSE,
file.date = file.date,
parallel.core = parallel.core,
verbose = FALSE)
het.ind <- id.stats$i.info %>%
dplyr::select(INDIVIDUALS, STRATA, HET_PROP = HETEROZYGOSITY, MISSING_PROP_OVERALL = MISSING_PROP) %>%
dplyr::mutate(MISSING_PROP_POP = MISSING_PROP_OVERALL)
# Need the equivalent for this:
if (is.factor(het.ind$STRATA)) {
pop.levels <- levels(het.ind$STRATA)
} else {
pop.levels <- unique(het.ind$STRATA)
}
n.pop <- length(pop.levels)
id.stats <- NULL
het.ind.overall <- dplyr::mutate(.data = het.ind, STRATA = as.character(STRATA)) %>%
dplyr::bind_rows(dplyr::mutate(.data = het.ind, STRATA = rep("OVERALL", n()))) %>%
dplyr::mutate(STRATA = factor(STRATA, levels = c(pop.levels, "OVERALL"), ordered = TRUE)) %>%
radiator::rad_long(
x = .,
cols = c("STRATA", "INDIVIDUALS", "HET_PROP"),
names_to = "MISSING_GROUP",
values_to = "MISSING_PROP"
) %>%
dplyr::mutate(
MISSING_GROUP = factor(MISSING_GROUP,
levels = c("MISSING_PROP_POP", "MISSING_PROP_OVERALL")),
MISSING_PROP = as.numeric(MISSING_PROP)) %>%
dplyr::arrange(STRATA, MISSING_GROUP)
}
#Stats------------------------------------------------------------------------
message("Calculating heterozygosity statistics")
het.ind.stats <- dplyr::filter(het.ind.overall,
MISSING_GROUP != "MISSING_PROP_POP") %>%
dplyr::group_by(STRATA) %>%
dplyr::summarise(
MEAN = mean(HET_PROP, na.rm = TRUE),
MEDIAN = stats::median(HET_PROP, na.rm = TRUE),
SD = stats::sd(HET_PROP, na.rm = TRUE),
Q25 = stats::quantile(HET_PROP, 0.25, na.rm = TRUE),
Q75 = stats::quantile(HET_PROP, 0.75, na.rm = TRUE),
IQR = stats::IQR(HET_PROP, na.rm = TRUE),
MIN = min(HET_PROP, na.rm = TRUE),
MAX = max(HET_PROP, na.rm = TRUE),
OUTLIERS_LOW = Q25 - (1.5 * IQR),
OUTLIERS_HIGH = Q75 + (1.5 * IQR),
OUTLIERS_LOW_N = length(HET_PROP[HET_PROP < OUTLIERS_LOW]),
OUTLIERS_HIGH_N = length(HET_PROP[HET_PROP > OUTLIERS_HIGH]),
OUTLIERS_TOTAL = OUTLIERS_HIGH_N + OUTLIERS_LOW_N,
OUTLIERS_PROP = round(OUTLIERS_TOTAL / length(HET_PROP), 3)) %>%
dplyr::mutate(dplyr::across(where(is.numeric), .fns = round, digits = 6)) %>%
tidyr::unite(data = ., HET_RANGE, MIN, MAX, sep = " - ") %>%
dplyr::arrange(STRATA) %>%
readr::write_tsv(
x = .,
file = file.path(path.folder, "heterozygosity.statistics.tsv"),
col_names = TRUE)
# Plots ----------------------------------------------------------------------
message("Generating plots")
# outlier info
overall.outlier.low <- het.ind.stats$OUTLIERS_LOW[het.ind.stats$STRATA == "OVERALL"]
overall.outlier.high <- het.ind.stats$OUTLIERS_HIGH[het.ind.stats$STRATA == "OVERALL"]
rounder <- function(x, accuracy, f = round) {
f(x / accuracy) * accuracy
}
y.breaks.by <- rounder(max(het.ind$HET_PROP, na.rm = TRUE)/10, 0.001, ceiling)
y.breaks.max <- rounder(max(het.ind$HET_PROP, na.rm = TRUE), 0.001, ceiling) + (y.breaks.by / 2)
y.breaks.min <- rounder(min(het.ind$HET_PROP, na.rm = TRUE), 0.001, ceiling) - (y.breaks.by / 2)
y.breaks <- seq(y.breaks.min, y.breaks.max, by = y.breaks.by)
# labeller to rename in the facet_grid or facet_wrap call:
if (missing.group == "2") {
facet_names <- ggplot2::as_labeller(c(`MISSING_PROP_OVERALL` = "Missing (overall)",
`MISSING_PROP_POP` = "Missing (populations)"))
} else {
facet_names <- ggplot2::as_labeller(c(`MISSING_PROP_OVERALL` = "Missing (overall)"))
het.ind.overall <- dplyr::filter(het.ind.overall, MISSING_GROUP != "MISSING_PROP_POP")
}
# manhattan ----------------------------------------------------------------
het.manhattan <- ggplot2::ggplot(
data = het.ind.overall,
ggplot2::aes(x = STRATA, y = HET_PROP, size = as.numeric(MISSING_PROP), colour = STRATA)) +
ggplot2::geom_jitter(alpha = 0.6) +
ggplot2::scale_y_continuous(name = "Mean Observed Heterozygosity (proportion)",
breaks = y.breaks, labels = y.breaks, limits = c(y.breaks.min, y.breaks.max)) + #y.breaks
ggplot2::scale_color_discrete(guide = "none") +
ggplot2::scale_size_continuous(name = "Missing proportion") +
ggplot2::labs(
title = "Genome-wide individual's mean observed heterozygosity",
subtitle = stringi::stri_join(
"overall data outlier thresholds (low/high): ", overall.outlier.low, "/",
overall.outlier.high)
) +
ggplot2::theme(
panel.grid.major.x = ggplot2::element_blank(),
plot.title = ggplot2::element_text(size = 12, family = "Helvetica", face = "bold", hjust = 0.5),
plot.subtitle = ggplot2::element_text(size = 10, family = "Helvetica", hjust = 0.5),
axis.line.x = ggplot2::element_blank(),
axis.title.x = ggplot2::element_blank(),
axis.text.x = ggplot2::element_blank(),
axis.ticks.x = ggplot2::element_blank(),
axis.title.y = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
axis.text.y = ggplot2::element_text(size = 8, family = "Helvetica")
) +
ggplot2::geom_hline(mapping = ggplot2::aes(yintercept = OUTLIERS_LOW),
het.ind.stats, linetype = "dashed", size = 0.6) + #low
ggplot2::geom_hline(mapping = ggplot2::aes(yintercept = MEAN),
het.ind.stats, linetype = "dotted", size = 0.6) +#mean
ggplot2::geom_hline(mapping = ggplot2::aes(yintercept = OUTLIERS_HIGH),
het.ind.stats, linetype = "dashed", size = 0.6) + #high
ggplot2::facet_grid(MISSING_GROUP ~ factor(STRATA), switch = "x", scales = "free",
labeller = ggplot2::labeller(MISSING_GROUP = facet_names))
# het.manhattan
# n.pop <- 20
ggplot2::ggsave(
filename = file.path(path.folder, "individual.heterozygosity.manhattan.plot.pdf"),
plot = het.manhattan,
width = max(5 * n.pop, 20), height = 10 * as.numeric(missing.group),
dpi = 600, units = "cm", useDingbats = FALSE, limitsize = FALSE)
ggplot2::ggsave(
filename = file.path(path.folder, "individual.heterozygosity.manhattan.plot.png"),
plot = het.manhattan,
width = max(5 * n.pop, 20),
height = 10 * as.numeric(missing.group),
dpi = 200,
units = "cm",
limitsize = FALSE
)
het.bp <- ggplot2::ggplot(data = het.ind.overall,
ggplot2::aes(x = STRATA, y = HET_PROP, colour = STRATA)) +
ggplot2::geom_boxplot() +
ggplot2::labs(
x = "Populations",
title = "Genome-wide individual's mean observed heterozygosity",
subtitle = stringi::stri_join(
"overall data outlier thresholds (low/high): ", overall.outlier.low, "/",
overall.outlier.high),
colour = "Populations") +
ggplot2::scale_y_continuous(
name = "Mean Observed Heterozygosity (proportion)",
breaks = y.breaks, labels = y.breaks, limits = c(y.breaks.min, y.breaks.max)) + #y.breaks
# breaks = y.breaks, limits = c(0, y.breaks.max), expand = c(0.06, 0)) +
ggplot2::theme_classic() +
ggplot2::theme(
legend.position = "none",
plot.title = ggplot2::element_text(size = 12, family = "Helvetica", face = "bold", hjust = 0.5),
plot.subtitle = ggplot2::element_text(size = 10, family = "Helvetica", hjust = 0.5),
axis.title.x = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
axis.text.x = ggplot2::element_text(size = 10, family = "Helvetica"),
axis.title.y = ggplot2::element_text(size = 10, family = "Helvetica", face = "bold"),
axis.text.y = ggplot2::element_text(size = 8, family = "Helvetica")
)
# het.bp
ggplot2::ggsave(
filename = file.path(path.folder, "individual.heterozygosity.boxplot.pdf"),
plot = het.bp,
width = max(4 * n.pop, 20), height = 10, dpi = 600, units = "cm",
useDingbats = FALSE, limitsize = FALSE)
ggplot2::ggsave(
filename = file.path(path.folder, "individual.heterozygosity.boxplot.png"),
plot = het.bp,
width = max(4 * n.pop, 20), height = 10, dpi = 200, units = "cm",
limitsize = FALSE)
het.ind.overall <- NULL
if (interactive.filter) {
message("\nThe greatest value of a picture is when it forces us
to notice what we never expected to see.
\nJohn W. Tukey. Exploratory Data Analysis. 1977.")
message("\n\nInspect plots and tables in folder created...")
bl.id.het <- radiator_question(
x = " Do you want to exclude individuals based on heterozygosity ? (y/n): ", answer.opt = c("y", "n"))
if (bl.id.het == "y") {
if (by.strata) {
ind.het.thresholds.by.strata <- het.ind.stats %>%
dplyr::select(
STRATA,
THRESHOLD_OUTLIERS_LOW = OUTLIERS_LOW,
THRESHOLD_OUTLIERS_HIGH = OUTLIERS_HIGH
) %>%
dplyr::filter(STRATA != "OVERALL")
readr::write_tsv(
x = ind.het.thresholds.by.strata,
file = file.path(path.folder, "ind.het.thresholds.by.strata.tsv"),
append = FALSE,
col_names = TRUE
)
message("\nFile written: ind.het.thresholds.by.strata.tsv")
message("In directory: ", path.folder)
message("\nOpen in a text editor or MS Excel and adjust the thresholds by strata")
finished <- radiator_question(
x = "\nWhen finished with the file, save and come back here and type: `y`:",
answer.opt = c("y", "Y", "yes", "Yes", "YES"))
ind.het.thresholds.by.strata <- readr::read_tsv(
file = file.path(path.folder, "ind.het.thresholds.by.strata.tsv"),
col_types = "cdd")
ind.heterozygosity.threshold <- ind.het.thresholds.by.strata
} else {
mix.text <- " Enter the min value for ind.heterozygosity.threshold argument (0 turns off): "
threshold.min <- radiator_question(x = mix.text, minmax = c(0, 1))
mix.text <- " Enter the max value for ind.heterozygosity.threshold argument (1 turns off): "
# threshold.max <- as.numeric(readLines(n = 1))
threshold.max <- radiator_question(x = mix.text, minmax = c(0, 1))
ind.heterozygosity.threshold <- as.numeric(c(threshold.min, threshold.max))
}
} else {
threshold.min <- 0
threshold.max <- 1
ind.heterozygosity.threshold <- NULL
}
} else {
if (!is.null(ind.heterozygosity.threshold)) {
threshold.min <- ind.heterozygosity.threshold[1]
threshold.max <- ind.heterozygosity.threshold[2]
}
}
## Step 2: Blacklist outlier individuals -------------------------------------
if (!is.null(ind.heterozygosity.threshold)) {
if (by.strata) {
blacklist.ind.het <- dplyr::ungroup(het.ind) %>%
dplyr::right_join(ind.het.thresholds.by.strata, by = "STRATA") %>%
dplyr::filter(HET_PROP > THRESHOLD_OUTLIERS_HIGH | HET_PROP < THRESHOLD_OUTLIERS_LOW) %>%
dplyr::distinct(INDIVIDUALS)
} else {
blacklist.ind.het <- dplyr::ungroup(het.ind) %>%
dplyr::filter(HET_PROP > threshold.max | HET_PROP < threshold.min) %>%
dplyr::distinct(INDIVIDUALS)
}
message("Filter individual's heterozygosity: ", length(blacklist.ind.het$INDIVIDUALS), " individual(s) blacklisted")
readr::write_tsv(
x = blacklist.ind.het,
file = file.path(path.folder, "blacklist.ind.het.tsv"),
col_names = TRUE
)
} else {
blacklist.ind.het <- "ind.heterozygosity.threshold is necessary to get a blacklist of individuals"
}
check.mono <- FALSE
if (!is.null(nrow(blacklist.ind.het)) && nrow(blacklist.ind.het) > 0) {
check.mono <- TRUE
n.ind.blacklisted <- length(blacklist.ind.het$INDIVIDUALS)
if (data.type == "SeqVarGDSClass") {
id.info <- extract_individuals_metadata(gds = data, whitelist = FALSE) %>%
dplyr::mutate(
FILTERS = dplyr::if_else(
INDIVIDUALS %in% blacklist.ind.het$INDIVIDUALS,
"detect_mixed_genomes", FILTERS
)
)
update_radiator_gds(gds = data, node.name = "individuals.meta", value = id.info, sync = TRUE)
} else {
data %<>% dplyr::filter(!INDIVIDUALS %in% blacklist.ind.het$INDIVIDUALS)
}
}#End blacklisted
# updating parameters
if (by.strata && !is.null(ind.heterozygosity.threshold)) {
ind.het.thresholds.by.strata %<>%
dplyr::mutate(
THRESHOLDS = stringi::stri_join(THRESHOLD_OUTLIERS_LOW, THRESHOLD_OUTLIERS_HIGH, sep = "/"),
VALUES = stringi::stri_join(STRATA, THRESHOLDS, sep = ": ")
)
filters.parameters <- radiator_parameters(
generate = FALSE,
initiate = FALSE,
update = TRUE,
parameter.obj = filters.parameters,
data = data,
filter.name = "detect mixed genomes",
param.name = "ind.heterozygosity.threshold (min/max) by strata",
values = paste(ind.het.thresholds.by.strata$VALUES, collapse = ", "),
path.folder = path.folder,
file.date = file.date,
internal = internal,
verbose = verbose)
} else {
filters.parameters <- radiator_parameters(
generate = FALSE,
initiate = FALSE,
update = TRUE,
parameter.obj = filters.parameters,
data = data,
filter.name = "detect mixed genomes",
param.name = "ind.heterozygosity.threshold (min/max)",
values = paste(threshold.min, threshold.max, collapse = " / "),
path.folder = path.folder,
file.date = file.date,
internal = internal,
verbose = verbose)
}
# results ------------------------------------------------------------------
if (by.strata && !is.null(ind.heterozygosity.threshold)) {
radiator_results_message(
rad.message = stringi::stri_join("Detect mixed genomes: ",
paste(ind.het.thresholds.by.strata$VALUES, collapse = ", ")),
filters.parameters,
internal,
verbose
)
} else {
radiator_results_message(
rad.message = stringi::stri_join("Detect mixed genomes: ",
paste(threshold.min, threshold.max,
collapse = " / ")),
filters.parameters,
internal,
verbose
)
}
# MONOMORPHIC MARKERS --------------------------------------------------
if (check.mono) {
data <- filter_monomorphic(
data = data,
parallel.core = parallel.core,
verbose = FALSE,
parameters = filters.parameters,
path.folder = path.folder,
internal = FALSE)
}
}#before this one
return(data)
}#End detect_mixed_genomes
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.