Nothing
#' @name gl.diagnostics.hwe
#' @title Provides descriptive stats and plots to diagnose potential problems
#' with Hardy-Weinberg proportions
#' @family matched report
#' @description Different causes may be responsible for lack of Hardy-Weinberg
#' proportions. This function helps diagnose potential problems.
#'
#' @inheritParams gl.report.hwe
#' @inheritParams utils.jackknife
#' @param n.cores The number of cores to use. If "auto", it will
#' use all but one available cores [default "auto"].
#' @param bins Number of bins to display in histograms [default 20].
#' @param stdErr Whether standard errors for Fis and Fst should be computed
#' (default: TRUE)
#' @param colors.hist List of two color names for the borders and fill of the
#' histogram [default gl.colors(2)].
#' @param colors.barplot Vector with two color names for the observed and
#' expected number of significant HWE tests [default gl.colors("2c")].
#' @param plot.theme User specified theme [default theme_dartR()].
#' @param plot.dir Directory in which to save files [default = working directory]
#' @param plot.file Name for the RDS binary file to save (base name only, exclude extension)
#' [default NULL]
#' @details This function initially runs \code{\link{gl.report.hwe}} and reports
#' the ternary plots. The remaining outputs follow the recommendations from
#' Waples
#' (2015) paper and De Meeûs 2018. These include:
#' \enumerate{
#' \item A histogram
#' with the distribution of p-values of the HWE tests. The distribution should
#' be roughly uniform across equal-sized bins.
#' \item A bar plot with observed
#' and expected (null expectation) number of significant HWE tests for the same
#' locus in multiple populations (that is, the x-axis shows whether a locus
#' results significant in 1, 2, ..., n populations. The y axis is the count of
#' these occurrences. The zero value on x-axis shows the number of
#' non-significant tests). If HWE tests are significant by chance alone,
#' observed and expected number of HWE tests should have roughly a similar
#' distribution.
#' \item A scatter plot with a linear regression between Fst and Fis,
#' averaged across subpopulations. De Meeûs 2018 suggests that in the case of
#' Null alleles, a strong positive relationship is expected (together with the
#' Fis standard error much larger than the Fst standard error, see below).
#' \bold{Note}, this is not the scatter plot that Waples 2015 presents in his
#' paper. In the lower right corner of the plot, the Pearson correlation
#' coefficient is reported.
#' \item The Fis and Fst (averaged over loci and
#' subpopulations) standard errors are also printed on screen and reported in
#' the returned list (if \code{stdErr=TRUE}). These are computed with the
#' Jackknife method over loci (See De Meeûs 2007 for details on how this is
#' computed) and it may take some time for these computations to complete.
#' De Meeûs 2018 suggests that under a global significant heterozygosity
#' deficit:
#' - if the
#' correlation between Fis and Fst is strongly positive, and StdErrFis >>
#' StdErrFst, Null alleles are likely to be the cause.
#' - if the correlation
#' between Fis and Fst is ~0 or mildly positive, and StdErrFis > StdErrFst,
#' Wahlund may be the cause.
#' - if the correlation between Fis and Fst is ~0, and
#' StdErrFis ~ StdErrFst, selfing or sib mating could to be the cause.
#' It is
#' important to realise that these statistics only suggest a pattern (pointers).
#' Their absence is not conclusive evidence of the absence of the problem, as
#' their presence does not confirm the cause of the problem.
#' \item A table where the
#' number of observed and expected significant HWE tests are reported by each
#' population, indicating whether these are due to heterozygosity excess or
#' deficiency. These can be used to have a clue of potential problems (e.g.
#' deficiency might be due to a Wahlund effect, presence of null alleles or
#' non-random sampling; excess might be due to sex linkage or different
#' selection between sexes, demographic changes or small Ne. See Table 1 in
#' Wapples 2015). The last two columns of the table generated by this function
#' report chisquare values and their associated p-values. Chisquare is computed
#' following Fisher's procedure for a global test (Fisher 1970). This basically
#' tests whether there is at least one test that is truly significant in the
#' series of tests conducted (De Meeûs et al 2009).
#' }
#' @author Custodian: Carlo Pacioni -- Post to
#' \url{https://groups.google.com/d/forum/dartr}
#' @examples
#' \donttest{
#' require("dartR.data")
#' res <- gl.diagnostics.hwe(x = gl.filter.allna(platypus.gl[,1:50]),
#' stdErr=FALSE, n.cores=1)
#' }
#' @references \itemize{
#' \item de Meeûs, T., McCoy, K.D., Prugnolle, F.,
#' Chevillon, C., Durand, P., Hurtrez-Boussès, S., Renaud, F., 2007. Population
#' genetics and molecular epidemiology or how to “débusquer la bête”. Infection,
#' Genetics and Evolution 7, 308-332.
#' \item De Meeûs, T., Guégan, J.-F., Teriokhin, A.T., 2009. MultiTest V.1.2,
#' a program to binomially combine independent tests and performance comparison
#' with other related methods on
#' proportional data. BMC Bioinformatics 10, 443-443.
#' \item De Meeûs, T., 2018. Revisiting FIS, FST, Wahlund Effects, and Null
#' Alleles. Journal of Heredity 109, 446-456.
#' \item Fisher, R., 1970.
#' Statistical methods for research workers Edinburgh: Oliver and Boyd.
#' \item
#' Waples, R. S. (2015). Testing for Hardy–Weinberg proportions: have we lost
#' the plot?. Journal of heredity, 106(1), 1-19.
#' }
#'
#' @seealso \code{\link{gl.report.hwe}}
#' @rawNamespace import(data.table, except = c(melt,dcast))
#' @export
#' @return A list with the table with the summary of the HWE tests and (if
#' stdErr=TRUE) a named vector with the StdErrFis and StdErrFst.
gl.diagnostics.hwe <- function(x,
alpha_val = 0.05,
bins = 20,
stdErr = TRUE,
colors.hist = gl.colors(2),
colors.barplot = gl.colors("2c"),
plot.theme = theme_dartR(),
n.cores = "auto",
plot.file=NULL,
plot.dir=NULL,
verbose = NULL) {
# SET VERBOSITY
verbose <- gl.check.verbosity(verbose)
# SET WORKING DIRECTORY
plot.dir <- gl.check.wd(plot.dir,verbose=0)
# FLAG SCRIPT START
funname <- match.call()[[1]]
utils.flag.start(func = funname,
build = "v.2023.3",
verbose = verbose)
# CHECK DATATYPE
datatype <- utils.check.datatype(x, verbose = verbose)
# DO THE JOB
# Set NULL to variables to pass CRAN checks
Prob<-Sig<-N<-Locus<-Population<-Freq<-Data<-dumpop<-Deficiency<-Fis<-
Excess<-pvalue<-ChiSquare<-Fst<-gen <-He <-value<- variable <-fst_obs<-NULL
# Helper function
extractParam <- function(i, l, param) {
return(l[[i]][["overall"]][param])
}
#remove populations with only one individual
drop_pop <- names(which(table(pop(x)) <2))
if(length(drop_pop)>0){
x <- gl.drop.pop(x,pop.list = drop_pop)
cat(warn("Population with one individual were ignored in this analysis:", drop_pop, "\n"))
}
# Distribution of p-values by equal bins
suppressWarnings(hweout <- gl.report.hwe(x,
sig_only = FALSE,
verbose = 0))
p1 <- ggplot(hweout, aes(Prob)) +
geom_histogram(bins = bins,
color = colors.hist[1],
fill = colors.hist[2]) +
geom_hline(
aes(
yintercept = nrow(hweout) / (bins - 1),
linetype = "Mean number\nof significant\nHWE tests"
),
col = "red",
size = 1
) +
scale_linetype_manual(name = "", values = 'solid') +
coord_cartesian(xlim = c(0, 1)) +
xlab("Probability of departure from H-W proportions") +
ylab("Count") +
ggtitle("Distribution of p-values of HWE tests") +
plot.theme
# Fst vs Fis scatter plot with linear regression
# lpops <- seppop(x)
# lFstats <- lapply(lpops, utils.basic.stats, verbose = 0)
# lFstats <- lapply(lFstats, "[[", "perloc")
# Fstats <- rbindlist(l = lFstats, use.names = TRUE, idcol = TRUE)
Fstats <- utils.basic.stats(x)
# Number of loci out of HWE as a function of a population
hweout.dt <- data.table(hweout)
nTimesBypop <- hweout.dt[, .N, by = c("Locus", "Sig")]
setkey(nTimesBypop, Sig)
nTimesBypop.df <- as.data.frame(table(nTimesBypop["sig", N]))
# Include the non-sig tests
nTimesBypop.df <-
rbind(data.frame(Var1 = 0, Freq = nTimesBypop["no_sig", sum(N)]),
nTimesBypop.df)
nTimesBypop.df$Data <- "Observed"
# Generate the null distribution
nullDist <-
as.data.frame(table(rbinom(
length(hweout.dt[, unique(Locus)]),
size = length(hweout.dt[, unique(Population)]),
prob = alpha_val
)))
nullDist$Data <- "Null expectation"
# Compile the data for the plot
nTimesBypop.fin <- rbind(nTimesBypop.df, nullDist)
names(nTimesBypop.fin)[1] <- "nPop"
nTimesBypop.fin$nPop <- factor(as.integer(nTimesBypop.fin$nPop))
p2 <- ggplot(nTimesBypop.fin, aes(nPop, Freq, fill = Data)) +
geom_col(position = "dodge2",
alpha = 0.85,
color = "black") +
scale_fill_manual(values = c("Observed" = colors.barplot[1],
"Null expectation" = colors.barplot[2])) +
scale_y_log10() +
xlab("Number of populations in which loci depart from HWE") +
ylab("Count") +
ggtitle(label = "Number of significant HWE tests for\nthe same locus in
multiple populations")+
plot.theme
# Collate HWE tests and Fis per locus and pop
FisPops <- data.table(Fstats$Fis, keep.rownames = TRUE)
# fix the headings when there is only one pop
if (length(levels(pop(x))) == 1) {
#FisPops[, dumpop := NULL]
setnames(FisPops, "round(Fis, 4)", levels(pop(x)))
}
setnames(FisPops, "rn", "Locus")
FisPopsLong <-
data.table::melt(
FisPops,
id.vars = "Locus",
variable.name = "Population",
value.name = "Fis"
)
# FisPopsLong[, Locus := sub("^X", "", Locus)]
# hweout.dt[, Locus := gsub("-|/", replacement = ".", x = Locus)]
hwe_Fis <- merge(hweout.dt, FisPopsLong, by = c("Locus", "Population"))
hwe_Fis[, Deficiency := Fis > 0]
hwe_Fis[, Excess := Fis < 0]
setkey(hwe_Fis, Sig)
hwe_summary <-
hwe_Fis["sig", .(
nSig = .N,
nExpected = alpha_val * nLoc(x),
Deficiency = sum(Deficiency, na.rm = TRUE),
Excess = sum(Excess, na.rm = TRUE),
PropDeficiency = sum(Deficiency, na.rm = TRUE) /
.N
),
by = Population]
chsq <-
hwe_Fis[, .(ChiSquare = -2 * (sum(log(Prob)))), by = Population]
chsq[, pvalue := pchisq(ChiSquare, 2 * nLoc(x), lower.tail = FALSE)]
hwe_summary <- merge(hwe_summary, chsq, by = "Population")
# Fis vs Fst plot
corr <- round(cor(Fstats$perloc$Fis, Fstats$perloc$Fst,
"pairwise.complete.obs"), 3)
p3 <- ggplot(Fstats$perloc, aes(Fst, Fis)) +
geom_point(size=2,color=colors.barplot[1],alpha=0.5) +
geom_smooth(method = "lm",color=colors.barplot[2],fill=colors.barplot[2],
size=1) +
annotate("text",
x=min(Fstats$perloc$Fst, na.rm = TRUE) +
(max(Fstats$perloc$Fst, na.rm = TRUE) -
min(Fstats$perloc$Fst, na.rm = TRUE))*0.75,
y = min(Fstats$perloc$Fis, na.rm = TRUE) +
(max(Fstats$perloc$Fis, na.rm = TRUE) -
min(Fstats$perloc$Fis, na.rm = TRUE))*0.25,
col="black",
label= paste("r: ", corr,sep=""),parse=TRUE,
size=4) +
xlab("Fst") +
ylab("Fis") +
ggtitle("Fst vs Fis by locus") +
plot.theme
if(stdErr) {
jck <- utils.jackknife(x, FUN="utils.basic.stats", unit="loc",
n.cores = n.cores)
jckFst <- sapply(seq_len(nLoc(x)), extractParam, l=jck, param="Fst")
jckFis <- sapply(seq_len(nLoc(x)), extractParam, l=jck, param="Fis")
stdErrFst <- sqrt(var(jckFst)/nLoc(x))
stdErrFis <- sqrt(var(jckFis)/nLoc(x))
cat(report("The variation of Fis and Fst, respectively\n (measured as
standard error with the Jackknife method - see De Meeus 2018)
is:",
paste(c(stdErrFis, stdErrFst), collapse = ", "), "\n Fis vs Fst
ratio is:",
round(stdErrFis/stdErrFst, 2), "\n"))
}
# PRINTING OUTPUTS
# using package patchwork
p5 <- (p1 / p2 / p3)
print(p5)
print(hwe_summary, row.names = FALSE)
# Optionally save the plot ---------------------
if(!is.null(plot.file)){
tmp <- utils.plot.save(p5,
dir=plot.dir,
file=plot.file,
verbose=verbose)
}
if(!is.null(plot.file)){
tmp <- utils.plot.save(hwe_summary,
dir=plot.dir,
file=paste0(plot.file,"_tab"),
verbose=verbose)
}
# FLAG SCRIPT END
if (verbose >= 1) {
cat(report("Completed:", funname, "\n"))
}
# RETURN
if(stdErr) {
StdErr <- c(stdErrFis, stdErrFst)
names(StdErr) <- c(c("stdErrFis", "stdErrFst"))
return(invisible(list(hwe_summary=hwe_summary, StdErr=StdErr)))
} else {
return(invisible(list(hwe_summary=hwe_summary)))
}
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.