R/gl.diagnostics.hwe.r
In dartR.base: Analysing 'SNP' and 'Silicodart' Data - Basic Functions

Documented in gl.diagnostics.hwe

#' @name gl.diagnostics.hwe
#' @title Provides descriptive stats and plots to diagnose potential problems
#'   with Hardy-Weinberg proportions
#'   @family matched report

#' @description Different causes may be responsible for lack of Hardy-Weinberg
#' proportions. This function helps diagnose potential problems.
#' 
#' @inheritParams gl.report.hwe
#' @inheritParams utils.jackknife
#' @param n.cores The number of cores to use. If "auto", it will 
#' use all but one available cores [default "auto"].
#' @param bins Number of bins to display in histograms [default 20].
#' @param stdErr Whether standard errors for Fis and Fst should be computed 
#' (default: TRUE)
#' @param colors.hist List of two color names for the borders and fill of the
#'   histogram [default gl.colors(2)].
#' @param colors.barplot Vector with two color names for the observed and
#'   expected number of significant HWE tests [default gl.colors("2c")].
#' @param plot.theme User specified theme [default theme_dartR()].
#' @param plot.dir Directory in which to save files [default = working directory]
#' @param plot.file Name for the RDS binary file to save (base name only, exclude extension)
#' [default NULL]
#' @details This function initially runs \code{\link{gl.report.hwe}} and reports
#' the ternary plots. The remaining outputs follow the recommendations from
#'  Waples
#' (2015) paper and De Meeûs 2018. These include: 
#' \enumerate{ 
#' \item A histogram
#' with the distribution of p-values of the HWE tests. The distribution should
#' be roughly uniform across equal-sized bins. 
#' \item A bar plot with observed
#' and expected (null expectation) number of significant HWE tests for the same
#' locus in multiple populations (that is, the x-axis shows whether a locus
#' results significant in 1, 2, ..., n populations. The y axis is the count of
#' these occurrences. The zero value on x-axis shows the number of
#' non-significant tests). If HWE tests are significant by chance alone,
#' observed and expected number of HWE tests should have roughly a similar
#' distribution. 
#' \item A scatter plot with a linear regression between Fst and Fis,
#'  averaged across subpopulations. De Meeûs 2018 suggests that in the case of
#' Null alleles, a strong positive relationship is expected (together with the
#' Fis standard error much larger than the Fst standard error, see below).
#' \bold{Note}, this is not the scatter plot that Waples 2015 presents in his
#' paper. In the lower right corner of the plot, the Pearson correlation
#' coefficient is reported. 
#' \item The Fis and Fst (averaged over loci and
#' subpopulations) standard errors are also printed on screen and reported in
#' the returned list (if \code{stdErr=TRUE}). These are computed with the 
#' Jackknife method over loci (See De Meeûs 2007 for details on how this is 
#' computed) and it may take some time for these computations to complete. 
#' De Meeûs 2018 suggests that under a global significant heterozygosity 
#' deficit: 

#' - if the
#' correlation between Fis and Fst is strongly positive, and StdErrFis >>
#' StdErrFst, Null alleles are likely to be the cause. 

#' - if the correlation
#' between Fis and Fst is ~0 or mildly positive, and StdErrFis > StdErrFst,
#' Wahlund may be the cause. 

#' - if the correlation between Fis and Fst is ~0, and
#' StdErrFis ~ StdErrFst, selfing or sib mating could to be the cause.

#'  It is
#' important to realise that these statistics only suggest a pattern (pointers).
#' Their absence is not conclusive evidence of the absence of the problem, as 
#' their presence does not confirm the cause of the problem. 
#' \item A table where the
#' number of observed and expected significant HWE tests are reported by each
#' population, indicating whether these are due to heterozygosity excess or
#' deficiency. These can be used to have a clue of potential problems (e.g.
#' deficiency might be due to a Wahlund effect, presence of null alleles or
#' non-random sampling; excess might be due to sex linkage or different
#' selection between sexes, demographic changes or small Ne. See Table 1 in
#' Wapples 2015). The last two columns of the table generated by this function
#' report chisquare values and their associated p-values. Chisquare is computed
#' following Fisher's procedure for a global test (Fisher 1970). This basically
#' tests whether there is at least one test that is truly significant in the
#' series of tests conducted (De Meeûs et al 2009).
#' }
#' @author Custodian: Carlo Pacioni -- Post to
#'   \url{https://groups.google.com/d/forum/dartr}
#' @examples
#' \donttest{
#' require("dartR.data")
#' res <- gl.diagnostics.hwe(x = gl.filter.allna(platypus.gl[,1:50]), 
#' stdErr=FALSE, n.cores=1)
#' }
#' @references \itemize{ 
#' \item de Meeûs, T., McCoy, K.D., Prugnolle, F.,
#' Chevillon, C., Durand, P., Hurtrez-Boussès, S., Renaud, F., 2007. Population
#' genetics and molecular epidemiology or how to “débusquer la bête”. Infection,
#' Genetics and Evolution 7, 308-332. 
#' \item De Meeûs, T., Guégan, J.-F., Teriokhin, A.T., 2009. MultiTest V.1.2, 
#' a program to binomially combine independent tests and performance comparison 
#' with other related methods on
#' proportional data. BMC Bioinformatics 10, 443-443. 
#' \item De Meeûs, T., 2018. Revisiting FIS, FST, Wahlund Effects, and Null 
#' Alleles. Journal of Heredity 109, 446-456.
#' \item Fisher, R., 1970.
#' Statistical methods for research workers Edinburgh: Oliver and Boyd. 
#' \item
#' Waples, R. S. (2015). Testing for Hardy–Weinberg proportions: have we lost
#' the plot?. Journal of heredity, 106(1), 1-19.
#'  }
#'  
#' @seealso \code{\link{gl.report.hwe}}

#' @rawNamespace import(data.table, except = c(melt,dcast))
#' @export
#' @return A list with the table with the summary of the HWE tests and (if 
#' stdErr=TRUE) a named vector with the StdErrFis and StdErrFst.

gl.diagnostics.hwe <- function(x,
                               alpha_val = 0.05,
                               bins = 20,
                               stdErr = TRUE,
                               colors.hist = gl.colors(2),
                               colors.barplot = gl.colors("2c"),
                               plot.theme = theme_dartR(),
                               n.cores = "auto",
                               plot.file=NULL,
                               plot.dir=NULL,                               
                               verbose = NULL) {
  # SET VERBOSITY
  verbose <- gl.check.verbosity(verbose)
  
  # SET WORKING DIRECTORY
  plot.dir <- gl.check.wd(plot.dir,verbose=0)
  
  
  # FLAG SCRIPT START
  funname <- match.call()[[1]]
  utils.flag.start(func = funname,
                   build = "v.2023.3",
                   verbose = verbose)
  
  # CHECK DATATYPE
  datatype <- utils.check.datatype(x, verbose = verbose)
  
  # DO THE JOB
  # Set NULL to variables to pass CRAN checks
  Prob<-Sig<-N<-Locus<-Population<-Freq<-Data<-dumpop<-Deficiency<-Fis<-
    Excess<-pvalue<-ChiSquare<-Fst<-gen <-He <-value<- variable <-fst_obs<-NULL
  
  # Helper function
  extractParam <- function(i, l, param) {
    return(l[[i]][["overall"]][param])
  }
  #remove populations with only one individual
  drop_pop <- names(which(table(pop(x)) <2))
  if(length(drop_pop)>0){
  x <- gl.drop.pop(x,pop.list = drop_pop)
    cat(warn("Population with one individual were ignored in this analysis:", drop_pop, "\n"))
  }
  
  # Distribution of p-values by equal bins
  suppressWarnings(hweout <- gl.report.hwe(x, 
                                           sig_only = FALSE,
                                           verbose = 0))
  
  p1 <-   ggplot(hweout, aes(Prob)) +
    geom_histogram(bins = bins,
                   color = colors.hist[1],
                   fill = colors.hist[2]) +
    geom_hline(
      aes(
        yintercept = nrow(hweout) / (bins - 1),
        linetype = "Mean number\nof significant\nHWE tests"
      ),
      col = "red",
      size = 1
    ) +
    scale_linetype_manual(name = "", values = 'solid') +
    coord_cartesian(xlim = c(0, 1)) +
    xlab("Probability of departure from H-W proportions") +
    ylab("Count") +
    ggtitle("Distribution of p-values of HWE tests") +
    plot.theme
  
  # Fst vs Fis scatter plot with linear regression
  # lpops <- seppop(x)
  # lFstats <- lapply(lpops, utils.basic.stats, verbose = 0)
  # lFstats <- lapply(lFstats, "[[", "perloc")
  # Fstats <- rbindlist(l = lFstats, use.names = TRUE, idcol = TRUE)
  Fstats <- utils.basic.stats(x)
  
  # Number of loci out of HWE as a function of a population
  hweout.dt <- data.table(hweout)
  nTimesBypop <- hweout.dt[, .N, by = c("Locus", "Sig")]
  setkey(nTimesBypop, Sig)
  
  nTimesBypop.df <- as.data.frame(table(nTimesBypop["sig", N]))
  
  # Include the non-sig tests
  nTimesBypop.df <-
    rbind(data.frame(Var1 = 0, Freq = nTimesBypop["no_sig", sum(N)]),
          nTimesBypop.df)
  nTimesBypop.df$Data <- "Observed"
  
  # Generate the null distribution
  nullDist <-
    as.data.frame(table(rbinom(
      length(hweout.dt[, unique(Locus)]),
      size = length(hweout.dt[, unique(Population)]),
      prob = alpha_val
    )))
  nullDist$Data <- "Null expectation"
  
  # Compile the data for the plot
  nTimesBypop.fin <- rbind(nTimesBypop.df, nullDist)
  names(nTimesBypop.fin)[1] <- "nPop"
  nTimesBypop.fin$nPop <- factor(as.integer(nTimesBypop.fin$nPop))
  
  p2 <- ggplot(nTimesBypop.fin, aes(nPop, Freq, fill = Data)) +
    geom_col(position = "dodge2",
             alpha = 0.85,
             color = "black") +
    scale_fill_manual(values = c("Observed" = colors.barplot[1],
                                 "Null expectation" = colors.barplot[2])) +
    scale_y_log10() +
    xlab("Number of populations in which loci depart from HWE") +
    ylab("Count") +
    ggtitle(label =  "Number of significant HWE tests for\nthe same locus in 
            multiple populations")+
    plot.theme 
  
  # Collate HWE tests and Fis per locus and pop
  FisPops <- data.table(Fstats$Fis, keep.rownames = TRUE)
  
  # fix the headings when there is only one pop
  if (length(levels(pop(x))) == 1) {
    #FisPops[, dumpop := NULL]
    setnames(FisPops, "round(Fis, 4)", levels(pop(x)))
  }
  setnames(FisPops, "rn", "Locus")
  FisPopsLong <-
    data.table::melt(
      FisPops,
      id.vars = "Locus",
      variable.name = "Population",
      value.name = "Fis"
    )
  #  FisPopsLong[, Locus := sub("^X", "", Locus)]
  # hweout.dt[, Locus := gsub("-|/", replacement = ".", x = Locus)]
  hwe_Fis <- merge(hweout.dt, FisPopsLong, by = c("Locus", "Population"))
  hwe_Fis[, Deficiency := Fis > 0]
  hwe_Fis[,  Excess := Fis < 0]
  setkey(hwe_Fis, Sig)
  
  hwe_summary <-
    hwe_Fis["sig", .(
      nSig = .N,
      nExpected = alpha_val * nLoc(x),
      Deficiency = sum(Deficiency, na.rm = TRUE),
      Excess = sum(Excess, na.rm = TRUE),
      PropDeficiency = sum(Deficiency, na.rm = TRUE) /
        .N
    ),
    by = Population]
  
  chsq <-
    hwe_Fis[, .(ChiSquare = -2 * (sum(log(Prob)))), by = Population]
  chsq[, pvalue := pchisq(ChiSquare, 2 * nLoc(x), lower.tail = FALSE)]
  hwe_summary <- merge(hwe_summary, chsq, by = "Population")
  
  # Fis vs Fst plot
  
  corr <- round(cor(Fstats$perloc$Fis, Fstats$perloc$Fst, 
                    "pairwise.complete.obs"), 3)
  p3 <- ggplot(Fstats$perloc, aes(Fst, Fis)) + 
    geom_point(size=2,color=colors.barplot[1],alpha=0.5) + 
    geom_smooth(method = "lm",color=colors.barplot[2],fill=colors.barplot[2],
                size=1) +
    annotate("text", 
             x=min(Fstats$perloc$Fst, na.rm = TRUE) +
               (max(Fstats$perloc$Fst, na.rm = TRUE) - 
                  min(Fstats$perloc$Fst, na.rm = TRUE))*0.75, 
             y = min(Fstats$perloc$Fis, na.rm = TRUE) +
               (max(Fstats$perloc$Fis, na.rm = TRUE) -
                  min(Fstats$perloc$Fis, na.rm = TRUE))*0.25, 
             col="black",
             label= paste("r: ", corr,sep=""),parse=TRUE,
             size=4) +
    xlab("Fst") +
    ylab("Fis") +
    ggtitle("Fst vs Fis by locus") +
    plot.theme
  if(stdErr) {
    jck <- utils.jackknife(x, FUN="utils.basic.stats", unit="loc", 
                           n.cores = n.cores)
    
    jckFst <- sapply(seq_len(nLoc(x)), extractParam, l=jck, param="Fst")
    jckFis <- sapply(seq_len(nLoc(x)), extractParam, l=jck, param="Fis")
    stdErrFst <- sqrt(var(jckFst)/nLoc(x))
    stdErrFis <- sqrt(var(jckFis)/nLoc(x)) 
    
    cat(report("The variation of Fis and Fst, respectively\n (measured as 
               standard error with the Jackknife method - see De Meeus 2018) 
               is:",
               paste(c(stdErrFis, stdErrFst), collapse = ", "), "\n Fis vs Fst 
               ratio is:", 
               round(stdErrFis/stdErrFst, 2), "\n"))
  }
  
  # PRINTING OUTPUTS
  # using package patchwork
  p5 <- (p1 / p2 / p3)
  print(p5)
  
  print(hwe_summary, row.names = FALSE)
  
  # Optionally save the plot ---------------------
  
  if(!is.null(plot.file)){
    tmp <- utils.plot.save(p5,
                           dir=plot.dir,
                           file=plot.file,
                           verbose=verbose)
  }
  if(!is.null(plot.file)){
    tmp <- utils.plot.save(hwe_summary,
                           dir=plot.dir,
                           file=paste0(plot.file,"_tab"),
                           verbose=verbose)
  }
  
  # FLAG SCRIPT END
  
  if (verbose >= 1) {
    cat(report("Completed:", funname, "\n"))
  }
  
  # RETURN
  if(stdErr) {
    StdErr <- c(stdErrFis, stdErrFst)
    names(StdErr) <- c(c("stdErrFis", "stdErrFst"))
    return(invisible(list(hwe_summary=hwe_summary, StdErr=StdErr)))
  } else {
    return(invisible(list(hwe_summary=hwe_summary)))
  }
}