R/gl.report.heterozygosity.r

Defines functions gl.report.heterozygosity

Documented in gl.report.heterozygosity

#' @name gl.report.heterozygosity
#' @title Reports observed, expected and unbiased heterozygosities and FIS
#' (inbreeding coefficient) by population or by individual from SNP data
#' @family unmatched report
#'
#' @description Calculates the observed, expected and unbiased expected (i.e.
#' corrected for sample size) heterozygosities and FIS (inbreeding coefficient)
#' for each population or the observed heterozygosity for each individual in a
#' genlight object.

#' @param x Name of the genlight object containing the SNP data [required].
#' @param method Calculate heterozygosity by population (method='pop') or by
#' individual (method='ind') [default 'pop'].
#' @param n.invariant An estimate of the number of invariant sequence tags used
#' to adjust the heterozygosity rate [default 0].
#' @param subsample.pop Whether subsample populations to estimate observed 
#' heterozygosity (see Details) [default FALSE].
#' @param n.limit Minimum number of individuals that should have a population to 
#' perform subsampling to estimate heterozygosity [default 10].
#' @param nboots Number of bootstrap replicates to obtain confidence intervals
#' [default 0].
#' @param boot.method boostraping across individuals ("ind") or across loci
#'  ("loc") [default "ind"].
#' @param conf The confidence level of the required interval [default 0.95].
#' @param CI.type Method to estimate confidence intervals. One of
#' "norm", "basic", "perc" or "bca" [default "bca"].
#' @param ncpus Number of processes to be used in parallel operation. If ncpus
#' > 1 parallel operation is activated, see "Details" section [default 1].
#' @param plot.display Specify if plot is to be produced [default TRUE].
#' @param plot.theme Theme for the plot. See Details for options
#' [default theme_dartR()].
#' @param plot.colors.pop A color palette for population plots or a list with
#' as many colors as there are populations in the dataset 
#' [default gl.colors("dis")].
#' @param plot.colors.ind List of two color names for the borders and fill of
#' the plot by individual [default gl.colors(2)].
#' @param error.bar Statistic to be plotted as error bar either "SD" (standard 
#' deviation) or "SE" (standard error) or "CI" (confidence intervals)
#'  [default "SD"].
#' @param plot.dir Directory to save the plot RDS files [default as specified 
#' by the global working directory or tempdir()].
#' @param plot.file Name for the RDS binary file to save (base name only, 
#' exclude extension) [default NULL].
#' @param verbose Verbosity: 0, silent or fatal errors; 1, begin and end; 2,
#' progress log; 3, progress and results summary; 5, full report
#' [default NULL, unless specified using gl.set.verbosity].
#'
#' @details
#' Observed heterozygosity for a population takes the proportion of
#' heterozygous loci for each individual and averages it over all individuals in
#' that population. The calculations take into account missing values.
#'
#' Expected heterozygosity for a population takes the expected proportion of
#' heterozygotes, that is, expected under Hardy-Weinberg equilibrium, for each
#' locus, then averages this across the loci for an average estimate for the
#' population.
#'
#' The unbiased expected heterozygosity is calculated using the correction for 
#' sample size following equation 2 from Nei 1978.
#' 
#' Accuracy of all heterozygosity estimates is affected by small sample sizes,
#' and so is their comparison between populations or repeated analysis. Expected
#' heterozygosities are less affected because their calculations are based on 
#' allele frequencies while observed heterozygosities are strongly susceptible 
#' to sampling effects when the sample size is small.  
#'
#' Observed heterozygosity for individuals is calculated as the proportion of
#' loci that are heterozygous for that individual.
#'
#' Finally, the loci that are invariant across all individuals in the dataset
#' (that is, across populations), is typically unknown. This can render
#' estimates of heterozygosity analysis specific, and so it is not valid to
#' compare such estimates across species or even across different analyses 
#' (see Schimdt et al 2021). This is a similar problem faced by microsatellites. 
#' If you have an estimate of the
#' number of invariant sequence tags (loci) in your data, such as provided by
#' \code{\link{gl.report.secondaries}}, you can specify it with the n.invariant
#' parameter to standardize your estimates of heterozygosity. This is called
#' autosomal heterozygosities by Schimddt et al (2021).
#'
#' \strong{NOTE}: It is important to realise that estimation of adjusted (autosomal)
#' heterozygosity requires that secondaries not to be removed.
#'
#' Heterozygosities and FIS (inbreeding coefficient) are calculated by locus
#' within each population using the following equations, and then averaged across 
#' all loci:
#' \itemize{
#' \item Observed heterozygosity (Ho) = number of heterozygotes / n_Ind,
#' where n_Ind is the number of individuals without missing data for that locus.
#' \item Observed heterozygosity adjusted (Ho.adj) <- Ho * n_Loc /
#'  (n_Loc + n.invariant),
#' where n_Loc is the number of loci that do not have all missing data  and
#' n.invariant is an estimate of the number of invariant loci to adjust
#' heterozygosity.
#' \item Expected heterozygosity (He) = 1 - (p^2 + q^2),
#' where p is the frequency of the reference allele and q is the frequency of
#' the alternative allele.
#' \item Expected heterozygosity adjusted (He.adj) = He * n_Loc /
#' (n_Loc + n.invariant)
#' \item Unbiased expected heterozygosity (uHe) = He * (2 * n_Ind /
#' (2 * n_Ind - 1))
#' \item Inbreeding coefficient (FIS) = 1 - Ho / uHe
#' }

#'\strong{ Function's output }
#'
#' Output for method='pop' is an ordered barchart of observed heterozygosity,
#' unbiased expected heterozygosity and FIS (Inbreeding coefficient) across 
#' populations together with a table of mean observed and expected 
#' heterozygosities and FIS by population and their respective standard 
#' deviations (SD).

#' In the output, it is also reported by population: the number of loci used to
#'  estimate heterozygosity (n.Loc), the number of polymorphic loci (polyLoc),
#'  the number of monomorphic loci (monoLoc) and loci with all missing data
#'   (all_NALoc).
#'   
#' Output for method='ind' is a histogram and a boxplot of heterozygosity across
#' individuals.
#' 
#' If a plot.file is given, the ggplot arising from this function is saved as an 
#' "RDS" binary file using saveRDS(); can be reloaded with readRDS(). A file 
#' name must be specified for the plot to be saved.
#' 
#' If a plot directory (plot.dir) is specified, the ggplot binary is saved to 
#' that directory; otherwise to the tempdir(). 
#'  
#'  Examples of other themes that can be used can be consulted in: 
#'  \itemize{
#'  \item \url{https://ggplot2.tidyverse.org/reference/ggtheme.html} and \item
#'  \url{https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/}
#'  }
#'  
#'  \strong{Subsampling populations}
#'  
#' To test the effect of five population sample sizes (n = 10, 5, 4, 3, 2) on 
#' observed heterozygosity estimates, the function subsamples individuals,
#'  without replacement. The subsampling is repeated 10 times for each sample
#'   size n. This approach is not an implementation of Schmidt et al (2021). 
#'   Please refer to this paper for additional complexities in estimating
#'   heterozygosity using SNP data.
#'  
#'   \strong{Error bars}
#'  
#'  The best method for presenting or assessing genetic statistics depends on 
#'  the type of data you have and the specific questions you're trying to 
#'  answer. Here's a brief overview of when you might use each method:
#'  
#'   \strong{1. Confidence Intervals ("CI"):}
#'   
#'  - Usage: Often used to convey the precision of an estimate.
#'  
#'  - Advantage: Confidence intervals give a range in which the true parameter 
#'  (like a population mean) is likely to fall, given the data and a specified 
#'  probability (like 95\%).
#'  
#'  - In Context: For genetic statistics, if you're estimating a parameter,
#'   a 95\% CI gives you a range in which you're 95\% confident the true parameter
#'    lies.
#'  
#'   \strong{2. Standard Deviation ("SD"):}
#'   
#'  - Usage: Describes the amount of variation from the average in a set of data.
#'  
#'  - Advantage: Allows for an understanding of the spread of individual data
#'   points around the mean.
#'   
#'  - In Context: If you're looking at the distribution of a quantitative trait 
#'  (like height) in a population with a particular genotype, the SD can 
#'  describe how much individual heights vary around the average height.
#'  
#'   \strong{3. Standard Error ("SE"):}
#'   
#'  - Usage: Describes the precision of the sample mean as an estimate of the 
#'  population mean.
#'  
#'  - Advantage: Smaller than the SD in large samples; it takes into account 
#'  both the SD and the sample size. 
#'  
#'  - In Context: If you want to know how accurately your sample mean represents
#'   the population mean, you'd look at the SE.
#'   
#'    \strong{Recommendation:}
#'    
#'   - If you're trying to convey the precision of an estimate, confidence 
#'   intervals are very useful.
#'   
#'   - For understanding variability within a sample, standard deviation is key.
#'   
#'   - To see how well a sample mean might estimate a population mean, consider 
#'   the standard error.
#'   
#'   In practice, geneticists often use a combination of these methods to 
#'   analyze and present their data, depending on their research questions and 
#'   the nature of the data.
#'   
#'  \strong{Confidence Intervals}
#'
#' The uncertainty of a parameter, in this case the mean of the statistic, can
#' be summarised by a confidence interval (CI) which includes the true parameter
#' value with a specified probability (i.e. confidence level; the parameter
#' "conf" in this function).
#'
#' In this function, CI are obtained using Bootstrap which is an inference
#' method that samples with replacement the data (i.e. loci) and calculates the
#'  statistics every time.
#'
#'  This function uses the function \link[boot]{boot} (package boot) to perform
#'  the bootstrap replicates and the function \link[boot]{boot.ci}
#'  (package boot) to perform the calculations for the CI.
#'
#'  Four different types of nonparametric CI can be calculated
#'   (parameter "CI.type" in this function):
#'   \itemize{
#'    \item First order normal approximation interval ("norm").
#'    \item Basic bootstrap interval ("basic").
#'    \item Bootstrap percentile interval ("perc").
#'    \item Adjusted bootstrap percentile interval ("bca").
#'    }
#'
#' The studentized bootstrap interval ("stud") was not included in the CI types
#'  because it is computationally intensive, it may produce estimates outside
#'  the range of plausible values and it has been found to be erratic in
#'  practice, see for example the "Studentized (t) Intervals" section in:
#'
#'    \url{https://www.r-bloggers.com/2019/09/understanding-bootstrap-confidence-interval-output-from-the-r-boot-package/}
#'
#'     Nice tutorials about the different types of CI can be found in:
#'
#'     \url{https://www.datacamp.com/tutorial/bootstrap-r}
#'
#'     and
#'
#'    \url{https://www.r-bloggers.com/2019/09/understanding-bootstrap-confidence-interval-output-from-the-r-boot-package/}
#'
#'      Efron and Tibshirani (1993, p. 162) and Davison and Hinkley
#'      (1997, p. 194) suggest that the number of bootstrap replicates should
#'      be between 1000 and 2000.
#'
#'  \strong{It is important} to note that unreliable confidence intervals will be
#'   obtained if too few number of bootstrap replicates are used.
#'   Therefore, the function \link[boot]{boot.ci} will throw warnings and errors
#'    if bootstrap replicates are too few. Consider increasing the number of
#'    bootstrap replicates to at least 200.
#'
#'    The "bca" interval is often cited as the best for theoretical reasons,
#'    however it may produce unstable results if the bootstrap distribution
#'     is skewed or has extreme values. For example, you might get the warning
#'     "extreme order statistics used as endpoints" or the error "estimated
#'     adjustment 'a' is NA". In this case, you may want to use more bootstrap
#'     replicates or a different method or check your data for outliers.
#'
#'    The error "estimated adjustment 'w' is infinite" means that the estimated
#'    adjustment ‘w’ for the "bca" interval is infinite, which can happen when
#'    the empirical influence values are zero or very close to zero. This can
#'    be caused by various reasons, such as:
#'
#'    The number of bootstrap replicates is too small, the statistic of interest
#'     is constant or nearly constant across the bootstrap samples, the data
#'     contains outliers or extreme values.
#'
#'     You can try some possible solutions, such as:
#'
#' Increasing the number of bootstrap replicates, using a different type of
#' bootstrap confidence interval or removing or transforming the outliers or
#'  extreme values.
#'  
#'  \strong{Parallelisation}
#'
#'  If the parameter ncpus > 1, parallelisation is enabled. In Windows, parallel
#'   computing employs a "socket" approach that starts new copies of R on each
#'    core. POSIX systems, on the other hand (Mac, Linux, Unix, and BSD),
#'    utilise a "forking" approach that replicates the whole current version of
#'     R and transfers it to a new core.
#'
#'     Opening and terminating R sessions in each core involves a significant
#'     amount of processing time, therefore parallelisation in Windows machines
#'    is only quicker than not using parallelisation when nboots > 1000-2000.
#'    
#' @author Custodian: Luis Mijangos (Post to
#' \url{https://groups.google.com/d/forum/dartr})
#'
#' @references
#' \itemize{
#' \item Nei, M. (1978). Estimation of average heterozygosity and genetic 
#' distance from a small number of individuals. Genetics, 89(3), 583-590.
#' \item Schmidt, T. L., Jasper, M. E., Weeks, A. R., & Hoffmann, A. A. (2021).
#'  Unbiased population heterozygosity estimates from genome‐wide sequence data.
#'   Methods in Ecology and Evolution, 12(10), 1888-1898.
#'   }
#'
#' @examples
#'  \donttest{
#' require("dartR.data")
#' df <- gl.report.heterozygosity(platypus.gl)
#' df <- gl.report.heterozygosity(platypus.gl,method='ind')
#' n.inv <- gl.report.secondaries(platypus.gl)
#' gl.report.heterozygosity(platypus.gl, n.invariant = n.inv[7, 2])
#' gl.report.heterozygosity(platypus.gl, subsample.pop = TRUE)
#' }
#' df <- gl.report.heterozygosity(platypus.gl)

#' @seealso \code{\link{gl.filter.heterozygosity}}

#' @export
#' @return A dataframe containing population labels, heterozygosities, FIS,
#' their standard deviations and sample sizes.

gl.report.heterozygosity <- function(x,
                                     method = "pop",
                                     n.invariant = 0,
                                     subsample.pop = FALSE,
                                     n.limit = 10,
                                     nboots = 0,
                                     boot.method = "ind",
                                     conf = 0.95,
                                     CI.type = "bca",
                                     ncpus = 1,
                                     plot.display = TRUE,
                                     plot.theme = theme_dartR(),
                                     plot.colors.pop = gl.colors("dis"),
                                     plot.colors.ind = gl.colors(2),
                                     plot.file = NULL,
                                     plot.dir = NULL,
                                     error.bar = "SD",
                                     verbose = NULL) {
  
  # Utils functions are in utils.het.report.r
  
  # setting parallel
  # if (ncpus > 1) {
    if (grepl("unix", .Platform$OS.type, ignore.case = TRUE)) {
      parallel <- "multicore"
    }
    ## if windows
    if (!grepl("unix", .Platform$OS.type, ignore.case = TRUE)) {
      parallel <- "snow"
    }
  # }
  
  # SET VERBOSITY
  verbose <- gl.check.verbosity(verbose)
  if(verbose==0){plot.display <- FALSE}
  
  # SET WORKING DIRECTORY
  plot.dir <- gl.check.wd(plot.dir,verbose=0)
  
  # FLAG SCRIPT START
  funname <- match.call()[[1]]
  utils.flag.start(func = funname,
                   build = "Jackson",
                   verbose = verbose)
  
  # CHECK DATATYPE
  datatype <-
    utils.check.datatype(x, accept = "SNP", verbose = verbose)
  
  # FUNCTION SPECIFIC ERROR CHECKING
  
  if (!(method == "pop" | method == "ind")) {
    cat(
      warn(
        "Warning: Method must either be by population or by individual,
                set to method='pop'\n"
      )
    )
    method <- "pop"
  }
  
  if (n.invariant < 0) {
    cat(warn(
      "Warning: Number of invariant loci must be non-negative, set to
            zero\n"
    ))
    n.invariant <- 0
    if (verbose == 5) {
      cat(report(
        "  No. of invariant loci can be esimated using
                    gl.report.secondaries\n"
      ))
    }
  }
  
  if (any(grepl(x@other$history, pattern = "gl.filter.secondaries") == TRUE) &
      n.invariant > 0) {
    cat(
      warn(
        "  Warning: Estimation of adjusted heterozygosity requires that
                secondaries not to be removed. A gl.filter.secondaries call was
                found in the history. This may cause the results to be
                incorrect\n"
      )
    )
  }
  
  if( nboots == 0 & error.bar == "CI" ){
    cat(error(
"  Number of boostraps ('nboots' parameter) must be > 0 to calculate confidence 
   intervals \n"))
    stop()
  }
  
  # DO THE JOB
  
  ########### FOR METHOD BASED ON POPULATIONS
  
  if (method == "pop") {
    # Set a population if none is specified (such as if the genlight object
    # has been generated manually)
    if (is.null(pop(x)) |
        is.na(length(pop(x))) | length(pop(x)) <= 0) {
      if (verbose >= 2) {
        cat(
          warn(
            "  No population assignments detected,
                             individuals assigned to a single population
                        labelled 'pop1'\n"
          )
        )
      }
      pop(x) <- array("pop1", dim = nInd(x))
      pop(x) <- as.factor(pop(x))
    }
    
    # Subsampling populations as in Schmidt (2021)
    if(subsample.pop==TRUE){
    res_sub <- utils.subsample.pop(x,n.limit = n.limit)
    }
    
    # Split the genlight object into a list of populations
    sgl <- seppop(x)
    
    # OBSERVED HETEROZYGOSITY
    if (verbose >= 2) {
      cat(
        report(
          "  Calculating Observed Heterozygosities, averaged across
                    loci, for each population\n"
        )
      )
    }
    
    # Calculate heterozygosity for each population in the list
    # CP = Carlo Pacioni CP ###
    Ho.loc <-
      lapply(sgl, function(x)
        colMeans(as.matrix(x) == 1, na.rm = TRUE))

    Ho <-
      unlist(lapply(sgl, function(x)
        mean(
          colMeans(as.matrix(x) == 1, na.rm = TRUE), na.rm = TRUE
        )))
    
    ### CP ### observed heterozygosity standard deviation
    HoSD <-
      unlist(lapply(sgl, function(x)
        sd(
          colMeans(as.matrix(x) == 1, na.rm = TRUE), na.rm = TRUE
        )))
    
    HoSE <- unlist(lapply(sgl, function(x)
      std.error(colMeans(
        as.matrix(x) == 1, na.rm = TRUE
      ))))
    
    ##########
    
    # Calculate the number of loci that are not all NAs CP ###
    n_loc <-
      unlist(lapply(sgl, function(x)
        sum(!(
          colSums(is.na(as.matrix(x))) == nrow(as.matrix(x))
        ))))
    ##########
    
    # calculate the number of polymorphic and monomorphic loci by population
    poly_loc <- NULL
    mono_loc <- NULL
    all_na_loc <- NULL
    
    for (y in 1:length(sgl)) {
      y_temp <- sgl[[y]]
      hold <- y_temp
      mono_tmp <- gl.allele.freq(y_temp, simple = TRUE, verbose = 0)
      loc.list <- rownames(mono_tmp[which(mono_tmp$alf1 == 1 |
                                            mono_tmp$alf1 == 0), ])
      loc.list_NA <- which(colSums(is.na(as.matrix(y_temp)))==nInd(y_temp))
        # rownames(mono_tmp[which(is.na(mono_tmp$alf1)), ])
      
      # Remove NAs from list of monomorphic loci and loci with all NAs
      # loc.list <- loc.list[!is.na(loc.list)]
      
      # remove monomorphic loci and loci with all NAs
      if (length(loc.list) > 0) {
        y_temp <- gl.drop.loc(y_temp, loc.list = loc.list, verbose = 0)
      }
      
      poly_loc <-  c(poly_loc, nLoc(y_temp))
      mono_loc <- c(mono_loc, (nLoc(hold) - nLoc(y_temp) - length(loc.list_NA)))
      all_na_loc <- c(all_na_loc, length(loc.list_NA))
    }
    
    # Apply correction CP ###
    Ho.adj <- Ho * n_loc / (n_loc + n.invariant)
    # Manually compute SD for Ho.adj sum of the square of differences from
    # the mean for polymorphic sites plus sum of the square of differences
    #(which is the Ho.adj because Ho=0) from the mean for invariant sites
    Ho.adjSD <-
      sqrt((
        mapply(function(x, Mean)
          sum((x - Mean) ^ 2, na.rm = TRUE), Ho.loc, Mean = Ho.adj) +
          n.invariant * Ho.adj ^
          2
      ) / (n_loc +
             n.invariant - 1))
    
    Ho.adjSE <-  Ho.adjSD / sqrt(poly_loc+mono_loc)
    
    n_ind <- sapply(sgl, ind.count)
    
    ##########
    
    # EXPECTED HETEROZYGOSITY
    if (verbose >= 2) {
      cat(report("  Calculating Expected Heterozygosities\n\n"))
    }
    
    Hexp <- array(NA, length(sgl))
    HexpSD <- array(NA, length(sgl))
    HexpSE <- array(NA, length(sgl))
    
    uHexp <- array(NA, length(sgl))
    uHexpSD <- array(NA, length(sgl))
    uHexpSE <- array(NA, length(sgl))
    
    Hexp.adj <- array(NA, length(sgl))
    Hexp.adjSD <- array(NA, length(sgl))
    Hexp.adjSE <- array(NA, length(sgl))
    
    FIS <- array(NA, length(sgl))
    FISSD <- array(NA, length(sgl))
    FISSE <- array(NA, length(sgl))
    
    # dataframes to store data for boxplots
    uHe_df <- as.list(rep(NA, length(sgl)))
    Fis_df <- as.list(rep(NA, length(sgl)))
    
    # For each population
    for (i in 1:length(sgl)) {
      gl <- sgl[[i]]
      gl <- utils.recalc.freqhomref(gl, verbose = 0)
      gl <- utils.recalc.freqhomsnp(gl, verbose = 0)
      gl <- utils.recalc.freqhets(gl, verbose = 0)
      p <- gl@other$loc.metrics$FreqHomRef
      q <- gl@other$loc.metrics$FreqHomSnp
      hets <- gl@other$loc.metrics$FreqHets
      p <- (2 * p + hets) / 2
      q <- (2 * q + hets) / 2
      H <- 1 - (p ^ 2 + q ^ 2)
      
      Hexp[i] <- mean(H, na.rm = TRUE)
      HexpSD[i] <- sd(H, na.rm = TRUE)
      HexpSE[i] <- std.error(H)
      
      ### CP ### Unbiased He (i.e. corrected for sample size) hard
      # coded for diploid
      uH <- (2 * as.numeric(n_ind[i]) / (2 * as.numeric(n_ind[i]) - 1)) * H
      uHe_df[[i]] <-  uH
      ### CP ###
      uHexp[i] <- mean(uH, na.rm = TRUE)
      uHexpSD[i] <- sd(uH, na.rm = TRUE)
      uHexpSE[i] <- std.error(uH)
      
      Hexp.adj[i] <- Hexp[i] * n_loc[i] / (n_loc[i] + n.invariant)
      Hexp.adjSD[i] <-
        sqrt((sum((H - Hexp.adj[i]) ^ 2, na.rm = TRUE) + n.invariant * Hexp.adj[i] ^ 2) /
               (n_loc[i] + n.invariant - 1))
      Hexp.adjSE[i] <- Hexp.adjSD[i] / sqrt(poly_loc[i]+mono_loc[i])
    
      FIS_temp <- (uH - Ho.loc[[i]]) /  uH 
      Fis_df[[i]] <- FIS_temp
      FIS[i] <- mean(FIS_temp, na.rm = TRUE)
      FISSD[i] <- sd(FIS_temp, na.rm = TRUE)
      FISSE[i] <- std.error(FIS_temp)
      
    }
    
    # Prep for results
    df.base <-
      data.frame(
        pop = popNames(x),
        n.Ind = round(n_ind, 6),
        n.Loc = n_loc,
        n.Loc.adj = n_loc / (n_loc + n.invariant),
        polyLoc = poly_loc ,
        monoLoc = mono_loc ,
        all_NALoc = all_na_loc)
    
    if (n.invariant == 0) {
      df.params <- data.frame(
        Ho = round(as.numeric(Ho),6),
        HoSD = round(HoSD,6),
        HoSE = round(HoSE, 6),
            
        He = round(Hexp, 6),
        HeSD = round(HexpSD, 6),
        HeSE = round(HexpSE,6),
       
        uHe = round(uHexp, 6),
        uHeSD = round(uHexpSD, 6),
        uHeSE = round(uHexpSE ,6),
       
        FIS = round(FIS,6),
        FISSD = round(FISSD,6),
        FISSE = round(FISSE , 6)
    )
    } else {
      df.params <- data.frame(
        Ho.adj = round(as.numeric(Ho.adj),6),
        Ho.adjSD = round(Ho.adjSD,6),
        Ho.adjSE = round(Ho.adjSE, 6),
        
        He.adj = round(Hexp.adj, 6),
        He.adjSD = round(Hexp.adjSD, 6),
        He.adjSE = round(Hexp.adjSE, 6)
        )
    }
    
    df <- cbind(df.base, df.params)
    npops <- nPop(x)
    
    # bootstrapping
    if (nboots > 0) {
      pop_boot <- lapply(sgl, function(y) {
        
        df <- as.matrix(y)
        
        if(boot.method == "loc"){
          df <- t(df)
        }

        res_boots <- boot::boot(
          data = df,
          statistic = pop.het,
          n.invariant = n.invariant,
          aHet = n.invariant > 0,
          boot_method = boot.method,
          R = nboots,
          parallel = parallel,
          ncpus = ncpus
        )
        return(res_boots)
      })
      
      # confidence intervals
      
      # creating matrices to store CI
      nparams <- ncol(pop_boot[[1]]$t)
      res_CI <- replicate(npops,
                          as.data.frame(matrix(nrow = nparams, ncol = 2)),
                          simplify = FALSE)
      
      if(nparams == 4) {
        pop_res <- rbind(Ho, Hexp, uHexp, FIS)
      } else {
        pop_res <- rbind(Ho.adj, Hexp.adj)
      }
      
      for (pop_n in 1:length(sgl)) {
        for (stat_n in seq_len(nparams)) {
          res_CI_tmp <- boot::boot.ci(
            boot.out = pop_boot[[pop_n]],
            conf = conf,
            type = CI.type,
            index = stat_n
            # ,
            # t0 =  pop_res[stat_n, pop_n],
            # t = pop_boot[[pop_n]]$t[, stat_n]
          )
          
          res_CI[[pop_n]][stat_n,] <-
            tail(as.vector(res_CI_tmp[[4]]), 2)
          
        }
      }
      
      # Build df with CI
      params_names <- names(df.params)[grep(pattern = "SD|SE", x = names(df.params), invert = TRUE)]
      lCI_df <- lapply(seq_len(nparams), function(rn, lres=res_CI, nms=params_names) {
        df <- do.call(rbind, lapply(lres, "[", rn,))
        names(df) <- paste0(params_names[rn], c("LCI", "HCI"))
        return(df)
      })
      df.CI <- do.call(cbind, lCI_df)
      df <- cbind(df, df.CI)
      
      # set up name order for df
      sfx <- c("", "SD", "SE", "LCI", "HCI")
      nms.order <- vector("character", length = nparams * length(sfx))
      for(rn in seq_len(nparams)) {
        nms.order[seq_len(length(sfx)) + length(sfx) * (rn - 1)] <- 
          paste0(params_names[rn], sfx)
      }
      # cbind CI to existing results
      df <- df[, c(names(df.base), nms.order)]
    }
    
    if (plot.display) {
      res.mean <- subsample <- error_L <- error_H <- value <- color <- variable <- He.adj <- res_SE <- NULL
      
      # printing plots and reports assigning colors to populations
      if (is(plot.colors.pop, "function")) {
        colors_pops <- plot.colors.pop(length(levels(pop(x))))
      }
      
      if (!is(plot.colors.pop, "function")) {
        colors_pops <- plot.colors.pop
      }
      
      if (n.invariant == 0) {
        
        pop_list_plot <- df
        pop_list_plot$pop <- as.factor(pop_list_plot$pop)
        pop_list_plot$color <- colors_pops
        
        pop_list_plot_stat <- pop_list_plot[,c("Ho", "uHe", "FIS", "n.Ind",  "pop",  "color")]
        pop_list_plot_stat <- reshape2::melt(pop_list_plot_stat, id = c("pop", "color", "n.Ind"))
        
        if(error.bar=="SD"){
          pop_list_plot_error <- pop_list_plot[,c("HoSD", "uHeSD", "FISSD","pop")]
          pop_list_plot_error <- reshape2::melt(pop_list_plot_error,id = c("pop"))
          colnames(pop_list_plot_error) <- c("pop","variable","error")
          pop_list_plot_error <- pop_list_plot_error[,c("pop","error")]
          pop_list_plot_stat <- cbind(pop_list_plot_stat,error=pop_list_plot_error$error)
        }
        
        if(error.bar=="SE"){
          pop_list_plot_error <- pop_list_plot[,c("HoSE","uHeSE","FISSE","pop")]
          pop_list_plot_error <- reshape2::melt(pop_list_plot_error,id = c("pop"))
          colnames(pop_list_plot_error) <- c("pop","variable","error")
          pop_list_plot_error <- pop_list_plot_error[,c("pop","error")]
          pop_list_plot_stat <- cbind(pop_list_plot_stat,error=pop_list_plot_error$error)
          }
        
        if(error.bar=="CI"){
          pop_list_plot_error_L <- pop_list_plot[,c("HoLCI","uHeLCI","FISLCI","pop")]
          pop_list_plot_error_H <- pop_list_plot[,c("HoHCI","uHeHCI","FISHCI","pop")]
          pop_list_plot_error_L <- reshape2::melt(pop_list_plot_error_L,id = c("pop"))
          pop_list_plot_error_H <- reshape2::melt(pop_list_plot_error_H,id = c("pop"))
          colnames(pop_list_plot_error_L) <- c("pop","variable_L","error_L")
          colnames(pop_list_plot_error_H) <- c("pop","variable_H","error_H")
          pop_list_plot_error <- cbind(pop_list_plot_error_L,pop_list_plot_error_H)
          pop_list_plot_stat <- cbind(pop_list_plot_stat,
                                      error_L = pop_list_plot_error$error_L,
                                      error_H = pop_list_plot_error$error_H)
          
          }
        
        p3 <-
          ggplot(data = pop_list_plot_stat, aes(x = pop, 
                                                y = value,
                                                fill = pop)) +
          geom_bar(stat = "identity", 
                   color = "black", 
                   position = position_dodge())+ 
          facet_wrap(~variable, nrow=1) +
          scale_fill_manual(values = pop_list_plot_stat$color) +
          scale_x_discrete(labels = paste(pop_list_plot_stat$pop,
                                          round(pop_list_plot_stat$n.Ind,
                                                0),
                                          sep = " | ")) +
          plot.theme +
          theme(
            axis.ticks.x = element_blank(),
            axis.text.x = element_text(
              angle = 90,
              hjust = 1,
              face = "bold",
              size = 12
            ),
            axis.title.x = element_blank(),
            axis.ticks.y = element_blank(),
            axis.title.y = element_blank(),
            legend.position = "none"
          ) 
        
        if(error.bar=="SD"){
          p3 <- p3 + 
            geom_errorbar(aes(ymin = value, 
                              ymax = value + error), 
                          width=0.5)+
            ggtitle(label = "Heterozygosities and FIS by Population",
                    subtitle = "Error bars show Standard Deviation")
        }
        
        if(error.bar=="SE"){
          p3 <- p3 + 
            geom_errorbar(aes(ymin = value - error, 
                              ymax = value + error), 
                          width=0.5) +
            ggtitle(label = "Heterozygosities and FIS by Population",
                    subtitle = "Error bars show Standard Error")
        }
        
        if(error.bar=="CI"){
          p3 <- p3 + 
            geom_errorbar(aes(ymin = error_L, 
                              ymax = error_H), 
                          width=0.5)+
            ggtitle(label = "Heterozygosities and FIS by Population",
                    subtitle = "Error bars show Confidence Intervals")
        }
        
        # Subsampling populations as in Schmidt (2021)
        if(subsample.pop==TRUE){
          
          res_sub_plot <- res_sub
          res_sub_plot$pop <- as.factor(res_sub_plot$pop)
          res_sub_plot$subsample <- as.factor(res_sub_plot$subsample )
          res_sub_plot$color <- rep(colors_pops,each=5)
          
          res_sub_plot_2 <- reshape2::melt(res_sub_plot, id = c("pop", "color", "subsample","res_SE"))
          
          p4 <- ggplot(res_sub_plot_2,aes(x=subsample,y=value))+
            
            geom_bar(stat = "identity", 
                     color = "black", 
                     fill = res_sub_plot_2$color,
                     position = position_dodge())+ 
            facet_wrap(~variable, nrow=1) +
            facet_wrap(~pop)+
            geom_errorbar(aes(ymin = value, 
                              ymax = value + res_SE), 
                          width=0.5)+
            ggtitle(label = "Observed heterozygosity in populations subsamples",
                    subtitle = "Error bars show Standard Error") +
            plot.theme +
            theme(
              axis.ticks.x = element_blank(),
              axis.text.x = element_text(
                face = "bold",
                size = 12
              ),
              axis.title.x = element_blank(),
              axis.ticks.y = element_blank(),
              axis.title.y = element_blank(),
              legend.position = "none"
            )
          
          p3 <-  p3 / p4
          
        }
        
      } else {

        df.ordered <- df
        df.ordered$color <- colors_pops
        df.ordered <- df.ordered[order(df.ordered$Ho.adj), ]
        df.ordered$pop <- factor(df.ordered$pop, levels = df.ordered$pop)
        p1 <-
          ggplot(df.ordered, aes(
            x = pop,
            y = Ho.adj,
            fill = pop
          )) + geom_bar(position = "dodge",
                        stat = "identity",
                        color = "black") +
          scale_fill_manual(values = df.ordered$color) +
          scale_x_discrete(labels = paste(df.ordered$pop,
                                          round(df.ordered$n.Ind, 0),
                                          sep = " | ")) + plot.theme + theme(
                                            axis.ticks.x = element_blank(),
                                            axis.text.x = element_blank(),
                                            axis.title.x = element_blank(),
                                            axis.ticks.y = element_blank(),
                                            axis.title.y = element_blank(),
                                            legend.position = "none"
                                          ) +
          labs(fill = "Population") +
          ggtitle("Adjusted Observed Heterozygosity by Population")
        
        p2 <-
          ggplot(df.ordered, aes(
            x = pop,
            y = He.adj,
            fill = pop
          )) + geom_bar(position = "dodge",
                        stat = "identity",
                        color = "black") +
          scale_fill_manual(values = df.ordered$color) +
          scale_x_discrete(labels = paste(df.ordered$pop,
                                          round(df.ordered$n.Ind, 0),
                                          sep = " | ")) + plot.theme + theme(
                                            axis.ticks.x = element_blank(),
                                            axis.text.x = element_text(
                                              angle = 90,
                                              hjust = 1,
                                              face = "bold",
                                              size = 12
                                            ),
                                            axis.title.x = element_blank(),
                                            axis.ticks.y = element_blank(),
                                            axis.title.y = element_blank(),
                                            legend.position = "none"
                                          ) +
          labs(fill = "Population") +
          ggtitle("Adjusted Expected Heterozygosity by Population")
        
        p3 <- (p1 / p2)
      }
    }
    
    # OUTPUT REPORT
    if (verbose >= 3) {
      cat("  Reporting Heterozygosity by Population\n")
      cat("\n  No. of loci =", nLoc(x), "\n")
      cat("  No. of individuals =", nInd(x), "\n")
      cat("  No. of populations =", nPop(x), "\n")
      if (n.invariant == 0) {
      cat("    Minimum Observed Heterozygosity: ", round(min(df$Ho, na.rm = TRUE), 6), "\n")
      cat("    Maximum Observed Heterozygosity: ", round(max(df$Ho, na.rm = TRUE), 6), "\n")
      cat("    Average Observed Heterozygosity: ", round(mean(df$Ho, na.rm = TRUE), 6), "\n\n")
      
      cat("    Minimum Unbiased Expected Heterozygosity: ",
          round(min(df$uHe, na.rm = TRUE), 6), "\n")
      cat("    Maximum Unbiased Expected Heterozygosity: ",
          round(max(df$uHe, na.rm = TRUE), 6), "\n")
      cat("    Average Unbiased Expected Heterozygosity: ",
          round(mean(df$uHe, na.rm = TRUE), 6), "\n")
      cat("  Heterozygosity estimates not corrected for uncalled invariant loci\n")
      
      } else {
        
        cat("    Minimum Observed adjusted Heterozygosity: ", 
            round(min(df$Ho.adj, na.rm = TRUE), 6), "\n")
        cat("    Maximum Observed adjusted Heterozygosity: ", 
            round(max(df$Ho.adj, na.rm = TRUE), 6), "\n")
        cat("    Average Observed adjusted Heterozygosity: ",
            round(mean(df$Ho.adj, na.rm = TRUE), 6), "\n\n")
        cat("    Minimum Unbiased adjusted Expected Heterozygosity: ", 
            round(min(df$He.adj, na.rm = TRUE), 6), "\n")
        cat("    Maximum Unbiased adjusted Expected Heterozygosity: ", 
            round(max(df$He.adj, na.rm = TRUE), 6), "\n")
        cat("    Average Unbiased adjusted Expected Heterozygosity: ",
            round(mean(df$He.adj, na.rm = TRUE), 6), "\n\n")
        cat(
          "  Average correction factor for invariant loci =",
          mean(n_loc / (n_loc + n.invariant), na.rm = TRUE),
          "\n"
        )
      }
    }
      
    # PRINTING OUTPUTS
    if (plot.display) {
      suppressWarnings(print(p3))
    }
    if (verbose >= 2) {
      # if (n.invariant > 0) {
        print(df)
      # } else {
      #   print(df[, c(
      #     "pop",
      #     "n.Ind",
      #     "n.Loc",
      #     "polyLoc",
      #     "monoLoc",
      #     "all_NALoc",
      #     "Ho",
      #     "HoSD",
      #     "He",
      #     "HeSD",
      #     "uHe",
      #     "uHeSD",
      #     "FIS",
      #     "FISSD"
      #   )], row.names = FALSE)
      # }
    }
  }
  
  ########### FOR METHOD BASED ON INDIVIDUAL
  
  if (method == "ind") {
    if (verbose >= 2) {
      cat(report("  Calculating observed heterozygosity for individuals\n"))
      cat(report(
        "  Note: No adjustment for invariant loci (n.invariant set to 0)\n"
      ))
    }
    # Convert to matrix
    m <- as.matrix(x)
    
    # For each individual determine counts of hets, homs and NAs
    c.na <- array(NA, nInd(x))
    c.hets <- array(NA, nInd(x))
    c.hom0 <- array(NA, nInd(x))
    c.hom2 <- array(NA, nInd(x))
    c.nloc <- array(NA, nInd(x))
    for (i in 1:nInd(x)) {
      c.na[i] <- sum(is.na(m[i, ]))
      c.hets[i] <-
        sum(m[i, ] == 1, na.rm = TRUE) / (nLoc(x) - c.na[i])
      c.hom0[i] <-
        sum(m[i, ] == 0, na.rm = TRUE) / (nLoc(x) - c.na[i])
      c.hom2[i] <-
        sum(m[i, ] == 2, na.rm = TRUE) / (nLoc(x) - c.na[i])
      c.nloc[i] <- (nLoc(x) - c.na[i])
    }
    
    # Join the sample sizes with the heterozygosities
    df <-
      cbind.data.frame(x@ind.names, c.hets, c.hom0, c.hom2,c.nloc)
    names(df) <-
      c("ind.name", "Ho", "f.hom.ref", "f.hom.alt","n.Loc")
    
    # Boxplot
    if (plot.display) {
      upper <- ceiling(max(df$Ho) * 10) / 10
      p1 <-
        ggplot(df, aes(y = Ho)) +
        geom_boxplot(color = plot.colors.ind[1], fill = plot.colors.ind[2]) +
        coord_flip() +
        plot.theme +
        xlim(range = c(-1, 1)) +
        ylim(0, upper) +
        ylab(" ") +
        theme(axis.text.y = element_blank(), axis.ticks.y = element_blank()) +
        ggtitle("Observed Heterozygosity by Individual")
      
      # Histogram
      p2 <-
        ggplot(df, aes(x = Ho)) +
        geom_histogram(bins = 25,
                       color = plot.colors.ind[1],
                       fill = plot.colors.ind[2]) +
        coord_cartesian(xlim = c(0, upper)) +
        xlab("Observed heterozygosity") +
        ylab("Count") +
        plot.theme
    }
    
    if (plot.display) 
      {
      outliers_temp <-
      ggplot_build(p1)$data[[1]]$outliers[[1]]
    outliers <-
      data.frame(ID = as.character(df$ind.name[df$Ho %in% outliers_temp]),
                 Ho = outliers_temp)
    }
    
    # OUTPUT REPORT
    if (verbose >= 3) {
      cat("Reporting Heterozygosity by Individual\n")
      cat("No. of loci =", nLoc(x), "\n")
      cat("No. of individuals =", nInd(x), "\n")
      cat("  Minimum Observed Heterozygosity: ",
          round(min(df$Ho), 6),
          "\n")
      cat("  Maximum Observed Heterozygosity: ",
          round(max(df$Ho), 6),
          "\n")
      cat("  Average Observed Heterozygosity: ",
          round(mean(df$Ho), 6),
          "\n\n")
      cat("  Results returned as a dataframe\n\n")
      if (nrow(outliers) == 0) {
        cat("  No outliers detected\n\n")
      } else {
        cat("  Outliers detected\n")
        print(outliers)
        cat("\n")
      }
    }
    
    # PRINTING OUTPUTS
    if (plot.display) {
      p3 <- (p1 / p2) + plot_layout(heights = c(1, 4))
      print(p3)
    }
    if (verbose >= 2) {
      if(subsample.pop==TRUE){
        print(res_sub, row.names = FALSE)
      }
      print(df, row.names = FALSE)
    }
  }
  
  # Optionally save the plot ---------------------
  
  if(!is.null(plot.file)){
    tmp <- utils.plot.save(p3,
                           dir=plot.dir,
                           file=plot.file,
                           verbose=verbose)
  }
  
  if (verbose >= 3) {
    cat(report("  Returning a dataframe with heterozygosity values\n"))
  }
  
  # FLAG SCRIPT END
  if (verbose >= 1) {
    cat(report("Completed:", funname, "\n"))
  }
  
  # RETURN
  if(subsample.pop==TRUE){
   return(invisible(list(res_sub,df)))
  }else{
  return(invisible(df))
  }
  
}

Try the dartR.base package in your browser

Any scripts or data that you put into this service are public.

dartR.base documentation built on April 4, 2025, 2:45 a.m.