R/gl.pcoa.r

Defines functions gl.pcoa

Documented in gl.pcoa

#'@name gl.pcoa
#'@title Ordination applied to genotypes in a genlight object (PCA), in an fd
#'object, or to a distance matrix (PCoA)
#'@description
#' This function takes the genotypes for individuals and undertakes a Pearson
#' Principal Component analysis (PCA) on SNP or Sequence tag P/A (SilicoDArT) data; it
#' undertakes a Gower Principal Coordinate analysis (PCoA) if supplied with a
#' distance matrix. 

#' @param x Name of the genlight object or fd object containing the SNP data, or
#'  a distance matrix of type dist [required].
#' @param nfactors Number of axes to retain in the output of factor scores
#' [default 5].
#' @param pc.select Method for identifying substantial PC axes. One of Kaiser-Guttman,
#' broken-stick, Tracy-Widom [default broken-stick]
#' @param correction Method applied to correct for negative eigenvalues, either
#'  'lingoes' or 'cailliez' [Default NULL].
#' @param mono.rm If TRUE, remove monomorphic loci [default TRUE].
#' @param parallel TRUE if parallel processing is required (does fail under
#' Windows) [default FALSE].
#' @param n.cores Number of cores to use if parallel processing is requested
#' [default 16].
#' @param plot.out If TRUE, a diagnostic plot is displayed showing a scree plot
#' for the "informative" axes and a histogram of eigenvalues of the remaining 
#' "noise" axes [Default TRUE].
#' @param plot.theme Theme for the plot. See Details for options
#'  [default theme_dartR()].
#' @param plot.colors List of two color names for the borders and fill of the
#'  plot [default gl.colors(2)].
#' @param plot.dir Directory to save the plot RDS files [default as specified 
#' by the global working directory or tempdir()]
#' @param plot.file Name for the RDS binary file to save (base name only, exclude extension) [default NULL]
#' @param verbose verbose= 0, silent or fatal errors; 1, begin and end; 2,
#' progress log; 3, progress and results summary; 5, full report
#' [default 2 or as specified using gl.set.verbosity].

#' @details
#'The function is essentially a wrapper for glPca \{adegenet\} or pcoa \{ape\}
#'with default settings apart from those specified as parameters in this
#'function.
#'\strong{ Sources of stress in the visual representation }
#' While, technically, any distance matrix can be represented in an ordinated
#' space, the representation will not typically be exact.There are three major
#' sources of stress in a reduced-representation of distances or dissimilarities
#' among entities using PCA or PCoA. By far the greatest source comes from the
#' decision to select only the top two or three axes from the ordinated set of
#' axes derived from the PCA or PCoA. The representation of the entities such a
#' heavily reduced space will not faithfully represent the distances in the
#' input distance matrix simply because of the loss of information in deeper
#' informative dimensions. For this reason, it is not sensible to be too
#' precious about managing the other two sources of stress in the visual
#' representation.

#' The measure of distance between entities in a PCA is the Pearson Correlation
#' Coefficient, essentially a standardized Euclidean distance. This is both a
#' metric distance and a Euclidean distance and so the distances in the final 
#' ordination are faithful to those between entities in
#' the dataset. Note that missing values are imputed in PCA, and that this can be
#' a source of disparity between the distances between the entities in the dataset
#' and the distances in the ordinated space.
#' 
#' In PCoA, the second source of stress is the choice of distance 
#' measure or dissimilarity measure. While pretty much any distance or dissimilarity matrix 
#' can be represented in an ordinated space, the distances between entities can 
#' be faithfully represented in that space (that is, without stress) only if the 
#' distances are metric. Furthermore, for distances between entities to be 
#' faithfully represented in a rigid Cartesian space, the distance measure 
#' needs to be Euclidean. If this is not the case, the distances between the 
#' entities in the ordinated visualized space will not #' exactly represent 
#' the distances in the input matrix (stress will be non-zero).
#' This source of stress will be evident as negative eigenvalues in the deeper
#' dimensions. 
#' 
#' A third source of stress arises from having a sparse dataset, one with
#' missing values. This affects both PCA and PCoA. If the original data matrix
#' is not fully populated, that is, if there are missing values, then even a
#' Euclidean distance matrix will not necessarily be 'positive definite'. It
#' follows that some of the eigenvalues may be negative, even though the
#' distance metric is Euclidean. This issue is exacerbated when the number of
#' loci greatly exceeds the number of individuals, as is typically the case when
#' working with SNP data. The impact of missing values can be minimized by
#' stringently filtering on Call Rate, albeit with loss of data. An alternative
#' is given in a paper 'Honey, I shrunk the sample covariance matrix' and more
#' recently by Ledoit and Wolf (2018), but their approach has not been
#'  implemented here.
#'  
#'  Options for imputing missing values while minimizing distortion are provided in 
#'  the function gl.impute().

#' The good news is that, unless the sum of the negative eigenvalues, arising
#' from a non-Euclidean distance measure or from missing values, approaches
#' those of the final PCA or PCoA axes to be displayed, the distortion is
#' probably of no practical consequence and certainly not comparable to the
#' stress arising from selecting only two or three final dimensions out of
#' several informative dimensions for the visual representation.

#'\strong{ Function's output }

#' Two diagnostic plots are produced. The first is a Scree Plot, showing the
#' percentage variation explained by each of the PCA or PCoA axes, for those
#' axes are considered informative. The scree plot informs a decision on the number of
#' dimensions to be retained in the visual summaries. Various approaches are available
#' for identifying which axes are informative (in terms of containing biologically
#' significant variation) and which are noise axes. The simplest method is to consider
#' only those axes that explain more variance than the original variables on average as
#' being informative (pc.select="Kaiser-Guttman"). A second method (the default) is
#' the broken-stick method (pc.select="broken-stick"). A third method is the Tracy-Widom
#' statistical approach (pc.select="Tracy-Widom").
#' 
#' Once you have the informative axes identified, a judgement call is made as to how
#' many dimensions to retain and present as results. This requires a decision on how 
#' much information on structure in the data is to be discarded. Retaining at least those axes 
#' that explain 10% or more of variation is used by some as a rule of thumb.

#' The second graph is for diagnostic purposes only. It shows the distribution of 
#' eigenvalues for the remaining uninformative (noise) axes, including those with 
#' negative eigenvalues. If a plot.file is given, the ggplot arising from this function 
#' is saved as an binary file using gl.save(); can be reloaded with gl.load(). 
#' A file name must be  specified for the plot to be saved.
#' If a plot directory (plot.dir) is specified, the ggplot binary is saved to that
#' directory; otherwise to the tempdir(). 
#' Action is recommended (verbose >= 2) if the negative eigenvalues are
#' dominant, their sum approaching in magnitude the eigenvalues for axes
#' selected for the final visual solution.

#' Output is a glPca object conforming to adegenet::glPca but with only the
#' following retained.
#'\itemize{
#'\item  $call - The call that generated the PCA/PCoA
#'\item  $eig - Eigenvalues -- All eigenvalues (positive, null, negative).
#'\item  $scores - Scores (coefficients) for each individual
#'\item  $loadings - Loadings of each SNP for each principal component
#'    }
#'
#'
#'  Examples of other themes that can be used can be consulted in \itemize{
#'  \item \url{https://ggplot2.tidyverse.org/reference/ggtheme.html} and \item
#'  \url{https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/}
#'  }
#'
#' PCA was developed by Pearson (1901) and Hotelling (1933), whilst the best
#' modern reference is Jolliffe (2002). PCoA was developed by Gower (1966) while
#' the best modern reference is Legendre & Legendre (1998).

#'@return An object of class pcoa containing the eigenvalues and factor scores
#'@author Author(s): Arthur Georges and Jesus Castrejon. Custodian: Arthur Georges (Post to
#'\url{https://groups.google.com/d/forum/dartr})
#'@examples
#' # PCA (using SNP genlight object)
#' gl <- possums.gl
#' pca <- gl.pcoa(possums.gl[1:50,],verbose=2)
#' gl.pcoa.plot(pca,gl)
#' \donttest{
#' gs <- testset.gs
#' levels(pop(gs))<-c(rep('Coast',5),rep('Cooper',3),rep('Coast',5),
#' rep('MDB',8),rep('Coast',6),'Em.subglobosa','Em.victoriae')
#' # PCA (using SilicoDArT genlight object)
#' pca <- gl.pcoa(gs)
#' gl.pcoa.plot(pca,gs)
#' # Using a distance matrix
#' D <- gl.dist.ind(testset.gs, method='jaccard')
#' pcoa <- gl.pcoa(D,correction="cailliez")
#' gl.pcoa.plot(pcoa,gs)
#' }
#'@references
#'\itemize{
#'\item Cailliez, F. (1983) The analytical solution of the additive constant
#'problem. Psychometrika, 48, 305-308.
#'\item Gower, J. C. (1966) Some distance properties of latent root and vector
#'methods used in multivariate analysis. Biometrika, 53, 325-338.
#'\item Hotelling, H., 1933. Analysis of a complex of statistical variables into
#' Principal Components. Journal of Educational Psychology 24:417-441, 498-520.
#'\item Jolliffe, I. (2002) Principal Component Analysis. 2nd Edition, Springer,
#' New York.
#'\item Ledoit, O. and Wolf, M. (2018). Analytical nonlinear shrinkage of
#'large-dimensional covariance matrices. University of Zurich, Department of
#'Economics, Working Paper No. 264, Revised version. Available at SSRN:
#'https://ssrn.com/abstract=3047302 or http://dx.doi.org/10.2139/ssrn.3047302
#'\item Legendre, P. and Legendre, L. (1998). Numerical Ecology, Volume 24, 2nd
#'Edition. Elsevier Science, NY.
#'\item Lingoes, J. C. (1971) Some boundary conditions for a monotone analysis
#'of symmetric matrices. Psychometrika, 36, 195-203.
#'\item Pearson, K. (1901). On lines and planes of closest fit to systems of
#'points in space. Philosophical Magazine. Series 6, vol. 2, no. 11, pp.
#'559-572.
#' }
#' @seealso \code{\link{gl.pcoa.plot}}
#' @family data exploration functions
#' @importFrom ape pcoa
#' @export

gl.pcoa <- function(x,
                    nfactors = 5,
                    pc.select="broken-stick",
                    correction = NULL,
                    mono.rm = TRUE,
                    parallel = FALSE,
                    n.cores = 1,
                    plot.out = TRUE,
                    plot.theme = theme_dartR(),
                    plot.colors = gl.colors(2),
                    plot.file=NULL,
                    plot.dir=NULL,
                    verbose = NULL) {
  
    # SET VERBOSITY
    verbose <- gl.check.verbosity(verbose)
    
    # FLAG SCRIPT START
    funname <- match.call()[[1]]
    utils.flag.start(func = funname,
                     build = "2024V1",
                     verbose = verbose)
    
    # SET WORKING DIRECTORY
    plot.dir <- gl.check.wd(plot.dir,verbose=0)
    
    # CHECK DATATYPE
    datatype <-
        utils.check.datatype(x,
                             accept = c("SNP", "SilicoDArT", "dist", "fd"),
                             verbose = verbose)
    
    if (datatype == "SNP" | datatype == "SilicoDArT") {
        x <- gl.filter.allna(x, verbose = 0)
    }
    
    if (datatype == "fd") {
        datatype <- utils.check.datatype(x$gl, verbose = verbose)
        x <- x$gl
        x <- gl.filter.allna(x, verbose = 0)
    }
    
    # SCRIPT SPECIFIC ERROR CHECKING
    
    if (mono.rm == TRUE &
        (datatype == "SNP" | datatype == "SilicoDArT")) {
        x <- gl.filter.monomorphs(x, verbose = 0)
    }
    
    if((datatype == "SNP" | datatype == "SilicoDArT")){
        if (nInd(x) < 2) {
        stop(
            error(
                "Fatal Error: Only one individual or no individuals present in the dataset\n"
            )
        )
        }
   
        if (nLoc(x) < nInd(x)) {
        if(verbose >=2){cat(
            warn(
                "  Warning: Number of loci is less than the number of individuals to be represented\n"
            )
        )}
        }
    }
    
    if (is.null(correction)) {
        correction <- "none"
    } else {
        correction <- tolower(correction)
        if (correction != "lingoes" &&
            correction != "cailliez") {
            if (verbose >= 2) {
                cat(
                    warn(
                        "  Warning: Correction if specified needs to be lingoes or cailliez, set to the default 'None'\n"
                    )
                )
            }
            correction <- "none"
        }
    }
    
    # if(pc.select=="Tracy-Widom"){
    #   cat(warn("  Note: The Tracy-Widom Criterion for determining informative axes is not yet implemented. Selecting 'broken-stick criterion'\n"))
    #   pc.select <- "broken-stick"
    # }
    
    # FUNCTIONS
    
    twtest <- function(x){
      # Returns the p-value of the Tracy Widom statistics 
      # (righ-side cumulative function). The function uses linear interpolation 
      # over the tables reported originally by Patterson et al (2006).
      
      # The package RMTstat provides the same values (1-ptw(x)) with more precision but for smaller values
      
      # twtest(0.6)
      
      # Author: Jesus Castrejon
      
      # This array holds a pre-calculated table for the Tracy-Widom distribution.
      # Each group of three numbers corresponds to a value `x` from the distribution,
      # the density at `x`, and the cumulative distribution up to `x`.
      twtable <- c(-8.000000000,1.000000000,0.000000000,-7.900000000,1.000000000,0.000000000,-7.800000000,1.000000000,0.000000000,-7.700000000,1.000000000,0.000000000,-7.600000000,1.000000000,0.000000000,
                   -7.500000000,1.000000000,0.000000001,-7.400000000,1.000000000,0.000000002,-7.300000000,0.999999999,0.000000005,-7.200000000,0.999999999,0.000000010,-7.100000000,0.999999997,0.000000019,
                   -7.000000000,0.999999995,0.000000039,-6.900000000,0.999999989,0.000000076,-6.800000000,0.999999978,0.000000146,-6.700000000,0.999999958,0.000000276,-6.600000000,0.999999920,0.000000511,
                   -6.500000000,0.999999849,0.000000932,-6.400000000,0.999999723,0.000001670,-6.300000000,0.999999498,0.000002942,-6.200000000,0.999999105,0.000005097,-6.100000000,0.999998431,0.000008683,
                   -6.000000000,0.999997293,0.000014554,-5.900000000,0.999995401,0.000024005,-5.800000000,0.999992309,0.000038969,-5.700000000,0.999987331,0.000062279,-5.600000000,0.999979441,0.000098012,
                   -5.500000000,0.999967125,0.000151923,-5.400000000,0.999948187,0.000231995,-5.300000000,0.999919496,0.000349097,-5.200000000,0.999876655,0.000517756,-5.100000000,0.999813597,0.000757035,
                   -5.000000000,0.999722082,0.001091485,-4.900000000,0.999591101,0.001552137,-4.800000000,0.999406175,0.002177466,-4.700000000,0.999148569,0.003014256,-4.600000000,0.998794427,0.004118267,
                   -4.500000000,0.998313849,0.005554591,-4.400000000,0.997669962,0.007397591,-4.300000000,0.996818016,0.009730295,-4.200000000,0.995704571,0.012643159,-4.100000000,0.994266851,0.016232112,
                   -4.000000000,0.992432322,0.020595851,-3.900000000,0.990118582,0.025832397,-3.800000000,0.987233631,0.032034971,-3.700000000,0.983676579,0.039287325,-3.600000000,0.979338843,0.047658716,
                   -3.500000000,0.974105853,0.057198759,-3.400000000,0.967859270,0.067932445,-3.300000000,0.960479677,0.079855636,-3.200000000,0.951849687,0.092931337,-3.100000000,0.941857369,0.107087044,
                   -3.000000000,0.930399881,0.122213418,-2.900000000,0.917387157,0.138164458,-2.800000000,0.902745495,0.154759279,-2.700000000,0.886420892,0.171785501,-2.600000000,0.868381957,0.189004169,
                   -2.500000000,0.848622271,0.206156009,-2.400000000,0.827162053,0.222968755,-2.300000000,0.804049066,0.239165233,-2.200000000,0.779358684,0.254471803,-2.100000000,0.753193114,0.268626779,
                   -2.000000000,0.725679802,0.281388431,-1.900000000,0.696969061,0.292542221,-1.800000000,0.667231036,0.301906945,-1.700000000,0.636652122,0.309339558,-1.600000000,0.605430961,0.314738516,
                   -1.500000000,0.573774198,0.318045543,-1.400000000,0.541892124,0.319245849,-1.300000000,0.509994383,0.318366852,-1.200000000,0.478285870,0.315475570,-1.100000000,0.446962951,0.310674866,
                   -1.000000000,0.416210105,0.304098784,-0.900000000,0.386197065,0.295907232,-0.800000000,0.357076521,0.286280263,-0.700000000,0.328982392,0.275412215,-0.600000000,0.302028689,0.263505933,
                   -0.500000000,0.276308949,0.250767272,-0.400000000,0.251896179,0.237400053,-0.300000000,0.228843301,0.223601597,-0.200000000,0.207183986,0.209558915,-0.100000000,0.186933854,0.195445624,
                   0.000000000,0.168091934,0.181419571,0.100000000,0.150642330,0.167621190,0.200000000,0.134556018,0.154172511,0.300000000,0.119792709,0.141176787,0.400000000,0.106302721,0.128718659,
                   0.500000000,0.094028817,0.116864772,0.600000000,0.082907953,0.105664756,0.700000000,0.072872924,0.095152500,0.800000000,0.063853860,0.085347620,0.900000000,0.055779577,0.076257058,
                   1.000000000,0.048578763,0.067876743,1.100000000,0.042180992,0.060193257,1.200000000,0.036517582,0.053185457,1.300000000,0.031522284,0.046826015,1.400000000,0.027131832,0.041082856,
                   1.500000000,0.023286351,0.035920459,1.600000000,0.019929640,0.031301023,1.700000000,0.017009350,0.027185487,1.800000000,0.014477062,0.023534398,1.900000000,0.012288293,0.020308645,
                   2.000000000,0.010402429,0.017470054,2.100000000,0.008782605,0.014981856,2.200000000,0.007395547,0.012809046,2.300000000,0.006211384,0.010918644,2.400000000,0.005203434,0.009279861,
                   2.500000000,0.004347977,0.007864200,2.600000000,0.003624031,0.006645482,2.700000000,0.003013114,0.005599836,2.800000000,0.002499018,0.004705636,2.900000000,0.002067590,0.003943413,
                   3.000000000,0.001706520,0.003295741,3.100000000,0.001405143,0.002747112,3.200000000,0.001154255,0.002283795,3.300000000,0.000945945,0.001893694,3.400000000,0.000773431,0.001566204,
                   3.500000000,0.000630927,0.001292071,3.600000000,0.000513508,0.001063253,3.700000000,0.000416999,0.000872795,3.800000000,0.000337871,0.000714702,3.900000000,0.000273152,0.000583831,
                   4.000000000,0.000220344,0.000475784,4.100000000,0.000177359,0.000386816,4.200000000,0.000142452,0.000313749,4.300000000,0.000114170,0.000253894,4.400000000,0.000091308,0.000204987,
                   4.500000000,0.000072871,0.000165125,4.600000000,0.000058035,0.000132716,4.700000000,0.000046124,0.000106431,4.800000000,0.000036582,0.000085163,4.900000000,0.000028955,0.000067996,
                   5.000000000,0.000022872,0.000054172,5.100000000,0.000018030,0.000043066,5.200000000,0.000014185,0.000034164,5.300000000,0.000011138,0.000027045,5.400000000,0.000008728,0.000021365,
                   5.500000000,0.000006826,0.000016843,5.600000000,0.000005328,0.000013250,5.700000000,0.000004151,0.000010403,5.800000000,0.000003228,0.000008151,5.900000000,0.000002505,0.000006374,
                   6.000000000,0.000001941,0.000004974,6.100000000,0.000001501,0.000003874,6.200000000,0.000001158,0.000003011,6.300000000,0.000000892,0.000002336,6.400000000,0.000000686,0.000001809,
                   6.500000000,0.000000527,0.000001398,6.600000000,0.000000403,0.000001078,6.700000000,0.000000308,0.000000830,6.800000000,0.000000235,0.000000638,6.900000000,0.000000179,0.000000489,
                   7.000000000,0.000000136,0.000000375,7.100000000,0.000000104,0.000000286,7.200000000,0.000000079,0.000000218,7.300000000,0.000000059,0.000000166,7.400000000,0.000000045,0.000000126,
                   7.500000000,0.000000034,0.000000096,7.600000000,0.000000025,0.000000073,7.700000000,0.000000019,0.000000055,7.800000000,0.000000014,0.000000041,7.900000000,0.000000011,0.000000031,
                   8.000000000,0.000000008,0.000000023)
      # If the input `x` is 8 or greater, the function returns 0 because the table 
      # does not include values at or beyond this point.
      if(x >= 8){
        return(0)
      }
      
      i <- 0
      # Search through twtable to find the right index where the value of `x` 
      # fits between entries in the table.
      while(i < 162 & x >= twtable[(i)*3+1]){
        i <- i+1
      }
      # lower boundary 
      z0 = twtable[(i-1)*3 + 2]
      # upper boundary
      z1 = twtable[(i)*3 + 2]
      # If at the end of the table, return the density at the lower boundary.
      if(i == 162){
        return(z0)
        # x is less than the smallest x in the table, return the density at 
        # the upper boundary
      } else if(i == 0){
        return(z1)
        # x is within the range of the table, interpolate
      } else{
        # Linear interpolation to find the density at x
        result <- z0 + (z1- z0)*(x - twtable[(i-1)*3+1])/(twtable[(i)*3+1] - twtable[(i-1)*3+1])
      }
      return(result)
    }

    tw.statistics <- function(eigenvalues,alpha=0.05,plot=TRUE) {
      
      # Returns two vectors, one holding the eigenvalues of "informative" dimensions,
      # the other holding the eigenvalues of the "noise" dimensions. Criterion is
      # statistical significance under the Tracy-Widom distribution (implemented by
      #Patterson, 2006.
      
      # Author: Jesus Castrejon
      
      # Reference: Patterson, N., Price, A. L., & Reich, D. (2006). Population 
      # structure and eigenanalysis. PLoS Genetics, 2, e190. 
      # https://doi.org/10.1371/journal.pgen.0020190
      
      # Takes M as the number of eigenvalues
      M <- length(eigenvalues)
      
      # Sum eigenvalues and squared eigenvalues
      s  <- sum(eigenvalues)
      s2 <- sum(eigenvalues**2.)
      
      # Initialise lists to store values
      index <- 1:length(eigenvalues)
      p.value <- c()
      tw.stat <- c()
      
      # Checks TW statistics for each eigenvalue
      for(i in index){
        Mp <- M-i+1
        nhat <- (Mp+2)*s**2./(Mp*s2 - s**2.)
        lambda <- eigenvalues[i]*Mp/s
        # Fix effective n for very small eigenvalues
        if(nhat<1.){
          nhat <- 1.
        }
        mu <- (sqrt(nhat-1.) + sqrt(Mp))**2./nhat
        sigma <- (sqrt(nhat-1.) + sqrt(Mp))*(1./sqrt(nhat-1) + 1./sqrt(Mp))**(1./3)/nhat
        # TW statistic
        x <- (lambda-mu)/sigma
        # Gets pvalue of the Tracy-Widom statistics
        pvalue <- twtest(x)
        # Updates sum of eigenvalues
        s <- s - eigenvalues[i]
        s2 <- s2 - eigenvalues[i]**2.
        p.value[i] <- pvalue
        tw.stat[i] <- x
      }
      
      # Adjust pvalues
      p.value <- stats::p.adjust(p.value)
      
      # Creates dataframe with eigenvalues and filter into noisy/structure base on pvalue>alpha
      df <- data.frame(index,eigenvalues,tw.stat,p.value)
      idx <- which(p.value>alpha)[1]
      struc <- df[1:(idx-1),]
      noise <- df[(idx):length(eigenvalues),]
      struc$structure <- 'structured'
      noise$structure <- 'noisy'
      
      # Plots eigenvalues
      if(plot){
        p1 <- ggplot() +  
          geom_point(data=struc, aes(x=index,y = eigenvalues,color = structure)) +
          geom_point(data=noise, aes(x=index,y = eigenvalues,color = structure)) +
          scale_y_continuous(trans='log10') + 
          scale_color_manual(values=c("blue","red", "green"))+
          theme(legend.title=element_blank())
        print(p1)
      }
      return(list('struct'=struc,'noise'=noise))
    }
    
    # bs.statistics <- function(eigenvalues,plot=FALSE){
    #   # Returns two vectors, one holding the eigenvalues of "informative" dimensions,
    #   # the other holding the eigenvalues of the "noise" dimensions. Criterion is
    #   # based on the broken-stick algorithm (Macarthur, 1957).
    #   
    #   # Author: Jesus Castrejon
    #   
    #   # Reference: Macarthur, R. (1957). On the relative abundance of bird species. 
    #   # Proceedings of the National Academy of Sciences USA, 43, 293–295. 
    #   # https://doi.org/10.1073/pnas.43.3.293
    #   
    #   # Set number of eigenvalues
    #   n <- length(eigenvalues)
    #   index <- 1:n
    #   # Calculate expected lengths under Broken stick approach
    #   broken.sticks <- sum(eigenvalues)*rev(cumsum(1/n:1))/n
    #   # Creates dataframe with eigenvalues and filter noise/structure based on BS lengths
    #   df <- data.frame(index,eigenvalues,broken.sticks)
    #   #taking the number of the eigenvalue that is less than the broken sticks
    #   idx <- eigenvalues>broken.sticks
    #   idx2 <- which(eigenvalues>broken.sticks)
    #   if (length(idx2) == 0) { #Suggested by ChatGPT
    #     struc <- data.frame()
    #     noise <- df
    #   }
    #   #testing whether there is a gap 
    #   #test1 <- which((c(idx2[-1],idx2[1])-idx2)>2)
    #   test1 <- which(diff(idx2) > 2) # Suggested by ChatGPT
    #   # if there is a gap 
    #   if(length(test1) > 0){
    #     idx[(test1+1):length(idx)] <- FALSE
    #   }
    #   #setting false to everything that is after the gap
    #   struc <- df[which(idx==TRUE),]
    #   noise <- df[which(idx==FALSE),]
    #   struc$structure <- 'structured'
    #   noise$structure <- 'noisy'
    #   
    #   # Creates Plot of eigenvalues
    #   if(plot){
    #     p1 <- ggplot() +  
    #       geom_point(data=struc, aes(x=index,y = eigenvalues,color = structure)) +
    #       geom_point(data=noise, aes(x=index,y = eigenvalues,color = structure)) +
    #       geom_line(data=df, aes(x=index,y = broken.sticks,color = 'broken sticks'),linetype="dashed") +
    #       scale_y_continuous(trans='log10') + 
    #       scale_color_manual(values=c("black","blue", "red")) +
    #       theme(legend.title=element_blank())
    #     print(p1)
    #   }
    #   
    #   return(list('struct'=struc,'noise'=noise))
    # }
    
    bs.statistics <- function(eigenvalues, plot = FALSE, gap_threshold = 2) {
      # Function to separate eigenvalues into "structured" and "noisy" dimensions 
      # using the Broken Stick algorithm (MacArthur, 1957).
      #
      # INPUTS:
      #   - eigenvalues: A numeric vector of eigenvalues (e.g., from PCA or related analyses).
      #   - plot: A logical flag (TRUE/FALSE) to indicate whether to generate a plot
      #           of eigenvalues, broken stick values, and their classifications.
      #   - gap_threshold: An integer specifying the minimum gap between consecutive
      #                    eigenvalues (indices) to classify as noise (default = 2).
      #
      # OUTPUT:
      #   - A list containing two data frames:
      #       - 'struct': Data frame of eigenvalues classified as "structured".
      #       - 'noise': Data frame of eigenvalues classified as "noisy".
      #
      # Author: Jesus Castrejon
      # Enhanced by ChatGPT for clarity, robustness, and parameterization.
      #
      # Reference: MacArthur, R. (1957). On the relative abundance of bird species.
      # Proceedings of the National Academy of Sciences USA, 43, 293–295.
      # https://doi.org/10.1073/pnas.43.3.293
      
      ### Step 1: Initialization ###
      # Number of eigenvalues and their indices
      n <- length(eigenvalues)
      index <- 1:n
      
      ### Step 2: Calculate Broken Stick Thresholds ###
      # Broken stick model for expected eigenvalue lengths
      broken.sticks <- sum(eigenvalues) * rev(cumsum(1 / n:1)) / n
      
      # Create a data frame to hold eigenvalues and their thresholds
      df <- data.frame(index = index, 
                       eigenvalues = eigenvalues, 
                       broken.sticks = broken.sticks)
      
      ### Step 3: Classify Eigenvalues as "Structured" or "Noisy" ###
      # Logical vector: TRUE if eigenvalue > broken stick threshold
      idx <- eigenvalues > broken.sticks
      idx2 <- which(idx)  # Indices of eigenvalues greater than the thresholds
      
      # Handle edge case: No eigenvalues greater than thresholds
      if (length(idx2) == 0) {
        # Return all eigenvalues as "noisy" if no structure is detected
        return(list(struct = data.frame(), noise = df))
      }
      
      # Identify gaps in structured indices
      # - A "gap" occurs if the difference between consecutive indices > gap_threshold
      gaps <- which(diff(idx2) > gap_threshold)
      
      # If gaps exist, mark all eigenvalues after the gap as "noisy"
      if (length(gaps) > 0) {
        idx[(gaps[1] + 1):length(idx)] <- FALSE
      }
      
      # Separate structured and noisy eigenvalues based on updated idx
      struc <- df[idx, ]
      noise <- df[!idx, ]
      
      # Add classification labels to each subset
      struc$structure <- 'structured'
      noise$structure <- 'noisy'
      
      ### Step 4: Optional Plotting ###
      if (plot) {
        # Generate a plot to visualize structured and noisy eigenvalues
        p1 <- ggplot() +
          geom_point(data = struc, aes(x = index, y = eigenvalues, color = structure)) +
          geom_point(data = noise, aes(x = index, y = eigenvalues, color = structure)) +
          geom_line(data = df, aes(x = index, y = broken.sticks, color = 'broken sticks'), 
                    linetype = "dashed") +
          scale_y_continuous(trans = 'log10') +  # Log scale for eigenvalues
          scale_color_manual(values = c("black", "blue", "red")) + 
          theme_minimal() +
          theme(legend.title = element_blank()) +
          labs(title = "Broken Stick Analysis",
               x = "Index",
               y = "Eigenvalues (log10 scale)")
        print(p1)
      }
      
      ### Step 5: Return Results ###
      return(list(struct = struc, noise = noise))
    }
    
    # DO THE JOB
    
    ######## DISTANCE ANALYSIS
    
    if (datatype == "dist"){
        D <- x
        
        # Calculate the pcoa
        if (verbose >= 2) {
            if (correction == "none") {
                cat(
                    report(
                        "  Performing a PCoA, individuals as entities, no correction applied\n"
                    )
                )
                title <-
                    paste0("PCoA on Distance Matrix (no correction)\nScree Plot\n (informative axes only -- ",pc.select,"criterion)")
            } else {
                cat(
                    report(
                        "  Performing a PCoA, individuals as entities,",
                        correction,
                        "correction for stress (-ive eigenvalues) applied\n"
                    )
                )
                title <-
                    paste0(
                        "PCoA on Distance Matrix(",
                        correction,
                        ")\nScree Plot (informative axes only -- ",pc.select,"criterion)")
            }
        }
        pco <- ape::pcoa(D, correction = correction, rn = labels(D))
        
        # Extract relevant variables
        
        if (correction == "none") {
            eig.raw <- pco$values$Eigenvalues
        } else {
            eig.raw <- pco$values$Corr_eig
        }
        # M=length(eig.raw)+1
        
        # Identify the number of axes with explanatory value
        
        if(pc.select=="Kaiser-Guttman"){
          eig.raw.pos <- eig.raw[eig.raw >= 0]
          eig.raw.pos.pc <- eig.raw.pos * 100 / sum(eig.raw.pos)
          eig.top <- eig.raw.pos[eig.raw.pos > mean(eig.raw.pos)] 
          eig.top.pc <- round(eig.top * 100 / sum(eig.raw.pos), 1)
          eig.raw.noise <- eig.raw[eig.raw <= mean(eig.raw)]
        } else if(pc.select=="broken-stick"){
          eig.raw.pos <- eig.raw[eig.raw >= 0]
          eig.raw.pos.pc <- eig.raw.pos * 100 / sum(eig.raw.pos)
          tmp <- bs.statistics(eigenvalues = eig.raw.pos)
          eig.top <- tmp$struct$eigenvalues
          eig.top.pc <- round(eig.top * 100 / sum(eig.raw.pos), 1)
          eig.raw.noise <- tmp$noise$eigenvalues
        } else if(pc.select=="Tracy-Widom"){
          # M <- nInd(x)
          eig.raw.pos <- eig.raw[eig.raw >= 0]
          eig.raw.pos.pc <- eig.raw.pos * 100 / sum(eig.raw.pos)
          tmp <- tw.statistics(eig.raw.pos)
          eig.top <- tmp$struct$eigenvalues
          eig.top.pc <- round(eig.top * 100 / sum(eig.raw.pos), 1)
          eig.raw.noise <- tmp$noise$eigenvalues
        } else {
          stop("Error: incorrect specification for model of significant eigenvalues. Set to Tracy-Widom\n")
          pc.select <- "Tracy-Widom"
        }  
        
        if (any(eig.raw < 0)) {
            if (verbose >= 2) {
                problem <- (-sum(eig.raw[eig.raw < 0])*100 / mean(eig.raw[1:3]))
                cat(
                    warn(
                        "  Warning: Some eigenvalues negative -- sum to",
                        round(problem, 2),
                        "% of the mean eigenvalue for PCoA axes 1-3\n"
                    )
                )
                cat(
                    report(
                        "    Tolerable negative eigenvalues should sum to much less than the eigenvalues of displayed PCoA axes (say, less than 20%)\n"
                    )
                )
                if (problem > 20) {
                    cat(
                        report(
                            "    If the stress (negative eigenvalues) is considered a problem, and you might reasonably choose to ignore it, you have the following options:\n"
                        )
                    )
                    cat(
                        report(
                            "    (a) Apply more stringent filtering on Call Rate before generating the distance matrix; or\n"
                        )
                    )
                    cat(
                        report(
                            "    (b) Select an alternate distance measure, preferably a metric distance or better still, a Euclidean distance, if you have not already; or\n"
                        )
                    )
                    cat(
                        report(
                            "    (c) Apply a transformation (correction) to eliminate the negative eigenvalues. If this was already done, try another correction; or\n"
                        )
                    )
                    cat(
                        report(
                            "    (d) Interpret the visual representation of the ordination with caution, seeking corroborating evidence.\n"
                        )
                    )
                }
            }
        }
        
        # Provide a summary
        if (verbose >= 3) {
            if (correction == "lingoes" | correction == "cailliez") {
                cat(" Correction",
                    correction,
                    "applied to remove negative eigenvalues\n")
                cat(
                    paste(
                        "  Uncorrected ordination yielded",
                        length(eig.top),
                        "informative dimensions(",pc.select,"criterion) from",
                        nInd(x) - 1,
                        "original dimensions\n"
                    )
                )
            } else {
                cat(
                    paste(
                        "  Ordination yielded",
                        length(eig.top),
                        "informative dimensions(",pc.select,"criterion) from",
                        dim(as.matrix(D))[1],
                        "original dimensions\n"
                    )
                )
            }
            cat(paste(
                "    PCoA Axis 1 explains",
                round(eig.raw.pos.pc[1], 1),
                "% of the total variance\n"
            ))
            if(length(eig.top >= 2)){cat(
                paste(
                    "    PCoA Axis 1 and 2 combined explain",
                    round(eig.raw.pos.pc[1] + eig.raw.pos.pc[2], 1),
                    "% of the total variance\n"))
            }
            if(length(eig.top >= 3)){cat(paste(
                "    PCoA Axis 1-3 combined explain",
                round(
                    eig.raw.pos.pc[1] + eig.raw.pos.pc[2] + eig.raw.pos.pc[3],
                    1
                ),
                "% of the total variance\n"))
            }
        }
        # Construct a universal output file
        p.object <- list()
        p.object$scores <- pco$vectors[, 1:nfactors]
        p.object$eig <- pco$values$Eigenvalues
        p.object$loadings <- pco$vectors.cor[, 1:nfactors]
        p.object$call <- match.call()
        
    }  ######## END DISTANCE DATA
    
    ######## SNP or P/A DATA, PCA
    
    if (datatype == "SNP" || datatype == "SilicoDArT")
    {
        if (datatype == "SNP") {
            if (verbose >= 2) {cat(
                    report(
                        "  Performing a PCA, individuals as entities, loci as attributes, SNP genotype as state\n"
                    )
                )}
                title <-
                    paste0("PCA on SNP Genotypes\nScree Plot\n (informative axes only -- ",pc.select," criterion)")
            }
            if (datatype == "SilicoDArT") {
                if (verbose >= 2) {cat(
                    report(
                        "  Performing a PCA, individuals as entities, loci as attributes, Tag P/A as state\n"
                    )
                )}
                title <-
                    paste0("PCA on Tag P/A Data\nScree Plot\n (informative axes only -- ",pc.select," criterion)")
            }
            
        pca <-
            glPca(x,
                  nf = nfactors,
                  parallel = parallel,
                  n.cores = n.cores)
        
        # # Identify the number of axes with explanatory value greater than the original variables on average
        # 
        # 
        # eig.raw.pos <- eig.raw[eig.raw >= 0]
        # eig.raw.pos.pc <- eig.raw.pos * 100 / sum(eig.raw.pos)
        # eig.top <- eig.raw.pos[eig.raw.pos >= mean(eig.raw.pos)]
        # eig.top.pc <- round(eig.top * 100 / sum(eig.raw.pos), 1)
        # eig.raw.noise <- eig.raw[eig.raw <= mean(eig.raw)]
        
        # Extract relevant variables
        eig.raw <- pca$eig

        # Identify the number of axes with explanatory value
        # Identify the number of axes with explanatory value
        
        if(pc.select=="Kaiser-Guttman"){
          eig.raw.pos <- eig.raw[eig.raw >= 0]
          eig.raw.pos.pc <- eig.raw.pos * 100 / sum(eig.raw.pos)
          eig.top <- eig.raw.pos[eig.raw.pos > mean(eig.raw.pos)] 
          eig.top.pc <- round(eig.top * 100 / sum(eig.raw.pos), 1)
          eig.raw.noise <- eig.raw[eig.raw <= mean(eig.raw)]
        } else if(pc.select=="broken-stick"){
          eig.raw.pos <- eig.raw[eig.raw >= 0]
          eig.raw.pos.pc <- eig.raw.pos * 100 / sum(eig.raw.pos)
          tmp <- bs.statistics(eigenvalues = eig.raw.pos,plot = T)
          eig.top <- tmp$struct$eigenvalues
          eig.top.pc <- round(eig.top * 100 / sum(eig.raw.pos), 1)
          eig.raw.noise <- tmp$noise$eigenvalues
        } else if(pc.select=="Tracy-Widom"){
          # M <- nInd(x)
          eig.raw.pos <- eig.raw[eig.raw >= 0]
          eig.raw.pos.pc <- eig.raw.pos * 100 / sum(eig.raw.pos)
          tmp <- tw.statistics(eig.raw.pos)
          eig.top <- tmp$struct$eigenvalues
          eig.top.pc <- round(eig.top * 100 / sum(eig.raw.pos), 1)
          eig.raw.noise <- tmp$noise$eigenvalues
        } else {
          stop("Error: incorrect specification for model of significant eigenvalues. Set to Tracy-Widom\n")
          pc.select <- "Tracy-Widom"
        }  
        
        if (any(eig.raw < 0) && verbose >= 2) {
          problem <- (-sum(eig.raw[eig.raw < 0]) / mean(eig.raw[1:3])) * 100
          cat(warn("  Warning: Some eigenvalues negative -- they sum to", round(problem, 2), "% of the mean eigenvalue for PCA axes 1-3\n"))
          cat(report("    Tolerable negative eigenvalues should sum to much less than the eigenvalues of displayed PCA axes (say, less than 20%)\n"))
          if (problem > 20) {
            cat(report("    If the stress (negative eigenvalues) is considered a problem, and you might reasonably choose to ignore it, you have the following options:\n"))
            cat(report("    (a) Apply more stringent filtering on Call Rate and repeat the PCA; or\n"))
            cat(report("    (b) Undertake a PCoA with an appropriate distance measure and a transformation (correction) to eliminate the negative eigenvalues; or\n"))
            cat(report("    (c) Interpret the visual representation of the ordination with caution, seeking corroborating evidence.\n"))
          }
        }
        
        e <- pca$eig[pca$eig > sum(pca$eig / length(pca$eig))]
        e <- eig.top
        e <- round(e * 100 / sum(pca$eig), 1)
        if (verbose >= 3) {
          cat(paste("  Ordination yielded", length(e), "informative dimensions(", pc.select, "criterion) from", nInd(x) - 1, "original dimensions\n"))
          cat(paste("    PCA Axis 1 explains", e[1], "% of the total variance\n"))
          cat(paste("    PCA Axis 1 and 2 combined explain", e[1] + e[2], "% of the total variance\n"))
          cat(paste("    PCA Axis 1-3 combined explain", e[1] + e[2] + e[3], "% of the total variance\n"))
        }
        
        # Construct a universal output file
        p.object <- list()
        p.object$scores <- pca$scores
        p.object$eig <- pca$eig
        p.object$loadings <- pca$loadings
        p.object$call <- match.call()
        
    }  #### END SNP ANALYSIS
    
    # PLOT THE DIAGNOSTICS
    
    # Plot Scree plot avoid no visible binding probl
    eigenvalue <- percent <- NULL
    
    df <-
        data.frame(eigenvalue = seq(1:length(eig.top.pc)), percent = eig.top.pc)
    if (datatype == "SNP") {
        xlab <- paste("PCA Axis")
    } else {
        xlab <- paste("PCoA Axis")
    }
    ylab <- paste("Percentage Contribution")
    
    p1 <-
      ggplot(df, aes(x = eigenvalue, y = percent)) + 
      geom_line(color = plot.colors[2]) + 
      geom_point(color = plot.colors[1], size = 4) +
        geom_hline(yintercept = 10, color = "blue") + 
      plot.theme + xlab(xlab) + 
      ylab(ylab) + 
      ggtitle(title)
    
    if (any(eig.raw < 0)) {
        main <- "Noise Axes -- Warning: some eigenvalues < 0, not shown"
    } else {
        main <- "Noise Axes -- all eigenvalues positive"
    }
    
    p2 <-
      ggplot(as.data.frame(eig.raw.noise), aes(x = eig.raw.noise)) + 
      geom_histogram(bins = 50, color = plot.colors[1], fill = plot.colors[2]) +
        geom_vline(xintercept = 0, color = "blue") +
      plot.theme + 
      xlab("Eigenvalue") + 
      ylab("Count") + 
      ggtitle(main)
    
    if(any(eig.raw.noise==0)){
      if(verbose>=2){
        cat(warn("Warning: Some eigenvalues are zero, which suggests redundancy in the SNP loci\n"))
        cat(warn("  This may arise if there are duplicate individuals or clones in the dataset. Worth checking.\n"))    
      }
    }
    
    # printing outputs
    p3 <- (p1 / p2)
    if (verbose >= 1) {
        if(plot.out){print(p3)}
    }

    # Optionally save the plot ---------------------
    
    if(!is.null(plot.file)){
      tmp <- utils.plot.save(p3,
                             dir=plot.dir,
                             file=plot.file,
                             verbose=verbose)
    }

    # FLAG SCRIPT END
    
    if (verbose > 0) {
        cat(report("Completed:", funname, "\n"))
    }
    
    class(p.object) <- "glPca"
    invisible(p.object)
}

Try the dartR.base package in your browser

Any scripts or data that you put into this service are public.

dartR.base documentation built on April 4, 2025, 2:45 a.m.