R/gl.pcoa.r
In dartR: Importing and Analysing 'SNP' and 'Silicodart' Data Generated by Genome-Wide Restriction Fragment Analysis

Documented in gl.pcoa

#'@name gl.pcoa
#'@title Ordination applied to genotypes in a genlight object (PCA), in an fd
#'object, or to a distance matrix (PCoA)
#'@description
#' This function takes the genotypes for individuals and undertakes a Pearson
#' Principal Component analysis (PCA) on SNP or Tag P/A (SilicoDArT) data; it
#' undertakes a Gower Principal Coordinate analysis (PCoA) if supplied with a
#' distance matrix. Technically, any distance matrix can be represented in an
#' ordinated space using PCoA.
#'
#' @param x Name of the genlight object or fd object containing the SNP data, or
#'  a distance matrix of type dist [required].
#' @param nfactors Number of axes to retain in the output of factor scores
#' [default 5].
#' @param correction Method applied to correct for negative eigenvalues, either
#'  'lingoes' or 'cailliez' [Default NULL].
#' @param mono.rm If TRUE, remove monomorphic loci [default TRUE].
#' @param parallel TRUE if parallel processing is required (does fail under
#' Windows) [default FALSE].
#' @param n.cores Number of cores to use if parallel processing is requested
#' [default 16].
#' @param plot.out If TRUE, a diagnostic plot is displayed showing a scree plot
#' for the "informative" axes and a histogram of eigenvalues of the remaining 
#' "noise" axes [Default TRUE].
#' @param plot_theme Theme for the plot. See Details for options
#'  [default theme_dartR()].
#' @param plot_colors List of two color names for the borders and fill of the
#'  plot [default two_colors].
#' @param save2tmp If TRUE, saves any ggplots and listings to the session
#' temporary directory (tempdir) [default FALSE].
#' @param verbose verbose= 0, silent or fatal errors; 1, begin and end; 2,
#' progress log; 3, progress and results summary; 5, full report
#' [default 2 or as specified using gl.set.verbosity].
#'
#' @details
#'The function is essentially a wrapper for glPca \code{adegenet} or pcoa \{ape\}
#'with default settings apart from those specified as parameters in this
#'function.
#'\strong{ Sources of stress in the visual representation }
#'
#' While, technically, any distance matrix can be represented in an ordinated
#' space, the representation will not typically be exact.There are three major
#' sources of stress in a reduced-representation of distances or dissimilarities
#' among entities using PCA or PCoA. By far the greatest source comes from the
#' decision to select only the top two or three axes from the ordinated set of
#' axes derived from the PCA or PCoA. The representation of the entities such a
#' heavily reduced space will not faithfully represent the distances in the
#' input distance matrix simply because of the loss of information in deeper
#' informative dimensions. For this reason, it is not sensible to be too
#' precious about managing the other two sources of stress in the visual
#' representation.
#'
#' The measure of distance between entities in a PCA is the Pearson Correlation
#' Coefficient, essentially a standardized Euclidean distance. This is both a
#' metric distance and a Euclidean distance. In PCoA, the second source of
#' stress is the choice of distance measure or dissimilarity measure. While any
#' distance or dissimilarity matrix can be represented in an ordinated space,
#' the distances between entities can be faithfully represented in that space
#' (that is, without stress) only if the distances are metric. Furthermore, for
#' distances between entities to be faithfully represented in a rigid Cartesian
#' space, the distance measure needs to be Euclidean. If this is not the case,
#' the distances between the entities in the ordinated visualized space will not
#' exactly represent the distances in the input matrix (stress will be non-zero).
#' This source of stress will be evident as negative eigenvalues in the deeper
#'  dimensions.
#'
#' A third source of stress arises from having a sparse dataset, one with
#' missing values. This affects both PCA and PCoA. If the original data matrix
#' is not fully populated, that is, if there are missing values, then even a
#' Euclidean distance matrix will not necessarily be 'positive definite'. It
#' follows that some of the eigenvalues may be negative, even though the
#' distance metric is Euclidean. This issue is exacerbated when the number of
#' loci greatly exceeds the number of individuals, as is typically the case when
#' working with SNP data. The impact of missing values can be minimized by
#' stringently filtering on Call Rate, albeit with loss of data. An alternative
#' is given in a paper 'Honey, I shrunk the sample covariance matrix' and more
#' recently by Ledoit and Wolf (2018), but their approach has not been
#'  implemented here.
#'
#' The good news is that, unless the sum of the negative eigenvalues, arising
#' from a non-Euclidean distance measure or from missing values, approaches
#' those of the final PCA or PCoA axes to be displayed, the distortion is
#' probably of no practical consequence and certainly not comparable to the
#' stress arising from selecting only two or three final dimensions out of
#' several informative dimensions for the visual representation.
#'
#'\strong{ Function's output }
#'
#' Two diagnostic plots are produced. The first is a Scree Plot, showing the
#' percentage variation explained by each of the PCA or PCoA axes, for those
#' axes that explain more than the original variables (loci) on average. That
#' is, only informative axes are displayed. The scree plot informs the number of
#' dimensions to be retained in the visual summaries. As a rule of thumb, axes
#' with more than 10% of variation explained should be included.
#'
#' The second graph shows the distribution of eigenvalues for the remaining
#' uninformative (noise) axes, including those with negative eigenvalues.
#'
#' Action is recommended (verbose >= 2) if the negative eigenvalues are
#' dominant, their sum approaching in magnitude the eigenvalues for axes
#' selected for the final visual solution.
#'
#' Output is a glPca object conforming to adegenet::glPca but with only the
#' following retained.
#'\itemize{
#'\item  $call - The call that generated the PCA/PCoA
#'\item  $eig - Eigenvalues -- All eigenvalues (positive, null, negative).
#'\item  $scores - Scores (coefficients) for each individual
#'\item  $loadings - Loadings of each SNP for each principal component
#'    }
#'
#' Plots and table were saved to the temporal directory (tempdir) and can be
#' accessed with the function \code{\link{gl.print.reports}} and listed with
#' the function \code{\link{gl.list.reports}}. Note that they can be accessed
#' only in the current R session because tempdir is cleared each time that the R
#' session is closed.
#'
#'  Examples of other themes that can be used can be consulted in \itemize{
#'  \item \url{https://ggplot2.tidyverse.org/reference/ggtheme.html} and \item
#'  \url{https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/}
#'  }
#'
#' PCA was developed by Pearson (1901) and Hotelling (1933), whilst the best
#' modern reference is Jolliffe (2002). PCoA was developed by Gower (1966) while
#' the best modern reference is Legendre & Legendre (1998).
#'
#'@return An object of class pcoa containing the eigenvalues and factor scores
#'@author Author(s): Arthur Georges. Custodian: Arthur Georges (Post to
#'\url{https://groups.google.com/d/forum/dartr})
#'@examples
#' \dontrun{
#' gl <- possums.gl
#' # PCA (using SNP genlight object)
#' pca <- gl.pcoa(possums.gl[1:50,],verbose=2)
#' gl.pcoa.plot(pca,gl)
#' 
#' gs <- testset.gs
#' levels(pop(gs))<-c(rep('Coast',5),rep('Cooper',3),rep('Coast',5),
#' rep('MDB',8),rep('Coast',6),'Em.subglobosa','Em.victoriae')
#' 
#' # PCA (using SilicoDArT genlight object)
#' pca <- gl.pcoa(gs)
#' gl.pcoa.plot(pca,gs)
#' 
#' # Collapsing pops to OTUs using Fixed Difference Analysis (using fd object)
#' fd <- gl.fixed.diff(testset.gl)
#' fd <- gl.collapse(fd)
#' pca <- gl.pcoa(fd)
#' gl.pcoa.plot(pca,fd$gl)
#' 
#' # Using a distance matrix
#' D <- gl.dist.ind(testset.gs, method='jaccard')
#' pcoa <- gl.pcoa(D,correction="cailliez")
#' gl.pcoa.plot(pcoa,gs)
#' }
#'@references
#'\itemize{
#'\item Cailliez, F. (1983) The analytical solution of the additive constant
#'problem. Psychometrika, 48, 305-308.
#'\item Gower, J. C. (1966) Some distance properties of latent root and vector
#'methods used in multivariate analysis. Biometrika, 53, 325-338.
#'\item Hotelling, H., 1933. Analysis of a complex of statistical variables into
#' Principal Components. Journal of Educational Psychology 24:417-441, 498-520.
#'\item Jolliffe, I. (2002) Principal Component Analysis. 2nd Edition, Springer,
#' New York.
#'\item Ledoit, O. and Wolf, M. (2018). Analytical nonlinear shrinkage of
#'large-dimensional covariance matrices. University of Zurich, Department of
#'Economics, Working Paper No. 264, Revised version. Available at SSRN:
#'https://ssrn.com/abstract=3047302 or http://dx.doi.org/10.2139/ssrn.3047302
#'\item Legendre, P. and Legendre, L. (1998). Numerical Ecology, Volume 24, 2nd
#'Edition. Elsevier Science, NY.
#'\item Lingoes, J. C. (1971) Some boundary conditions for a monotone analysis
#'of symmetric matrices. Psychometrika, 36, 195-203.
#'\item Pearson, K. (1901). On lines and planes of closest fit to systems of
#'points in space. Philosophical Magazine. Series 6, vol. 2, no. 11, pp.
#'559-572.
#' }
#' @seealso \code{\link{gl.pcoa.plot}}
#' @family data exploration functions
#' @importFrom ape pcoa
#' @export

gl.pcoa <- function(x,
                    nfactors = 5,
                    correction = NULL,
                    mono.rm = TRUE,
                    parallel = FALSE,
                    n.cores = 16,
                    plot.out = TRUE,
                    plot_theme = theme_dartR(),
                    plot_colors = two_colors,
                    save2tmp = FALSE,
                    verbose = NULL) {
    # SET VERBOSITY
    verbose <- gl.check.verbosity(verbose)
    
    # FLAG SCRIPT START
    funname <- match.call()[[1]]
    utils.flag.start(func = funname,
                     build = "Josh",
                     verbose = verbose)
    
    # CHECK DATATYPE
    datatype <-
        utils.check.datatype(x,
                             accept = c("SNP", "SilicoDArT", "dist", "fd"),
                             verbose = verbose)
    
    if (datatype == "SNP" | datatype == "SilicoDArT") {
        x <- gl.filter.allna(x, verbose = 0)
    }
    
    if (datatype == "fd") {
        datatype <- utils.check.datatype(x$gl, verbose = verbose)
        x <- x$gl
        x <- gl.filter.allna(x, verbose = 0)
    }
    
    # SCRIPT SPECIFIC ERROR CHECKING
    
    if (mono.rm == TRUE &
        (datatype == "SNP" | datatype == "SilicoDArT")) {
        x <- gl.filter.monomorphs(x, verbose = 0)
    }
    
    if((datatype == "SNP" | datatype == "SilicoDArT")){
        if (nInd(x) < 2) {
        stop(
            error(
                "Fatal Error: Only one individual or no individuals present in the dataset\n"
            )
        )
        }
   
        if (nLoc(x) < nInd(x)) {
        if(verbose >=2){cat(
            warn(
                "  Warning: Number of loci is less than the number of individuals to be represented\n"
            )
        )}
        }
    }
    
    if (is.null(correction)) {
        correction <- "none"
    } else {
        correction <- tolower(correction)
        if (correction != "lingoes" &&
            correction != "cailliez") {
            if (verbose >= 2) {
                cat(
                    warn(
                        "  Warning: Correction if specified needs to be lingoes or cailliez, set to the default 'None'\n"
                    )
                )
            }
            correction <- "none"
        }
    }
    
    # DO THE JOB
    
    ######## DISTANCE ANALYSIS
    
    if (datatype == "dist")
    {
        D <- x
        
        # Calculate the pcoa
        if (verbose >= 2) {
            if (correction == "none") {
                cat(
                    report(
                        "  Performing a PCoA, individuals as entities, no correction applied\n"
                    )
                )
                title <-
                    "PCoA on Distance Matrix (no correction)\nScree Plot (informative axes only)"
            } else {
                cat(
                    report(
                        "  Performing a PCoA, individuals as entities,",
                        correction,
                        "correction for stress (-ive eigenvalues) applied\n"
                    )
                )
                title <-
                    paste0(
                        "PCoA on Distance Matrix(",
                        correction,
                        ")\nScree Plot (informative axes only)"
                    )
            }
        }
        pco <-
            ape::pcoa(D, correction = correction, rn = labels(D))
        
        # Extract relevant variables
        
        if (correction == "none") {
            eig.raw <- pco$values$Eigenvalues
        } else {
            eig.raw <- pco$values$Corr_eig
        }
        
        # Identify the number of axes with explanatory value greater than the original variables on average
        eig.raw.pos <- eig.raw[eig.raw >= 0]
        eig.raw.pos.pc <- eig.raw.pos * 100 / sum(eig.raw.pos)
        eig.top <- eig.raw.pos[eig.raw.pos > mean(eig.raw.pos)]
        eig.top.pc <- round(eig.top * 100 / sum(eig.raw.pos), 1)
        eig.raw.noise <- eig.raw[eig.raw <= mean(eig.raw)]
        
        if (any(eig.raw < 0)) {
            if (verbose >= 2) {
                problem <- (-sum(eig.raw[eig.raw < 0]) / mean(eig.raw[1:3])) * 100
                cat(
                    warn(
                        "  Warning: Some eigenvalues negative -- sum to",
                        round(problem, 2),
                        "% of the mean eigenvalue for PCoA axes 1-3\n"
                    )
                )
                cat(
                    report(
                        "    Tolerable negative eigenvalues should sum to much less than the eigenvalues of displayed PCoA axes (say, less than 20%)\n"
                    )
                )
                if (problem > 20) {
                    cat(
                        report(
                            "    If the stress (negative eigenvalues) is considered a problem, and you might reasonably choose to ignore it, you have the following options:\n"
                        )
                    )
                    cat(
                        report(
                            "    (a) Apply more stringent filtering on Call Rate before generating the distance matrix; or\n"
                        )
                    )
                    cat(
                        report(
                            "    (b) Select an alternate distance measure, preferably a metric distance or better still, a Euclidean distance, if you have not already; or\n"
                        )
                    )
                    cat(
                        report(
                            "    (c) Apply a transformation (correction) to eliminate the negative eigenvalues. If this was already done, try another correction; or\n"
                        )
                    )
                    cat(
                        report(
                            "    (d) Interpret the visual representation of the ordination with caution, seeking corroborating evidence.\n"
                        )
                    )
                }
            }
        }
        
        # Provide a summary
        if (verbose >= 3) {
            if (correction == "lingoes" | correction == "cailliez") {
                cat(" Correction",
                    correction,
                    "applied to remove negative eigenvalues\n")
                cat(
                    paste(
                        "  Uncorrected ordination yielded",
                        length(eig.top),
                        "informative dimensions from",
                        nInd(x) - 1,
                        "original dimensions\n"
                    )
                )
            } else {
                cat(
                    paste(
                        "  Ordination yielded",
                        length(eig.top),
                        "informative dimensions from",
                        dim(as.matrix(D))[1],
                        "original dimensions\n"
                    )
                )
            }
            cat(paste(
                "    PCoA Axis 1 explains",
                round(eig.raw.pos.pc[1], 1),
                "% of the total variance\n"
            ))
            cat(
                paste(
                    "    PCoA Axis 1 and 2 combined explain",
                    round(eig.raw.pos.pc[1] + eig.raw.pos.pc[2], 1),
                    "% of the total variance\n"
                )
            )
            cat(paste(
                "    PCoA Axis 1-3 combined explain",
                round(
                    eig.raw.pos.pc[1] + eig.raw.pos.pc[2] + eig.raw.pos.pc[3],
                    1
                ),
                "% of the total variance\n"
            ))
        }
        # Construct a universal output file
        p.object <- list()
        p.object$scores <- pco$vectors[, 1:nfactors]
        p.object$eig <- pco$values$Eigenvalues
        p.object$loadings <- pco$vectors.cor[, 1:nfactors]
        p.object$call <- match.call()
        
    }  ######## END DISTANCE DATA
    
    ######## SNP or P/A DATA, PCA
    
    if (datatype == "SNP" || datatype == "SilicoDArT")
    {
        if (datatype == "SNP") {
            if (verbose >= 2) {cat(
                    report(
                        "  Performing a PCA, individuals as entities, loci as attributes, SNP genotype as state\n"
                    )
                )}
                title <-
                    "PCA on SNP Genotypes\nScree Plot (informative axes only)"
            }
            if (datatype == "SilicoDArT") {
                if (verbose >= 2) {cat(
                    report(
                        "  Performing a PCA, individuals as entities, loci as attributes, Tag P/A as state\n"
                    )
                )}
                title <-
                    "PCA on Tag P/A Data\nScree Plot (informative axes only)"
            }
            
        pca <-
            glPca(x,
                  nf = nfactors,
                  parallel = parallel,
                  n.cores = n.cores)
        
        # Identify the number of axes with explanatory value greater than the original variables on average
        eig.raw <- pca$eig
        
        eig.raw.pos <- eig.raw[eig.raw >= 0]
        eig.raw.pos.pc <- eig.raw.pos * 100 / sum(eig.raw.pos)
        eig.top <- eig.raw.pos[eig.raw.pos >= mean(eig.raw.pos)]
        eig.top.pc <- round(eig.top * 100 / sum(eig.raw.pos), 1)
        eig.raw.noise <- eig.raw[eig.raw <= mean(eig.raw)]
        
        if (any(eig.raw < 0)) {
            if (verbose >= 2) {
                problem <- (-sum(eig.raw[eig.raw < 0]) / mean(eig.raw[1:3])) * 100
                cat(
                    warn(
                        "  Warning: Some eigenvalues negative -- sum to",
                        round(problem, 2),
                        "% of the mean eigenvalue for PCA axes 1-3\n"
                    )
                )
                cat(
                    report(
                        "    Tolerable negative eigenvalues should sum to much less than the eigenvalues of displayed PCA axes (say, less than 20%)\n"
                    )
                )
                if (problem > 20) {
                    cat(
                        report(
                            "    If the stress (negative eigenvalues) is considered a problem, and you might reasonably choose to ignore it, you have the following options:\n"
                        )
                    )
                    cat(
                        report(
                            "    (a) Apply more stringent filtering on Call Rate and repeat the PCA; or\n"
                        )
                    )
                    cat(
                        report(
                            "    (b) Undertake a PCoA with an appropriate distance measure and a transformation (correction) to eliminate the negative eigenvalues; or\n"
                        )
                    )
                    cat(
                        report(
                            "    (c) Interperate the visual representation of the ordination with caution, seeking corroborating evidence.\n"
                        )
                    )
                }
            }
        }
        
        e <- pca$eig[pca$eig > sum(pca$eig / length(pca$eig))]
        e <- round(e * 100 / sum(pca$eig), 1)
        if (verbose >= 3) {
            cat(
                paste(
                    "  Ordination yielded",
                    length(e),
                    "informative dimensions from",
                    nInd(x) - 1,
                    "original dimensions\n"
                )
            )
            cat(paste("    PCA Axis 1 explains", e[1], "% of the total variance\n"))
            cat(paste(
                "    PCA Axis 1 and 2 combined explain",
                e[1] + e[2],
                "% of the total variance\n"
            ))
            cat(
                paste(
                    "    PCA Axis 1-3 combined explain",
                    e[1] + e[2] + e[3],
                    "% of the total variance\n"
                )
            )
        }
        # Construct a universal output file
        p.object <- list()
        p.object$scores <- pca$scores
        p.object$eig <- pca$eig
        p.object$loadings <- pca$loadings
        p.object$call <- match.call()
        
    }  #### END SNP ANALYSIS
    
    # PLOT THE DIAGNOSTICS
    
    # Plot Scree plot avoid no visible binding probl
    eigenvalue <- percent <- NULL
    
    df <-
        data.frame(eigenvalue = seq(1:length(eig.top.pc)), percent = eig.top.pc)
    if (datatype == "SNP") {
        xlab <- paste("PCA Axis")
    } else {
        xlab <- paste("PCoA Axis")
    }
    ylab <- paste("Percentage Contribution")
    
    p1 <-
        ggplot(df, aes(x = eigenvalue, y = percent)) + 
      geom_line(color = plot_colors[2]) + 
      geom_point(color = plot_colors[1], size = 4) +
        geom_hline(yintercept = 10, color = "blue") + 
      plot_theme + xlab(xlab) + 
      ylab(ylab) + 
      ggtitle(title)
    
    if (any(eig.raw < 0)) {
        main <- "Noise Axes -- Warning: some eigenvalues < 0"
    } else {
        main <- "Noise Axes -- all eigenvalues positive"
    }
    
    p2 <-
        ggplot(as.data.frame(eig.raw.noise), aes(x = eig.raw.noise)) + 
      geom_histogram(bins = 50, color = plot_colors[1], fill = plot_colors[2]) +
        geom_vline(xintercept = 0, color = "blue") +
      plot_theme + 
      xlab("Eigenvalue") + 
      ylab("Count") + 
      ggtitle(main)
    
    # printing outputs
    p3 <- (p1 / p2)
    if (verbose >= 1) {
        if(plot.out){print(p3)}
    }
    
    # SAVE INTERMEDIATES TO TEMPDIR
    if (save2tmp) {
        # creating temp file names
        temp_plot <- tempfile(pattern = "Plot_")
        match_call <-
            paste0(names(match.call()),
                   "_",
                   as.character(match.call()),
                   collapse = "_")
        # saving to tempdir
        saveRDS(list(match_call, p3), file = temp_plot)
        if (verbose >= 2) {
            cat(report("  Saving ggplot(s) to the session tempfile\n"))
            cat(
                report(
                    "  NOTE: Retrieve output files from tempdir using gl.list.reports() and gl.print.reports()\n"
                )
            )
        }
    }
    
    # FLAG SCRIPT END
    
    if (verbose > 0) {
        cat(report("Completed:", funname, "\n"))
    }
    
    class(p.object) <- "glPca"
    invisible(p.object)
}
Any scripts or data that you put into this service are public.
dartR documentation built on April 4, 2025, 12:23 a.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
dartR
Importing and Analysing 'SNP' and 'Silicodart' Data Generated by Genome-Wide Restriction Fragment Analysis

R/gl.pcoa.r
In dartR: Importing and Analysing 'SNP' and 'Silicodart' Data Generated by Genome-Wide Restriction Fragment Analysis

Defines functions gl.pcoa

Documented in gl.pcoa

Try the dartR package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

dartR Importing and Analysing 'SNP' and 'Silicodart' Data Generated by Genome-Wide Restriction Fragment Analysis

R/gl.pcoa.r In dartR: Importing and Analysing 'SNP' and 'Silicodart' Data Generated by Genome-Wide Restriction Fragment Analysis

Defines functions gl.pcoa

Documented in gl.pcoa

Try the dartR package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

dartR
Importing and Analysing 'SNP' and 'Silicodart' Data Generated by Genome-Wide Restriction Fragment Analysis

R/gl.pcoa.r
In dartR: Importing and Analysing 'SNP' and 'Silicodart' Data Generated by Genome-Wide Restriction Fragment Analysis