R/smart_permdisp.R
In smartsnp: Fast Multivariate Analyses of Big Genomic Data

Documented in smart_permdisp

# smartsnp v.1
# Coding start date = 07/07/2020
# smartsnp::smart_permdisp by Salvador Herrando-Perez (salherra@gmail.com) and Ray Tobler (tingalingx@gmail.com)
#
# Permutational Multivariate Analysis of dispersion (PERMDISP) of genotype data
# Computes global and pairwise MANOVA tests for differences in GROUP DISPERSION for multiple types of proximity measures between samples
# Alpha values estimated by permutations
# Implements scaling by centering, z-scored and SMARTPCA controlling for genetic drift
# Original description of PERMDISP at https://doi.org/10.1111/j.1461-0248.2006.00926.x / Anderson, Ellingsen & McArdle (2006) Multivariate dispersion as a measure of beta diversity. Ecology Letters, 9, 683-693

#' @name smart_permdisp
#' @title Smart Permutational Multivariate Analysis of Dispersion
#'
#' @description Computes Permutational Multivariate Analysis of Dispersion (PERMDISP) in group dispersion using multivariate data.
#' Variance partitioning computed on a sample-by-sample triangular matrix obtained from variable-by-sample data following Anderson (2006).
#' Calculates a range of inter-sample distances, similarities and dissimilarities. Includes control for genetic drift for bi-allelic genetic markers including single nucleotide polymorphisms (SNP) following Patterson, Price and Reich (2006) that can be combined with SMART Principal Component Analysis (PCA).
#' Optimized to run fast matrix building and permutations for big datasets in ecological, evolutionary and genomic research.\cr
#'
#' @details PERMDISP is a form of \code{homoscedasticity} test analogous to univariate Levene's (1960) and, more closely, Brown & Forsythe's (1974) tests. Applies PERMANOVA test (Anderson 2001) for differences in Euclidean dispersion among groups (Anderson 2006).
#' Proximity between samples can be any type of distance, similarity or dissimilarity.
#' Group dispersion estimated relative to group centroids (central point) or to spatial (geometric) medians (point minimizing distance to group samples) in Principal Coordinate Analysis (\code{PCoA}, Gower 1966) space.
#' Acronym PERMDISP originates from Marti Anderson's FORTRAN program (Anderson 2004).
#'
#' Control for unequal number of samples among groups optionally done by weighting sample distance to group spatial \emph{median} or  \emph{centroid} by \emph{sqrt(n/(n-1))} (O'Neill & Mathews 2000), and the null hypothesis then being tested changes from \emph{d_1 = d_2  = ... = d_t} (balanced design) to  \emph{((n_1-1)/n_1) x d_1 = ((n_2-1)/n_2) x d_2  = ... =  ((n_t-1)/n_t) x d_t} (unbalanced design) where \emph{d} represents dispersion of groups \emph{1} to \emph{t}, and \emph{n} represents number of samples per group.
#' To attribute group differences to location (position of sample groups) and/or dispersion (spread of sample groups), PERMDISP must be combined with PERMANOVA as implemented through \code{smart_permanova}.\cr
#'
#' \code{smart_permdisp} uses \code{\link[vegan]{betadisper}} to estimate an ANOVA \emph{F} statistic and group dispersions using formula \code{snp_eucli ~ sample_group}, where \code{snp_eucli} is the sample-by-sample triangular matrix in \code{PCoA} space.
#' If >2 sample groups tested, \code{pairwise = TRUE} allows pairwise testing and correction for multiple testing by \code{holm (Holm)} [default], \code{hochberg (Hochberg)}, \code{hommel (Hommel)}, \code{bonferroni (Bonferroni)}, \code{BY (Benjamini-Yekuieli)}, \code{BH (Benjamini-Hochberg)} or \code{fdr (False Discovery Rate)}.\cr
#'
#' For big data, \code{\link[Rfast]{Dist}} builds sample-by-sample triangular matrix much faster than \code{\link[vegan]{vegdist}}. \code{\link[Rfast]{Dist}} computes proximities \code{euclidean}, \code{manhattan}, \code{canberra1}, \code{canberra2}, \code{minimum}, \code{maximum}, \code{minkowski}, \code{bhattacharyya}, \code{hellinger}, \code{kullback_leibler} and \code{jensen_shannon}. \code{\link[vegan]{vegdist}} computes \code{manhattan}, \code{euclidean}, \code{canberra}, \code{clark}, \code{bray}, \code{kulczynski}, \code{jaccard}, \code{gower}, \code{altGower}, \code{morisita}, \code{horn}, \code{mountford}, \code{raup}, \code{binomial}, \code{chao}, \code{cao} and \code{mahalanobis}.
#' Euclidean distance required for SMARTPCA scaling.\cr
#'
#' \code{sample_remove} should include both samples removed from PCA and ancient samples projected onto PCA space (if any).\cr
#'
#' Data read from working directory with SNPs as rows and samples as columns. Two alternative formats: (1) text file of SNPs by samples (file extension and column separators recognized automatically) read using \code{\link[data.table]{fread}}; or (2) duet of \code{EIGENSTRAT} files (see \url{https://reich.hms.harvard.edu/software}) using \code{\link[vroom]{vroom_fwf}}, including a genotype file of SNPs by samples \code{*.geno}, and a sample file (\code{*.ind}) containing three vectors assigning individual samples to unique user-predefined groups (populations), sexes (or other user-defined descriptor) and alphanumeric identifiers. For \code{EIGENSTRAT}, vector \code{sample_group} assigns samples to groups retrievable from column 3 of file \code{*.ind}.
#' SNPs with zero variance removed prior to SVD to optimize computation time and avoid undefined values if \code{scaling = "sd"} or \code{"drift"}.\cr
#'
#' Users can select subsets of samples or SNPs by introducing a vector including column numbers for samples (\code{sample_remove}) and/or row numbers for SNPs (\code{snp_remove}) to be removed from computations.
#' Function stops if the final number of SNPs is 1 or 2.
#' \code{EIGENSOFT} was conceived for the analysis of human genes and its \code{SMARTPCA} suite so accepts 22 (autosomal) chromosomes by default.
#' If >22 chromosomes are provided and the internal parameter \code{numchrom} is not set to the target number chromosomes of interest, SMARTPCA automatically subsets chromosomes 1 to 22.
#' In contrast, \code{smart_permdisp} accepts any number of autosomes with or without the sex chromosomes from an \code{EIGENSTRAT} file.\cr
#'
#' @param snp_data File name read from working directory. SNP = rows, samples = columns without row names or column headings.
#' SNP values must be count data (no decimals allowed).
#' File extension detected automatically whether text or \code{EIGENSTRAT}.
#' See details.
#' @param packed_data Logical value for \code{EIGENSTRAT}, irrelevant for text data.
#' Default \code{packed_data = FALSE} assumes uncompressed \code{EIGENSTRAT}.
#' \code{packed_data = TRUE} for compressed or binary \code{EIGENSTRAT} (\code{PACKENDANCESTRYMAP}).
#' @param sample_group Character or numeric vector assigning samples to groups. Coerced to factor.
#' @param sample_remove Logical \code{FALSE} or numeric vector indicating column numbers (samples) to be removed from computations.
#' Default \code{sample_remove =  FALSE} keeps all samples.
#' @param snp_remove Logical \code{FALSE} or numeric vector indicating row numbers (SNPs) to be removed from computations.
#' Default \code{snp_remove =  FALSE} keeps all SNPs.
#' See details.
#' @param missing_value Number \code{9} or string \code{NA} indicating missing value.
#' Default \code{missing_value = 9} as in \code{EIGENSTRAT}.
#' If no missing values present, no effect on computation.
#' @param missing_impute String handling missing values.
#' Default \code{missing_impute = "mean"} replaces missing values of each SNP by mean of non-missing values across samples.
#' \code{missing_impute = "remove"} removes SNPs with at least one missing value.
#' If no missing values present, no effect on computation.
#' @param scaling String. Default \code{scaling = "drift"} scales SNPs to control for expected allele frequency dispersion caused by genetic drift (SMARTPCA). \code{scaling = "center"} for \code{centering} (covariance-based PCA). \code{scaling = "sd"} for \code{centered} SNPs divided by standard deviation (correlation-based PCA).
#' \code{scaling = "none"} for no scaling.
#' See details.
#' @param sample_distance Type of inter-sample proximity computed (distance, similarity, dissimilarity).
#' Default is \code{Euclidean distance}.
#' See details.
#' @param program_distance A string value indicating R package to estimate proximities between pairs of samples.
#' Default \code{program_distance = "Rfast"} uses function \code{\link[Rfast]{Dist}}; \code{program_distance = "vegan"} uses \code{\link[vegan]{vegdist}}.
#' See details.
#' @param target_space String.
#' Default \code{target_space = "multidimensional"} applies PERMANOVA to sample-by-sample triangular matrix computed from variable-by-sample data, \code{pc_axes} has no effect on computation.
#' \code{target_space = "pca"} applies PERMANOVA to sample-by-sample data in PCA space, \code{pc_axes} determines number of PCA axes for testing.
#' @param pc_axes Number of PCA axes computed always starting with PCA axis 1.
#' Default \code{pc_axes = 2} computes PCA axes 1 and 2 if \code{target_space = "pca"}.
#' No effect on computation if \code{target_space = "multidimensional"}.
#' @param pairwise Logical.
#' Default \code{pairwise = FALSE} computes global test.
#' \code{pairwise = TRUE} computes global and pairwise tests.
#' @param pairwise_method String specifying type of correction for multiple testing.
#' Default \code{"holm"}.
#' See details.
#' @param permutation_n Number of permutations resulting in PERMDISP test \emph{p value}.
#' Default \code{9999}.
#' @param permutation_seed Number fixing random generator of permutations.
#' Default \code{1}.
#' @param dispersion_type String indicating quantification of group dispersion whether relative to spatial \code{"median"} or \code{"centroid"}.
#' Default \code{"median"}. See details.
#' @param samplesize_bias Logical.
#' \code{samplesize_bias = TRUE} for dispersion weighted by number of samples per group.
#' Default \code{samplesize_bias = FALSE} for no weighting.
#' See details.
#'
#' @return Returns a list containing the following elements:
#' \itemize{
#'   \item \code{permdisp.samples}: Dataframe showing sample summary. Column \emph{Group} assigns samples to tested groups. Column \emph{Class} specifies if samples were used in, or removed from, testing (PERMDISP). Column \emph{Sample_dispersion} shows dispersion of individual samples relative to the spatial \code{"median"} or \code{"centroid"}.
#'   \item \code{permdisp.bias}: String indicating if PERMDISP dispersions were corrected for the number of samples per group.
#'   \item \code{permdisp.group_location}: Dataframe showing coordinates of spatial \code{"median"} or \code{"centroid"} per group.
#'   \item \code{permdisp.global_test}: List showing table with degrees of freedom, sum of squares, mean sum of squares, \emph{F} statistic, and \emph{p} value.
#'   \item \code{permdisp.pairwise_test}: List showing table with \emph{F} statistic, \emph{p} value, and corrected \emph{p} value per pair of groups. Obtained only if \code{pairwise = TRUE}.
#'   \item \code{permdisp.pairwise_correction}: String indicating type of correction for multiple testing.
#'   \item \code{permdisp.permutation_number}: Number of permutations applied to obtain the distribution of the \emph{F} statistic.
#'   \item \code{permdisp.permutation_seed}: Number fixing the random generator of permutations for reproducibility of results.
#' }
#'
#' @examples
#' # Path to example genotype matrix "dataSNP"
#' pathToGenoFile = system.file("extdata", "dataSNP", package = "smartsnp")
#'
#' # Assign 50 samples to each of two groups and colours
#' my_groups <- as.factor(c(rep("A", 50), rep("B", 50))); cols = c("red", "blue")
#'
#' # Run PERMDISP
#' permdispR <- smart_permdisp(snp_data = pathToGenoFile, sample_group = my_groups)
#'
#' # Extract summary table assigning samples to groups and dispersion of individual samples
#' permdispR$permdisp.samples
#'
#' # Extract PERMDISP table
#' permdispR$permdisp.global_test
#'
#' # Plot sample distances to group central medians
#' #run pca with truncated SVD (PCA 1 x PCA 2)
#' pcaR1 <- smart_pca(snp_data = pathToGenoFile, sample_group = my_groups)
#' #compute Euclidean inter-sample distances in PCA space (triangular matrix)
#' snp_eucli <- vegan::vegdist(pcaR1$pca.sample_coordinates[,c("PC1","PC2")], method = "euclidean")
#' #calculate spatial medians
#' disMed <- vegan::betadisper(d = snp_eucli, group = my_groups); disMed
#' #plot
#' oldpar <- par(mar = c(4, 4, 5, 0.1), lwd = 2)
#' boxplot(disMed, las =2, cex.axis = 2, cex.main = 1.5, horizontal = TRUE, varwidth = TRUE,
#'    col = cols, xlab = "", ylab = "", main = "Sample distance to group spatial medians")
#' par(oldpar)
#'
#' @references
#' Anderson, M. J. (2001) A new method for non-parametric multivariate analysis of variance. Austral Ecology, 26, 32-46.\cr
#' Anderson, M. J. (2004). PERMANOVA_2factor: a FORTRAN computer program for permutational multivariate analysis of variance (for any two-factor ANOVA design) using permutation tests  (Department of Statistics, University of Auckland, New Zealand).\cr
#' Anderson, M. J. & D. C. I. Walsh (2013) PERMANOVA, ANOSIM, and the Mantel test in the face of heterogeneous dispersions: What null hypothesis are you testing? Ecological Monographs, 83, 557-574.\cr
#' Gower, J. C. (1966) Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53, 325-338.\cr
#' McArdle, B. H. & M. J. Anderson (2001) Fitting multivariate models to community data: a comment on distance-based redundancy analysis. Ecology, 82, 290-297.\cr
#' Patterson, N., A. L. Price and D. Reich (2006) Population structure and eigenanalysis. PLoS Genetics, 2, e190.\cr
#' Warton, D. I., S. T. Wright and Y. Wang (2012) Distance-based multivariate analyses confound location and dispersion effects. Methods in Ecology and Evolution, 3, 89-101.
#'
#' @seealso \code{\link[vegan]{adonis2}} (package \bold{vegan}),
#' \code{\link[Rfast]{Dist}} (package \bold{Rfast}),
#' \code{\link[data.table]{fread}} (package \bold{data.table}),
#' \code{\link[vegan]{vegdist}} (package \bold{vegan}),
#' \code{\link[vroom]{vroom_fwf}} (package \bold{vroom})
#'
#' @importFrom data.table :=
#' @importFrom data.table .N
#' @importFrom foreach %do%
#' @export
utils::globalVariables(c("i"))
smart_permdisp <- function(snp_data, packed_data = FALSE,
                           sample_group, sample_remove = FALSE, snp_remove = FALSE,
                           missing_value = 9, missing_impute = "mean",
                           scaling = "drift",
                           sample_distance = "euclidean", program_distance = "Rfast",
                           target_space = "multidimensional", pc_axes = 2,
                           pairwise = FALSE, pairwise_method = "holm",
                           permutation_n = 9999, permutation_seed = 1,
                           dispersion_type = "median", samplesize_bias = FALSE) {

  ##--------------------------------------------------------------##
  # 1. Timing analytical steps
  ##--------------------------------------------------------------##

  # Overall starting time
  startT <- Sys.time()

  # cats cumulative run time at specific points
  get.time <- function(t) { # time conversion
    tt <- as.numeric(difftime(Sys.time(), t, units = "secs"))
    H <- tt %/% 3600
    rr <- tt %% 3600
    if(rr > 0) {
      M <- rr %/% 60
      rr2 <- rr %% 60
      if(rr2 > 0) {
        S <- round(rr2)
      }else{
        S <- 0
      }
    } else {
      M <- 0
      S <- 0
    }
    return(paste0(H, "h ", M, "m ", S, "s"))
  }

  ##--------------------------------------------------------------##
  # 2. Load data and filter samples and SNPs
  ##--------------------------------------------------------------##

  message("Checking argument options selected...")
  # Check options: convert to logical if needed
  if (!is.numeric(sample_remove)) sample_remove <- as.logical(sample_remove)
  # Print error and abort analysis
  if (missing_impute != "remove" & missing_impute != "mean") { # if users misspel missing_impute
    stop("Check spelling: missing_impute must be 'remove' or 'mean'")
  }
  if (missing_value != 9 & !is.na(missing_value)) { # if users assign missing values not equal to 9 or NA
    stop("Missing values must be coded as 9 or NA")
  }
  if (scaling != "none" & scaling != "center" & scaling != "sd" & scaling != "drift") { # # if users misspel scaling
    stop("Check spelling: scaling must be 'none', 'center', 'sd' or 'drift'")
  }
  if (program_distance != "Rfast" & program_distance != "vegan") { # if users misspel program_distance
    stop("Check spelling: program_distance must be 'Rfast' or 'vegan'")
  }
  if (target_space != "multidimensional" & target_space != "pca") { # if users misspel target_space
    stop("Check spelling: target_space must be 'multidimensional' or 'pca'")
  }
  message("Argument options are correct...")

  message("Loading data...")

  # Label samples
  samp_dat <- data.table::data.table(sample_group) # tabulate group labels
  data.table::setnames(samp_dat, "Group") # name column vector of group labels
  samp_dat[, I:= 1:.N] # cbind column vectors with samples listed from 1 to length(sample_group)
  sampN.full <- nrow(samp_dat) # total number of samples in dataset

  # Define samples used for PERMDISP
  sample_PCA <- 1:sampN.full # assume all samples for PERMDISP as default
  if (!isFALSE(sample_remove)) {
    sample_PCA <- setdiff(sample_PCA, sample_remove) # remove excluded samples
  }
  # Recalculate number of samples for PERMDISP
  sampN <- length(sample_PCA)

  # Avoid error if users unfamiliar with data formats set packed_data = FALSE but their data is binary (packedancestrymap format) following https://stackoverflow.com/questions/16350164/native-method-in-r-to-test-if-file-is-ascii
  is.binary <- function(filepath, max = 1000) {
    f = file(filepath, "rb", raw = TRUE)
    b = readBin(f, "int", max, size = 1, signed = FALSE)
    close(f)
    return(max(b) > 128)
  }
  if (is.binary(snp_data) == TRUE) {
    packed_data = TRUE
    message("Data is binary (packedancestrymap)...")
  } else {
    packed_data = FALSE
  }

  # Read in data (columns = SNPs, rows = samples)
  if(length(grep("\\.geno$", snp_data)) == 1) { # eigenstrat format
    if(isFALSE(packed_data)) { # decompressed format
      snp_dat <- vroom::vroom_fwf(file = snp_data, # fast-loading data as tibble
                                  col_positions = vroom::fwf_widths(rep(1, sampN.full), col_names = NULL),
                                  col_types = paste(rep("i", sampN.full), collapse=""))
      snpN.full <- nrow(snp_dat) # number of SNP
      message(paste("Imported", snpN.full, "SNP by", sampN.full, "sample eigenstrat genotype matrix (decompressed or unpacked format)"))
      message(paste0("Time elapsed: ", get.time(startT)))
      message("Filtering data...")
      #convert to matrix
      snp_dat1 <- do.call(cbind, snp_dat[, sample_PCA])
    } else { # compressed binary format
      ss <- strsplit(snp_data, "\\.")[[1]]
      snp_data <- paste(ss[-length(ss)], collapse=".") # remove suffix
      snp_dat <- read_packedancestrymap(pref = snp_data)$geno
      attr(snp_dat, "dimnames") <- NULL # remove row and column name attributes
      snpN.full <- nrow(snp_dat) # number of SNP
      message(paste("Imported", snpN.full, "SNP by", sampN.full, "sample packed eigenstrat genotype matrix (compressed or packed format)"))
      message(paste0("Time elapsed: ", get.time(startT)))
      message("Filtering data...")
      snp_dat1 <- snp_dat[, sample_PCA]
      missing_value <- NA # reset missing value
    }
  } else { # generic input type (columns = samples, rows = SNPs)
    con <- file(snp_data,"r"); first_line <- readLines(con,n=1); close(con) # Read first line
    plink_traw_format_flag <- FALSE
    if (substr(first_line, 1, 8) == "CHR\tSNP\t") plink_traw_format_flag <- TRUE # Check for PLINK "traw" header line
    snp_dat <- data.table::fread(file = snp_data, header = plink_traw_format_flag)
    if (plink_traw_format_flag) snp_dat[, c("CHR","SNP", "(C)M", "POS", "COUNTED", "ALT"):=NULL] # If PLINK "traw" format, then remove non-genotype columns
    snpN.full <- nrow(snp_dat) # number of SNP
    message(paste("Imported", snpN.full, "SNP by", sampN.full, "sample genotype matrix"))
    message(paste0("Time elapsed: ", get.time(startT)))
    message("Filtering data...")
    snp_dat1 <- do.call(cbind, snp_dat[, sample_PCA, with = FALSE])
  }

  # Print error if users enter number of sample labels (sample_group) larger or smaller than number of samples in dataset (snp_dat)
  if (length(sample_group) != ncol(snp_dat)) {
    stop("length(sample_group) should be equal to number of samples in dataset: computation aborted")
  }

  # Remove original dataset from memory
  rm(snp_dat); gc()

  # Filter SNPS
  if (!isFALSE(snp_remove)) {
    snp.keep <- setdiff(1:snpN.full, snp_remove)
    snpN.full <- length(snp.keep) # update SNP count
    snp_dat1 <- snp_dat1[snp.keep, ] # subset SNPs by row number across modern samples
  }

  if (snpN.full < 3) {
    stop("Less than 3 SNPs remaining: computation aborted")
  }

  message(paste(snpN.full, "SNPs included in PERMDISP computation"))
  message(paste(length(snp_remove), "SNPs omitted from PCA computation"))

  # Print number of samples used in PERMDISP
  message(paste(length(sample_PCA), "samples included in PERMDISP computation"))
  if (!isFALSE(sample_remove)) {
    message(paste(length(sample_remove), "samples ommitted from PERMDISP computation"))
  }
  message("Completed data filtering")
  message(paste0("Time elapsed: ", get.time(startT)))

  ##--------------------------------------------------------------##
  # 3. Remove invariant SNPs
  ##--------------------------------------------------------------##

  message("Scanning for invariant SNPs...")

  # Identify invariant SNPs allowing for missing data
  if (is.na(missing_value)) { # missing values are NAs
    genoMean <- rowMeans(snp_dat1, na.rm = TRUE) # compute SNP means
    genoVar <- Rfast::rowVars(snp_dat1, na.rm = TRUE) # compute SNP variances
  } else { # missing values are numeric
    genoTab <- Rfast::rowTabulate(snp_dat1) # count frequency of values per SNP
    genoTab <- cbind(sampN - rowSums(genoTab), genoTab) # include 0's
    GTR <- setdiff(1:ncol(genoTab) - 1, missing_value)
    sumN <- rowSums(genoTab[, GTR+1])
    sumX <- foreach::foreach(i = GTR, .combine = "+") %do% {
      i * genoTab[, i+1]
    }
    sumX2 <- foreach::foreach(i = GTR, .combine = "+") %do% {
      i^2 * genoTab[, i+1]
    }
    genoMean <- sumX / sumN
    genoVar <- ((sumX2 / sumN) - genoMean ^ 2) * (sumN / (sumN - 1))
    rm(sumN, sumX, sumX2)
  }
  keepSNPs <-  which(genoVar != 0) # row number for SNPs with > 0 variance #SHP

  # Remove invariant SNPs
  if (length(keepSNPs > 0)) {
    snp_dat1 <- snp_dat1[keepSNPs, ] # remove invariant SNPs from modern samples
    genoMean <- genoMean[keepSNPs] # compute SNP means
    genoVar <- genoVar[keepSNPs] # compute SNP variances
    # if (!isFALSE(sample_project)) {
    #   snp_dat2 <- snp_dat2[keepSNPs, ] # remove invariant SNPs from ancient samples
    # }
  }
  rm(keepSNPs) # remove vector with indices for variant SNPs
  snpN <- nrow(snp_dat1) # number of SNPs with > 0 variance

  if (snpN == snpN.full) {
    message("Scan complete: no invariant SNPs found")
  } else {
    message(paste("Scan complete: removed", snpN.full - snpN, "invariant SNPs"))
  }
  message(paste0("Time elapsed: ", get.time(startT)))

  ##--------------------------------------------------------------##
  # 4. Deal with missing values (SNP removal or imputation)
  ##--------------------------------------------------------------##

  message("Checking for missing values...")

  # Index cells with missing value (counting sequence = top to bottom then left to right sequence)
  if (is.na(missing_value)) { # missing_value = NA
    missI <- which(is.na(snp_dat1))
  } else { # missing_value = numeric
    missI <- which(snp_dat1 == missing_value)
  }

  # Impute SNP with missing values with mean genotype
  if (length(missI) > 0) {
    # Determine SNPs with missing values
    Z <- missI %% snpN # label SNPs (rows) with missing values with 0
    Z[Z == 0] <- snpN
    snpMissI <- sort(unique(Z)) # row index for SNPs with missing values
    snpMissN <- length(snpMissI) # total SNPs with missing values
    message(paste(snpMissN, "SNPs contain missing values"))
    # Mean imputation
    if (missing_impute == "mean") {
      message("Imputing mean for SNPs with missing values...")
      snp_dat1[missI] <- genoMean[Z]
      message(paste("Imputation completed:", length(missI), "missing values imputed"))
    }
    # Removal of SNPs with missing data
    if (missing_impute == "remove") {
      message("Removing SNPs with missing data...")
      snp_dat1 <- snp_dat1[-snpMissI, ] #SNP
      genoMean <- genoMean[-snpMissI] # update genotype means
      genoVar <- genoVar[-snpMissI] # update genotype variance
      snpN <- nrow(snp_dat1)
      message(paste("Removal completed:", snpMissN, "SNPs removed"))
      message(paste(snpN, "SNPs remaining"))
    }
    rm(missI, Z)
  } else {
    message("Scan completed: no missing values found")
    rm(missI) # remove vector with cell number when missing value present
  }
  message(paste0("Time elapsed: ", get.time(startT)))

  ##--------------------------------------------------------------##
  # 5. Scale values by SNP
  ##--------------------------------------------------------------##

  if(scaling != "none") { # only run if standardization is requested
    message("Scaling values by SNP...")

    if (scaling == "drift") { # standardization controlling for genetic drift following Formula (3) by Patterson et al. 2006 (doi: 10.1371/journal.pgen.0020190)
      message("Centering and scaling by drift dispersion...")
      snp_drift <- sqrt((genoMean / 2) * (1 - genoMean/2))
      snp_dat1 <- (snp_dat1 - genoMean) / snp_drift
    }

    if (scaling == "sd") { # z-score standardization whereby all SNPs have mean = 0 and variance = 1
      message("Centering and standardizing (z-scores)...")
      if (missing_impute == "mean") {
        snp_sd <- Rfast::rowVars(snp_dat1, std = TRUE) # update variances to account for imputation
      } else { # calculate sd on precomputed variance after removing invariant snps
        snp_sd <- sqrt(genoVar)
      }
      snp_dat1 <- (snp_dat1 - genoMean) / snp_sd
    }

    if (scaling == "center") { # no standardization: this option runs a conventional PCA on raw (centered) data
      message("Centering...")
      snp_dat1 <- snp_dat1 - genoMean
    }
    message("Completed scaling")
    message(paste0("Time elapsed: ", get.time(startT)))
  } else {
    message("Note: SNP-based scaling not used")
  }

  ##--------------------------------------------------------------##
  # 6. Construct triangular inter-sample distance matrix
  ##--------------------------------------------------------------##

  # Compute sample by sample matrix of proximities using Rfast::Dist (fast) or vegan::vgdist (slow)
  if (target_space == "multidimensional") { # proximity matrix computed from raw data (SNP x sample)
    message("Construct triangular matrix of sample by sample proximities in multidimensional space...")
    if (program_distance == "Rfast") { # Rfast
      snp_eucli <- Rfast::Dist(t(snp_dat1), method = sample_distance) # default: method = "euclidean" (= Euclidean distance)
    }
    if (program_distance == "vegan") { # vegan
      snp_eucli <- vegan::vegdist(t(snp_dat1), method = sample_distance) # default: method = "euclidean" (= Euclidean distance)
    }
  }
  if (target_space == "pca") { # proximity matrix computed from PCA (principal components x sample)
    message("Construct triangular matrix of sample by sample proximities in PCA space...")
    snp_pca <- RSpectra::svds(t(snp_dat1), k = pc_axes) # single value decomposition for k axes
    snp_pc <- data.frame(snp_pca$u %*% diag(snp_pca$d)) # one column per PCA axis, one row per sample
    colnames(snp_pc) <- paste0("PC", c(1:ncol(snp_pc))) # rename columns with PC + number
    if (program_distance == "Rfast") { # Rfast
      snp_eucli <- Rfast::Dist(snp_pc, method = "euclidean") # default: method = "euclidean" (= Euclidean distance)
    }
    if (program_distance == "vegan") { # vegan
      snp_eucli <- vegan::vegdist(snp_pc, method = "euclidean") # default: method = "euclidean" (= Euclidean distance)
    }
  }

  message("Completed construction of triangular matrix of sample by sample proximities...")

  ##--------------------------------------------------------------##
  # 7. Compute inter- and intra-group dispersion by PERMDISP
  ##--------------------------------------------------------------##

  message(paste0("Compute inter- and intra-group dispersion by PERMDISP: global test..."))

  # Create group variable
  group <- factor(as.matrix(sample_group[sample_PCA]))

  # Set seed of random generator
  set.seed(permutation_seed)

  # Compute PERMDISP (global test)
  dispCent <- vegan::betadisper(d = stats::as.dist(snp_eucli), group = group, type = dispersion_type, bias.adjust = samplesize_bias) # estimate group central points (median or centroid)
  globalTable.disp <- vegan::permutest(dispCent, pairwise = pairwise, permutations = permutation_n)[[1]][, -5] # run test and extract to ANOVA table and # remove 5th column with number of permutations

  # Extract group spatial median (or centroid) and sample distances to spatial median (or dentroid)
  dispGroup <- dispCent$centroids

  # Extract sample distances to spatial median (or centroid)
  dispSample <- dispCent$distances

  if (dispersion_type == "median") {
    message("Completed PERMDISP: global test based on group spatial medians")
  } else {
    message("Completed PERMDISP: global test based on group centroids")
  }
  message(paste0("Time elapsed: ", get.time(startT)))

  # Compute PERMDISP (pairwise tests) / if data contains two groups, pairwise test should mirror global test
  if (pairwise == TRUE & length(levels(group)) > 2) {
    message("Compute inter- and intra-group dispersion by PERMDISP: pairwise testswith/without", pairwise_method, "correction...")
    test.pair <- function(snp_eucli, group, nper = permutation_n, corr.method = pairwise_method) {
      comb.fact <- utils::combn(levels(group), 2)
      F.Model <- NULL
      pv.disp <- NULL
      for (i in 1:dim(comb.fact)[2]) { # iterate over group pair-wise comparisons
        message(paste0("Computing pairs ", paste(comb.fact[, i], collapse = " & "), "..."))
        GN <- group %in% comb.fact[, i]
        if (program_distance == "vegan") {
          dispCent_pair <- vegan::betadisper(d = stats::as.dist(as.matrix(snp_eucli, ncol = sampN)[GN, GN]), group[GN], type = dispersion_type, bias.adjust = samplesize_bias)
        } else {
          dispCent_pair <- vegan::betadisper(d = stats::as.dist(snp_eucli[GN, GN]), group[GN], type = dispersion_type, bias.adjust = samplesize_bias)
        }
        model.temp <- vegan::permutest(dispCent_pair, pairwise = FALSE, permutations = permutation_n)
        F.Model <- c(F.Model, model.temp$tab[[4]][1])
        pv.disp <- c(pv.disp, model.temp$tab[[6]][1])
      }
      pv.corr <- stats::p.adjust(pv.disp, method = pairwise_method)
      data.frame(GroupPair = paste(comb.fact[1,], comb.fact[2,], sep= "-"), F.Model = F.Model, P.value = pv.disp, P.value.corrected = pv.corr)
    }
    pairwiseTable <- test.pair(snp_eucli, group, nper = permutation_n) # run pairwise tests
    if (dispersion_type == "median") {
      message("Completed PERMDISP: pairwise tests based on group spatial medians")
    } else {
      message("Completed PERMDISP: pairwise tests based on group centroids")
    }
    message(paste0("Time elapsed: ", get.time(startT)))
  }
  if (pairwise == TRUE & length(levels(group)) == 2) {
    message(paste0("Pairwise tests not computed because number of groups is 2"))
  }

  if (samplesize_bias == TRUE) {
    disp_bias <- "Dispersion adjusted to number of samples per group"
  } else {
    disp_bias <- "Dispersion NOT adjusted to number of samples per group"
  }

  ##--------------------------------------------------------------##
  # 8. Tabulate analytical summary
  ##--------------------------------------------------------------##

  message("Tabulating PERMDISP outputs...")

  # Compile PERMDISP results in a list
  sample.out <- merge(samp_dat, data.table::data.table(I = sample_PCA, Class = "PERMDISP", Sample_dispersion = dispSample), by = "I") # summary of samples whether included in PCA, projected and/or removed
  if (!isFALSE(sample_remove)) {
    sample.out <- rbind(sample.out, data.table::data.table(I = sample_remove, Group = sample_group[sample_remove], Class = "Removed", Sample_dispersion = NA))
  }
  sample.out <- sample.out[order(I)] # retain original sample order
  sample.out[, I:=NULL] #remove index column
  if (pairwise == TRUE & length(levels(group)) > 2) {
    OUT <- list(permdisp.samples = as.data.frame(sample.out), permdisp_bias = disp_bias,
                permdisp.group_location = dispGroup, permdisp.global_test = globalTable.disp, permdisp.pairwise_test = pairwiseTable, permdisp.pairwise_correction = pairwise_method,
                permdisp.permutation_number = permutation_n, permdisp.permutation_seed = permutation_seed) # create list
  } else {
    OUT <- list(permdisp.samples = as.data.frame(sample.out), permdisp.bias = disp_bias,
                permdisp.group_location = dispGroup, permdisp.global_test = globalTable.disp, permdisp.pairwise_test = "No pairwise testing implemented",
                permdisp.permutation_number = permutation_n, permdisp.permutation_seed = permutation_seed) # create list
  }
  message("Completed tabulations of PERMDISP outputs")
  return(OUT)
}
# smartsnp v.1
# Coding end date = 08/09/2020
# smartsnp::smart_permdisp by Salvador Herrando-Perez (salherra@gmail.com) and Ray Tobler (tingalingx@gmail.com)