Nothing
# smartsnp v.1
# Coding start date = 07/07/2020
# smartsnp::smart_permdisp by Salvador Herrando-Perez (salherra@gmail.com) and Ray Tobler (tingalingx@gmail.com)
#
# Permutational Multivariate Analysis of dispersion (PERMDISP) of genotype data
# Computes global and pairwise MANOVA tests for differences in GROUP DISPERSION for multiple types of proximity measures between samples
# Alpha values estimated by permutations
# Implements scaling by centering, z-scored and SMARTPCA controlling for genetic drift
# Original description of PERMDISP at https://doi.org/10.1111/j.1461-0248.2006.00926.x / Anderson, Ellingsen & McArdle (2006) Multivariate dispersion as a measure of beta diversity. Ecology Letters, 9, 683-693
#' @name smart_permdisp
#' @title Smart Permutational Multivariate Analysis of Dispersion
#'
#' @description Computes Permutational Multivariate Analysis of Dispersion (PERMDISP) in group dispersion using multivariate data.
#' Variance partitioning computed on a sample-by-sample triangular matrix obtained from variable-by-sample data following Anderson (2006).
#' Calculates a range of inter-sample distances, similarities and dissimilarities. Includes control for genetic drift for bi-allelic genetic markers including single nucleotide polymorphisms (SNP) following Patterson, Price and Reich (2006) that can be combined with SMART Principal Component Analysis (PCA).
#' Optimized to run fast matrix building and permutations for big datasets in ecological, evolutionary and genomic research.\cr
#'
#' @details PERMDISP is a form of \code{homoscedasticity} test analogous to univariate Levene's (1960) and, more closely, Brown & Forsythe's (1974) tests. Applies PERMANOVA test (Anderson 2001) for differences in Euclidean dispersion among groups (Anderson 2006).
#' Proximity between samples can be any type of distance, similarity or dissimilarity.
#' Group dispersion estimated relative to group centroids (central point) or to spatial (geometric) medians (point minimizing distance to group samples) in Principal Coordinate Analysis (\code{PCoA}, Gower 1966) space.
#' Acronym PERMDISP originates from Marti Anderson's FORTRAN program (Anderson 2004).
#'
#' Control for unequal number of samples among groups optionally done by weighting sample distance to group spatial \emph{median} or \emph{centroid} by \emph{sqrt(n/(n-1))} (O'Neill & Mathews 2000), and the null hypothesis then being tested changes from \emph{d_1 = d_2 = ... = d_t} (balanced design) to \emph{((n_1-1)/n_1) x d_1 = ((n_2-1)/n_2) x d_2 = ... = ((n_t-1)/n_t) x d_t} (unbalanced design) where \emph{d} represents dispersion of groups \emph{1} to \emph{t}, and \emph{n} represents number of samples per group.
#' To attribute group differences to location (position of sample groups) and/or dispersion (spread of sample groups), PERMDISP must be combined with PERMANOVA as implemented through \code{smart_permanova}.\cr
#'
#' \code{smart_permdisp} uses \code{\link[vegan]{betadisper}} to estimate an ANOVA \emph{F} statistic and group dispersions using formula \code{snp_eucli ~ sample_group}, where \code{snp_eucli} is the sample-by-sample triangular matrix in \code{PCoA} space.
#' If >2 sample groups tested, \code{pairwise = TRUE} allows pairwise testing and correction for multiple testing by \code{holm (Holm)} [default], \code{hochberg (Hochberg)}, \code{hommel (Hommel)}, \code{bonferroni (Bonferroni)}, \code{BY (Benjamini-Yekuieli)}, \code{BH (Benjamini-Hochberg)} or \code{fdr (False Discovery Rate)}.\cr
#'
#' For big data, \code{\link[Rfast]{Dist}} builds sample-by-sample triangular matrix much faster than \code{\link[vegan]{vegdist}}. \code{\link[Rfast]{Dist}} computes proximities \code{euclidean}, \code{manhattan}, \code{canberra1}, \code{canberra2}, \code{minimum}, \code{maximum}, \code{minkowski}, \code{bhattacharyya}, \code{hellinger}, \code{kullback_leibler} and \code{jensen_shannon}. \code{\link[vegan]{vegdist}} computes \code{manhattan}, \code{euclidean}, \code{canberra}, \code{clark}, \code{bray}, \code{kulczynski}, \code{jaccard}, \code{gower}, \code{altGower}, \code{morisita}, \code{horn}, \code{mountford}, \code{raup}, \code{binomial}, \code{chao}, \code{cao} and \code{mahalanobis}.
#' Euclidean distance required for SMARTPCA scaling.\cr
#'
#' \code{sample_remove} should include both samples removed from PCA and ancient samples projected onto PCA space (if any).\cr
#'
#' Data read from working directory with SNPs as rows and samples as columns. Two alternative formats: (1) text file of SNPs by samples (file extension and column separators recognized automatically) read using \code{\link[data.table]{fread}}; or (2) duet of \code{EIGENSTRAT} files (see \url{https://reich.hms.harvard.edu/software}) using \code{\link[vroom]{vroom_fwf}}, including a genotype file of SNPs by samples \code{*.geno}, and a sample file (\code{*.ind}) containing three vectors assigning individual samples to unique user-predefined groups (populations), sexes (or other user-defined descriptor) and alphanumeric identifiers. For \code{EIGENSTRAT}, vector \code{sample_group} assigns samples to groups retrievable from column 3 of file \code{*.ind}.
#' SNPs with zero variance removed prior to SVD to optimize computation time and avoid undefined values if \code{scaling = "sd"} or \code{"drift"}.\cr
#'
#' Users can select subsets of samples or SNPs by introducing a vector including column numbers for samples (\code{sample_remove}) and/or row numbers for SNPs (\code{snp_remove}) to be removed from computations.
#' Function stops if the final number of SNPs is 1 or 2.
#' \code{EIGENSOFT} was conceived for the analysis of human genes and its \code{SMARTPCA} suite so accepts 22 (autosomal) chromosomes by default.
#' If >22 chromosomes are provided and the internal parameter \code{numchrom} is not set to the target number chromosomes of interest, SMARTPCA automatically subsets chromosomes 1 to 22.
#' In contrast, \code{smart_permdisp} accepts any number of autosomes with or without the sex chromosomes from an \code{EIGENSTRAT} file.\cr
#'
#' @param snp_data File name read from working directory. SNP = rows, samples = columns without row names or column headings.
#' SNP values must be count data (no decimals allowed).
#' File extension detected automatically whether text or \code{EIGENSTRAT}.
#' See details.
#' @param packed_data Logical value for \code{EIGENSTRAT}, irrelevant for text data.
#' Default \code{packed_data = FALSE} assumes uncompressed \code{EIGENSTRAT}.
#' \code{packed_data = TRUE} for compressed or binary \code{EIGENSTRAT} (\code{PACKENDANCESTRYMAP}).
#' @param sample_group Character or numeric vector assigning samples to groups. Coerced to factor.
#' @param sample_remove Logical \code{FALSE} or numeric vector indicating column numbers (samples) to be removed from computations.
#' Default \code{sample_remove = FALSE} keeps all samples.
#' @param snp_remove Logical \code{FALSE} or numeric vector indicating row numbers (SNPs) to be removed from computations.
#' Default \code{snp_remove = FALSE} keeps all SNPs.
#' See details.
#' @param missing_value Number \code{9} or string \code{NA} indicating missing value.
#' Default \code{missing_value = 9} as in \code{EIGENSTRAT}.
#' If no missing values present, no effect on computation.
#' @param missing_impute String handling missing values. Default \code{missing_impute = "mean"} replaces missing values of each SNP by mean of non-missing values across samples.
#' \code{missing_impute = "remove"} removes SNPs with at least one missing value.
#' If no missing values present, no effect on computation.
#' @param scaling String. Default \code{scaling = "drift"} scales SNPs to control for expected allele frequency dispersion caused by genetic drift (SMARTPCA). \code{scaling = "center"} for \code{centering} (covariance-based PCA). \code{scaling = "sd"} for \code{centered} SNPs divided by standard deviation (correlation-based PCA).
#' \code{scaling = "none"} for no scaling.
#' See details.
#' @param sample_distance Type of inter-sample proximity computed (distance, similarity, dissimilarity).
#' Default is \code{Euclidean distance}.
#' See details.
#' @param program_distance A string value indicating R package to estimate proximities between pairs of samples.
#' Default \code{program_distance = "Rfast"} uses function \code{\link[Rfast]{Dist}}; \code{program_distance = "vegan"} uses \code{\link[vegan]{vegdist}}.
#' See details.
#' @param target_space String.
#' Default \code{target_space = "multidimensional"} applies PERMANOVA to sample-by-sample triangular matrix computed from variable-by-sample data, \code{pc_axes} has no effect on computation.
#' \code{target_space = "pca"} applies PERMANOVA to sample-by-sample data in PCA space, \code{pc_axes} determines number of PCA axes for testing.
#' @param pc_axes Number of PCA axes computed always starting with PCA axis 1.
#' Default \code{pc_axes = 2} computes PCA axes 1 and 2 if \code{target_space = "pca"}.
#' No effect on computation if \code{target_space = "multidimensional"}.
#' @param pairwise Logical.
#' Default \code{pairwise = FALSE} computes global test.
#' \code{pairwise = TRUE} computes global and pairwise tests.
#' @param pairwise_method String specifying type of correction for multiple testing.
#' Default \code{"holm"}.
#' See details.
#' @param permutation_n Number of permutations resulting in PERMDISP test \emph{p value}.
#' Default \code{9999}.
#' @param permutation_seed Number fixing random generator of permutations.
#' Default \code{1}.
#' @param dispersion_type String indicating quantification of group dispersion whether relative to spatial \code{"median"} or \code{"centroid"}.
#' Default \code{"median"}. See details.
#' @param samplesize_bias Logical.
#' \code{samplesize_bias = TRUE} for dispersion weighted by number of samples per group.
#' Default \code{pairwise = FALSE} for no weighting.
#' See details.
#'
#' @return Returns a list containing the following elements:
#' \itemize{
#' \item{permdisp.samples}{Dataframe showing sample summary.
#' Column \emph{Group} assigns samples to tested groups.
#' Column \emph{Class} specifies if samples were used in, or removed from, testing (PERMDISP).
#' Column \emph{Sample_dispersion} shows dispersion of individual samples relative to spatial \code{"median"} or \code{"centroid"}.}
#' \item{permdisp.bias}{String indicating if PERMDISP dispersions corrected for number of samples per group.}
#' \item{permdisp.group_location}{Dataframe showing coordinates of spatial \code{"median"} or \code{"centroid"} per group.}
#' \item{permdisp.global_test}{List showing table with degrees of freedom, sum of squares, mean sum of squares, \emph{F} statistic and \emph{p} value.}
#' \item{permdisp.pairwise_test}{List showing table with \emph{F} statistic, \emph{p} value and corrected \emph{p} value per pair of groups.
#' Obtained only if \code{pairwise = TRUE}.}
#' \item{permdisp.pairwise_correction}{String indicating type of correction for multiple testing.}
#' \item{permdisp.permutation_number}{Number of permutations applied to obtain the distribution of \emph{F} statistic.}
#' \item{permdisp.permutation_seed}{Number fixing random generator of permutations for reproducibility of results.}
#' }
#'
#' @examples
#' # Path to example genotype matrix "dataSNP"
#' pathToGenoFile = system.file("extdata", "dataSNP", package = "smartsnp")
#'
#' # Assign 50 samples to each of two groups and colours
#' my_groups <- as.factor(c(rep("A", 50), rep("B", 50))); cols = c("red", "blue")
#'
#' # Run PERMDISP
#' permdispR <- smart_permdisp(snp_data = pathToGenoFile, sample_group = my_groups)
#'
#' # Extract summary table assigning samples to groups and dispersion of individual samples
#' permdispR$permdisp.samples
#'
#' # Extract PERMDISP table
#' permdispR$permdisp.global_test
#'
#' # Plot sample distances to group central medians
#' #run pca with truncated SVD (PCA 1 x PCA 2)
#' pcaR1 <- smart_pca(snp_data = pathToGenoFile, sample_group = my_groups)
#' #compute Euclidean inter-sample distances in PCA space (triangular matrix)
#' snp_eucli <- vegan::vegdist(pcaR1$pca.sample_coordinates[,c("PC1","PC2")], method = "euclidean")
#' #calculate spatial medians
#' disMed <- vegan::betadisper(d = snp_eucli, group = my_groups); disMed
#' #plot
#' oldpar <- par(mar = c(4, 4, 5, 0.1), lwd = 2)
#' boxplot(disMed, las =2, cex.axis = 2, cex.main = 1.5, horizontal = TRUE, varwidth = TRUE,
#' col = cols, xlab = "", ylab = "", main = "Sample distance to group spatial medians")
#' par(oldpar)
#'
#' @references
#' Anderson, M. J. (2001) A new method for non-parametric multivariate analysis of variance. Austral Ecology, 26, 32-46.\cr
#' Anderson, M. J. (2004). PERMANOVA_2factor: a FORTRAN computer program for permutational multivariate analysis of variance (for any two-factor ANOVA design) using permutation tests (Department of Statistics, University of Auckland, New Zealand).\cr
#' Anderson, M. J. & D. C. I. Walsh (2013) PERMANOVA, ANOSIM, and the Mantel test in the face of heterogeneous dispersions: What null hypothesis are you testing? Ecological Monographs, 83, 557-574.\cr
#' Gower, J. C. (1966) Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53, 325-338.\cr
#' McArdle, B. H. & M. J. Anderson (2001) Fitting multivariate models to community data: a comment on distance-based redundancy analysis. Ecology, 82, 290-297.\cr
#' Patterson, N., A. L. Price and D. Reich (2006) Population structure and eigenanalysis. PLoS Genetics, 2, e190.\cr
#' Warton, D. I., S. T. Wright and Y. Wang (2012) Distance-based multivariate analyses confound location and dispersion effects. Methods in Ecology and Evolution, 3, 89-101.
#'
#' @seealso \code{\link[vegan]{adonis}} (package \bold{vegan}),
#' \code{\link[Rfast]{Dist}} (package \bold{Rfast}),
#' \code{\link[data.table]{fread}} (package \bold{data.table}),
#' \code{\link[vegan]{vegdist}} (package \bold{vegan}),
#' \code{\link[vroom]{vroom_fwf}} (package \bold{vroom})
#'
#' @importFrom data.table :=
#' @importFrom data.table .N
#' @importFrom foreach %do%
#' @export
utils::globalVariables(c("i"))
smart_permdisp <- function(snp_data, packed_data = FALSE,
sample_group, sample_remove = FALSE, snp_remove = FALSE,
missing_value = 9, missing_impute = "remove",
scaling = "drift",
sample_distance = "euclidean", program_distance = "Rfast",
target_space = "multidimensional", pc_axes = 2,
pairwise = FALSE, pairwise_method = "holm",
permutation_n = 9999, permutation_seed = 1,
dispersion_type = "median", samplesize_bias = FALSE) {
##--------------------------------------------------------------##
# 1. Timing analytical steps
##--------------------------------------------------------------##
# Overall starting time
startT <- Sys.time()
# cats cumulative run time at specific points
get.time <- function(t) { # time conversion
tt <- as.numeric(difftime(Sys.time(), t, units = "secs"))
H <- tt %/% 3600
rr <- tt %% 3600
if(rr > 0) {
M <- rr %/% 60
rr2 <- rr %% 60
if(rr2 > 0) {
S <- round(rr2)
}else{
S <- 0
}
} else {
M <- 0
S <- 0
}
return(paste0(H, "h ", M, "m ", S, "s"))
}
##--------------------------------------------------------------##
# 2. Load data and filter samples and SNPs
##--------------------------------------------------------------##
message("Checking argument options selected...")
# Check options: convert to logical if needed
if (!is.numeric(sample_remove)) sample_remove <- as.logical(sample_remove)
# Print error and abort analysis
if (missing_impute != "remove" & missing_impute != "mean") { # if users misspel missing_impute
stop("Check spelling: missing_impute must be 'remove' or 'mean'")
}
if (missing_value != 9 & !is.na(missing_value)) { # if users assign missing values not equal to 9 or NA
stop("Missing values must be coded as 9 or NA")
}
if (scaling != "none" & scaling != "center" & scaling != "sd" & scaling != "drift") { # # if users misspel scaling
stop("Check spelling: scaling must be 'none', 'center', 'sd' or 'drift'")
}
if (program_distance != "Rfast" & program_distance != "vegan") { # if users misspel program_distance
stop("Check spelling: program_distance must be 'Rfast' or 'vegan'")
}
if (target_space != "multidimensional" & target_space != "pca") { # if users misspel target_space
stop("Check spelling: target_space must be 'multidimensional' or 'pca'")
}
message("Argument options are correct...")
message("Loading data...")
# Label samples
samp_dat <- data.table::data.table(sample_group) # tabulate group labels
data.table::setnames(samp_dat, "Group") # name column vector of group labels
samp_dat[, I:= 1:.N] # cbind column vectors with samples listed from 1 to length(sample_group)
sampN.full <- nrow(samp_dat) # total number of samples in dataset
# Define samples used for PERMDISP
sample_PCA <- 1:sampN.full # assume all samples for PERMDISP as default
if (!isFALSE(sample_remove)) {
sample_PCA <- setdiff(sample_PCA, sample_remove) # remove excluded samples
}
# Recalculate number of samples for PERMDISP
sampN <- length(sample_PCA)
# Avoid error if users unfamiliar with data formats set packed_data = FALSE but their data is binary (packedancestrymap format) following https://stackoverflow.com/questions/16350164/native-method-in-r-to-test-if-file-is-ascii
is.binary <- function(filepath, max = 1000) {
f = file(filepath, "rb", raw = TRUE)
b = readBin(f, "int", max, size = 1, signed = FALSE)
close(f)
return(max(b) > 128)
}
if (is.binary(snp_data) == TRUE) {
packed_data = TRUE
message("Data is binary (packedancestrymap)...")
} else {
packed_data = FALSE
}
# Read in data (columns = SNPs, rows = samples)
if(length(grep("\\.geno$", snp_data)) == 1) { # eigenstrat format
if(isFALSE(packed_data)) { # decompressed format
snp_dat <- vroom::vroom_fwf(file = snp_data, # fast-loading data as tibble
col_positions = vroom::fwf_widths(rep(1, sampN.full), col_names = NULL),
col_types = paste(rep("i", sampN.full), collapse=""))
snpN.full <- nrow(snp_dat) # number of SNP
message(paste("Imported", snpN.full, "SNP by", sampN.full, "sample eigenstrat genotype matrix (decompressed or unpacked format)"))
message(paste0("Time elapsed: ", get.time(startT)))
message("Filtering data...")
#convert to matrix
snp_dat1 <- do.call(cbind, snp_dat[, sample_PCA])
} else { # compressed binary format
ss <- strsplit(snp_data, "\\.")[[1]]
snp_data <- paste(ss[-length(ss)], collapse=".") # remove suffix
snp_dat <- read_packedancestrymap(pref = snp_data)$geno
attr(snp_dat, "dimnames") <- NULL # remove row and column name attributes
snpN.full <- nrow(snp_dat) # number of SNP
message(paste("Imported", snpN.full, "SNP by", sampN.full, "sample packed eigenstrat genotype matrix (compressed or packed format)"))
message(paste0("Time elapsed: ", get.time(startT)))
message("Filtering data...")
snp_dat1 <- snp_dat[, sample_PCA]
missing_value <- NA # reset missing value
}
} else { # generic input type (columns = samples, rows = SNPs)
snp_dat <- data.table::fread(file = snp_data, header = FALSE)
snpN.full <- nrow(snp_dat) # number of SNP
message(paste("Imported", snpN.full, "SNP by", sampN.full, "sample genotype matrix"))
message(paste0("Time elapsed: ", get.time(startT)))
message("Filtering data...")
snp_dat1 <- do.call(cbind, snp_dat[, sample_PCA, with = FALSE])
}
# Print error if users enter number of sample labels (sample_group) larger or smaller than number of samples in dataset (snp_dat)
if (length(sample_group) != ncol(snp_dat)) {
stop("length(sample_group) should be equal to number of samples in dataset: computation aborted")
}
# Remove original dataset from memory
rm(snp_dat); gc()
# Filter SNPS
if (!isFALSE(snp_remove)) {
snp.keep <- setdiff(1:snpN.full, snp_remove)
snpN.full <- length(snp.keep) # update SNP count
snp_dat1 <- snp_dat1[snp.keep, ] # subset SNPs by row number across modern samples
}
if (snpN.full < 3) {
stop("Less than 3 SNPs remaining: computation aborted")
}
message(paste(snpN.full, "SNPs included in PERMDISP computation"))
message(paste(length(snp_remove), "SNPs omitted from PCA computation"))
# Print number of samples used in PERMDISP
message(paste(length(sample_PCA), "samples included in PERMDISP computation"))
if (!isFALSE(sample_remove)) {
message(paste(length(sample_remove), "samples ommitted from PERMDISP computation"))
}
message("Completed data filtering")
message(paste0("Time elapsed: ", get.time(startT)))
##--------------------------------------------------------------##
# 3. Remove invariant SNPs
##--------------------------------------------------------------##
message("Scanning for invariant SNPs...")
# Identify invariant SNPs allowing for missing data
if (is.na(missing_value)) { # missing values are NAs
genoMean <- rowMeans(snp_dat1, na.rm = TRUE) # compute SNP means
genoVar <- Rfast::rowVars(snp_dat1, na.rm = TRUE) # compute SNP variances
} else { # missing values are numeric
genoTab <- Rfast::rowTabulate(snp_dat1) # count frequency of values per SNP
genoTab <- cbind(sampN - rowSums(genoTab), genoTab) # include 0's
GTR <- setdiff(1:ncol(genoTab) - 1, missing_value)
sumN <- rowSums(genoTab[, GTR+1])
sumX <- foreach::foreach(i = GTR, .combine = "+") %do% {
i * genoTab[, i+1]
}
sumX2 <- foreach::foreach(i = GTR, .combine = "+") %do% {
i^2 * genoTab[, i+1]
}
genoMean <- sumX / sumN
genoVar <- ((sumX2 / sumN) - genoMean ^ 2) * (sumN / (sumN - 1))
rm(sumN, sumX, sumX2)
}
keepSNPs <- which(genoVar != 0) # row number for SNPs with > 0 variance #SHP
# Remove invariant SNPs
if (length(keepSNPs > 0)) {
snp_dat1 <- snp_dat1[keepSNPs, ] # remove invariant SNPs from modern samples
genoMean <- genoMean[keepSNPs] # compute SNP means
genoVar <- genoVar[keepSNPs] # compute SNP variances
# if (!isFALSE(sample_project)) {
# snp_dat2 <- snp_dat2[keepSNPs, ] # remove invariant SNPs from ancient samples
# }
}
rm(keepSNPs) # remove vector with indices for variant SNPs
snpN <- nrow(snp_dat1) # number of SNPs with > 0 variance
if (snpN == snpN.full) {
message("Scan complete: no invariant SNPs found")
} else {
message(paste("Scan complete: removed", snpN.full - snpN, "invariant SNPs"))
}
message(paste0("Time elapsed: ", get.time(startT)))
##--------------------------------------------------------------##
# 4. Deal with missing values (SNP removal or imputation)
##--------------------------------------------------------------##
message("Checking for missing values...")
# Index cells with missing value (counting sequence = top to bottom then left to right sequence)
if (is.na(missing_value)) { # missing_value = NA
missI <- which(is.na(snp_dat1))
} else { # missing_value = numeric
missI <- which(snp_dat1 == missing_value)
}
# Impute SNP with missing values with mean genotype
if (length(missI) > 0) {
# Determine SNPs with missing values
Z <- missI %% snpN # label SNPs (rows) with missing values with 0
Z[Z == 0] <- snpN
snpMissI <- sort(unique(Z)) # row index for SNPs with missing values
snpMissN <- length(snpMissI) # total SNPs with missing values
message(paste(snpMissN, "SNPs contain missing values"))
# Mean imputation
if (missing_impute == "mean") {
message("Imputing mean for SNPs with missing values...")
snp_dat1[missI] <- genoMean[Z]
message(paste("Imputation completed:", length(missI), "missing values imputed"))
}
# Removal of SNPs with missing data
if (missing_impute == "remove") {
message("Removing SNPs with missing data...")
snp_dat1 <- snp_dat1[-snpMissI, ] #SNP
genoMean <- genoMean[-snpMissI] # update genotype means
genoVar <- genoVar[-snpMissI] # update genotype variance
snpN <- nrow(snp_dat1)
message(paste("Removal completed:", snpMissN, "SNPs removed"))
message(paste(snpN, "SNPs remaining"))
}
rm(missI, Z)
} else {
message("Scan completed: no missing values found")
rm(missI) # remove vector with cell number when missing value present
}
message(paste0("Time elapsed: ", get.time(startT)))
##--------------------------------------------------------------##
# 5. Scale values by SNP
##--------------------------------------------------------------##
if(scaling != "none") { # only run if standardization is requested
message("Scaling values by SNP...")
if (scaling == "drift") { # standardization controlling for genetic drift following Formula (3) by Patterson et al. 2006 (doi: 10.1371/journal.pgen.0020190)
message("Centering and scaling by drift dispersion...")
snp_drift <- sqrt((genoMean / 2) * (1 - genoMean/2))
snp_dat1 <- (snp_dat1 - genoMean) / snp_drift
}
if (scaling == "sd") { # z-score standardization whereby all SNPs have mean = 0 and variance = 1
message("Centering and standardizing (z-scores)...")
if (missing_impute == "mean") {
snp_sd <- Rfast::rowVars(snp_dat1, std = TRUE) # update variances to account for imputation
} else { # calculate sd on precomputed variance after removing invariant snps
snp_sd <- sqrt(genoVar)
}
snp_dat1 <- (snp_dat1 - genoMean) / snp_sd
}
if (scaling == "center") { # no standardization: this option runs a conventional PCA on raw (centered) data
message("Centering...")
snp_dat1 <- snp_dat1 - genoMean
}
message("Completed scaling")
message(paste0("Time elapsed: ", get.time(startT)))
} else {
message("Note: SNP-based scaling not used")
}
##--------------------------------------------------------------##
# 6. Construct triangular inter-sample distance matrix
##--------------------------------------------------------------##
# Compute sample by sample matrix of proximities using Rfast::Dist (fast) or vegan::vgdist (slow)
if (target_space == "multidimensional") { # proximity matrix computed from raw data (SNP x sample)
message("Construct triangular matrix of sample by sample proximities in multidimensional space...")
if (program_distance == "Rfast") { # Rfast
snp_eucli <- Rfast::Dist(t(snp_dat1), method = sample_distance) # default: method = "euclidean" (= Euclidean distance)
}
if (program_distance == "vegan") { # vegan
snp_eucli <- vegan::vegdist(t(snp_dat1), method = sample_distance) # default: method = "euclidean" (= Euclidean distance)
}
}
if (target_space == "pca") { # proximity matrix computed from PCA (principal components x sample)
message("Construct triangular matrix of sample by sample proximities in PCA space...")
snp_pca <- RSpectra::svds(t(snp_dat1), k = pc_axes) # single value decomposition for k axes
snp_pc <- data.frame(snp_pca$u %*% diag(snp_pca$d)) # one column per PCA axis, one row per sample
colnames(snp_pc) <- paste0("PC", c(1:ncol(snp_pc))) # rename columns with PC + number
if (program_distance == "Rfast") { # Rfast
snp_eucli <- Rfast::Dist(snp_pc, method = "euclidean") # default: method = "euclidean" (= Euclidean distance)
}
if (program_distance == "vegan") { # vegan
snp_eucli <- vegan::vegdist(snp_pc, method = "euclidean") # default: method = "euclidean" (= Euclidean distance)
}
}
message("Completed construction of triangular matrix of sample by sample proximities...")
##--------------------------------------------------------------##
# 7. Compute inter- and intra-group dispersion by PERMDISP
##--------------------------------------------------------------##
message(paste0("Compute inter- and intra-group dispersion by PERMDISP: global test..."))
# Create group variable
group <- factor(as.matrix(sample_group[sample_PCA]))
# Set seed of random generator
set.seed(permutation_seed)
# Compute PERMDISP (global test)
dispCent <- vegan::betadisper(d = stats::as.dist(snp_eucli), group = group, type = dispersion_type, bias.adjust = samplesize_bias) # estimate group central points (median or centroid)
globalTable.disp <- vegan::permutest(dispCent, pairwise = pairwise, permutations = permutation_n)[[1]][, -5] # run test and extract to ANOVA table and # remove 5th column with number of permutations
# Extract group spatial median (or centroid) and sample distances to spatial median (or dentroid)
dispGroup <- dispCent$centroids
# Extract sample distances to spatial median (or centroid)
dispSample <- dispCent$distances
if (dispersion_type == "median") {
message("Completed PERMDISP: global test based on group spatial medians")
} else {
message("Completed PERMDISP: global test based on group centroids")
}
message(paste0("Time elapsed: ", get.time(startT)))
# Compute PERMDISP (pairwise tests) / if data contains two groups, pairwise test should mirror global test
if (pairwise == TRUE & length(levels(group)) > 2) {
message("Compute inter- and intra-group dispersion by PERMDISP: pairwise testswith/without", pairwise_method, "correction...")
test.pair <- function(snp_eucli, group, nper = permutation_n, corr.method = pairwise_method) {
comb.fact <- utils::combn(levels(group), 2)
F.Model <- NULL
pv.disp <- NULL
for (i in 1:dim(comb.fact)[2]) { # iterate over group pair-wise comparisons
message(paste0("Computing pairs ", paste(comb.fact[, i], collapse = " & "), "..."))
GN <- group %in% comb.fact[, i]
if (program_distance == "vegan") {
dispCent_pair <- vegan::betadisper(d = stats::as.dist(as.matrix(snp_eucli, ncol = sampN)[GN, GN]), group[GN], type = dispersion_type, bias.adjust = samplesize_bias)
} else {
dispCent_pair <- vegan::betadisper(d = stats::as.dist(snp_eucli[GN, GN]), group[GN], type = dispersion_type, bias.adjust = samplesize_bias)
}
model.temp <- vegan::permutest(dispCent_pair, pairwise = FALSE, permutations = permutation_n)
F.Model <- c(F.Model, model.temp$tab[[4]][1])
pv.disp <- c(pv.disp, model.temp$tab[[6]][1])
}
pv.corr <- stats::p.adjust(pv.disp, method = pairwise_method)
data.frame(GroupPair = paste(comb.fact[1,], comb.fact[2,], sep= "-"), F.Model = F.Model, P.value = pv.disp, P.value.corrected = pv.corr)
}
pairwiseTable <- test.pair(snp_eucli, group, nper = permutation_n) # run pairwise tests
if (dispersion_type == "median") {
message("Completed PERMDISP: pairwise tests based on group spatial medians")
} else {
message("Completed PERMDISP: pairwise tests based on group centroids")
}
message(paste0("Time elapsed: ", get.time(startT)))
}
if (pairwise == TRUE & length(levels(group)) == 2) {
message(paste0("Pairwise tests not computed because number of groups is 2"))
}
if (samplesize_bias == TRUE) {
disp_bias <- "Dispersion adjusted to number of samples per group"
} else {
disp_bias <- "Dispersion NOT adjusted to number of samples per group"
}
##--------------------------------------------------------------##
# 8. Tabulate analytical summary
##--------------------------------------------------------------##
message("Tabulating PERMDISP outputs...")
# Compile PERMDISP results in a list
sample.out <- merge(samp_dat, data.table::data.table(I = sample_PCA, Class = "PERMDISP", Sample_dispersion = dispSample), by = "I") # summary of samples whether included in PCA, projected and/or removed
if (!isFALSE(sample_remove)) {
sample.out <- rbind(sample.out, data.table::data.table(I = sample_remove, Group = sample_group[sample_remove], Class = "Removed", Sample_dispersion = NA))
}
sample.out <- sample.out[order(I)] # retain original sample order
sample.out[, I:=NULL] #remove index column
if (pairwise == TRUE & length(levels(group)) > 2) {
OUT <- list(permdisp.samples = as.data.frame(sample.out), permdisp_bias = disp_bias,
permdisp.group_location = dispGroup, permdisp.global_test = globalTable.disp, permdisp.pairwise_test = pairwiseTable, permdisp.pairwise_correction = pairwise_method,
permdisp.permutation_number = permutation_n, permdisp.permutation_seed = permutation_seed) # create list
} else {
OUT <- list(permdisp.samples = as.data.frame(sample.out), permdisp.bias = disp_bias,
permdisp.group_location = dispGroup, permdisp.global_test = globalTable.disp, permdisp.pairwise_test = "No pairwise testing implemented",
permdisp.permutation_number = permutation_n, permdisp.permutation_seed = permutation_seed) # create list
}
message("Completed tabulations of PERMDISP outputs")
return(OUT)
}
# smartsnp v.1
# Coding end date = 08/09/2020
# smartsnp::smart_permdisp by Salvador Herrando-Perez (salherra@gmail.com) and Ray Tobler (tingalingx@gmail.com)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.