smart_mva | R Documentation |
Computes Principal Component Analysis (PCA) for variable x sample genotype data, such as Single Nucleotide Polymorphisms (SNP), in combination with Permutational Multivariate Analysis of Variance (PERMANOVA) and Permutational Multivariate Analysis of Dispersion (PERMDISP).
A wrapper of functions smart_pca, smart_permanova and smart_permdisp.
Genetic markers such as SNPs can be scaled by centering, z-scores and genetic drift-based dispersion.
The latter follows the SMARTPCA implementation of Patterson, Price and Reich (2006).
Optimized for fast computation on large datasets.
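The three scaling options can be illustrated with a minimal base R sketch. It is not the package's internal code: the toy matrix and variable names are hypothetical, and the drift divisor assumes diploid allele counts (0, 1 or 2), so that the allele frequency is half the mean count, as described by Patterson, Price and Reich (2006).

```r
# Toy genotype matrix: SNPs in rows, samples in columns, values are
# allele counts (0, 1 or 2), the same layout smart_mva reads.
geno <- matrix(c(0, 1, 2, 1,
                 2, 2, 1, 0,
                 0, 0, 1, 1),
               nrow = 3, byrow = TRUE)

snp_mean <- rowMeans(geno)        # mean allele count per SNP
centered <- geno - snp_mean       # centering only (vector recycles down rows)

# z-scores: centered values divided by the per-SNP standard deviation
zscored <- centered / apply(geno, 1, sd)

# Drift-based scaling (assumption: diploid data, allele frequency
# p = mean / 2, divisor sqrt(p * (1 - p)))
p <- snp_mean / 2
drift <- centered / sqrt(p * (1 - p))
```

SNPs with zero variance would give a zero divisor in the last two scalings, which is why such SNPs have to be dropped beforehand.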
snp_data |
File name read from working directory.
SNP = rows, samples = columns without row names or column headings.
SNP values must be count data (no decimals allowed).
File extension detected automatically whether text or |
packed_data |
Logical value for |
sample_group |
Character or numeric vector assigning samples to groups. Coerced to factor. |
sample_remove |
Logical |
snp_remove |
Logical |
pca |
Logical indicating if PCA is computed.
Default |
permanova |
Logical indicating if PERMANOVA is computed.
Default |
permdisp |
Logical indicating if PERMDISP is computed.
Default |
missing_value |
Number |
missing_impute |
String handling missing values.
Default |
scaling |
String. Default |
program_svd |
String indicating R package computing singular value decomposition (SVD).
Default |
sample_project |
Numeric vector indicating column numbers (ancient samples) projected onto (modern) PCA space.
Default |
pc_project |
Numeric vector indicating the ranks of the PCA axes ancient samples are projected onto. Default |
sample_distance |
Type of inter-sample proximity computed (distance, similarity, dissimilarity).
Default is |
program_distance |
A string value indicating R package to estimate proximities between pairs of samples.
Default |
target_space |
String.
Default |
pc_axes |
Number of PCA axes computed always starting with PCA axis 1.
Default |
pairwise |
Logical.
Default |
pairwise_method |
String specifying type of correction for multiple testing.
Default |
permutation_n |
Number of permutations resulting in PERMANOVA/PERMDISP test p value.
Default |
permutation_seed |
Number fixing random generator of permutations.
Default |
dispersion_type |
String indicating quantification of group dispersion whether relative to spatial |
samplesize_bias |
Logical. |
See details in other functions for conceptualization of PCA (smart_pca) (Hotelling 1933), SMARTPCA (Patterson, Price and Reich 2006), PERMANOVA (smart_permanova) (Anderson 2001) and PERMDISP (smart_permdisp) (Anderson 2006), types of scaling, ancient projection, and correction for multiple testing.
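As a rough illustration of correction for multiple testing, the sketch below applies the Holm method to three pairwise p-values with base R's p.adjust. The p-values and comparison names are hypothetical, and the assumption that pairwise_method takes the same method names as p.adjust should be checked against the smart_permanova and smart_permdisp help pages.

```r
# Hypothetical raw p-values from three pairwise group comparisons
raw_p <- c(A_vs_B = 0.01, A_vs_C = 0.04, B_vs_C = 0.03)

# Holm correction via base R; other method names are in p.adjust.methods
holm_p <- p.adjust(raw_p, method = "holm")
print(holm_p)
```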
Users can compute any combination of the three analyses by assigning TRUE or FALSE to pca and/or permanova and/or permdisp.
PERMANOVA and PERMDISP exclude samples (columns) specified in either sample_remove or sample_project.
Projected samples are not used for testing as their PCA coordinates are derived from, and therefore depend on, the coordinates of non-projected samples.
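The dependence of projected coordinates on the non-projected samples can be seen in a simplified base R sketch. It assumes complete data (no missing genotypes) and centering only. The toy matrix and the names ref_cols and proj_cols are hypothetical, and the package's own projection may differ, for example in how missing values are handled.

```r
set.seed(1)
# Toy data: 20 SNPs (rows) x 8 samples (columns), allele counts 0/1/2
geno <- matrix(sample(0:2, 160, replace = TRUE), nrow = 20)
ref_cols  <- 1:6   # "modern" samples defining the PCA space
proj_cols <- 7:8   # "ancient" samples to be projected

# Center every SNP on the mean of the reference samples only
snp_mean <- rowMeans(geno[, ref_cols])
ref  <- geno[, ref_cols]  - snp_mean
proj <- geno[, proj_cols] - snp_mean

# PCA of the reference samples via SVD: ref = U D V'
sv <- svd(ref)
loadings   <- sv$u[, 1:2]            # SNP loadings for PC1 and PC2
ref_coords <- t(ref) %*% loadings    # reference sample coordinates

# Projected coordinates reuse the reference means and loadings
proj_coords <- t(proj) %*% loadings
```

Because proj_coords is computed entirely from quantities estimated on the reference samples, the projected samples carry no independent information for the tests.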
Data are read from the working directory with SNPs as rows and samples as columns. Two alternative formats are accepted: (1) a text file of SNPs by samples (file extension and column separators recognized automatically) read using fread; or (2) a pair of EIGENSTRAT files (see https://reich.hms.harvard.edu/software) read using vroom_fwf, including a genotype file of SNPs by samples (*.geno) and a sample file (*.ind) containing three vectors assigning individual samples to unique user-predefined groups (populations), sexes (or other user-defined descriptor) and alphanumeric identifiers.
For EIGENSTRAT, vector sample_group assigns samples to groups retrievable from column 3 of file *.ind.
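One possible way to retrieve the group labels from column 3 of a *.ind file is sketched below in base R. The file contents, sample names and column names are hypothetical, and read.table is used here only for illustration.

```r
# Write a toy *.ind file; columns are sample identifier, sex and group
# label (all values here are hypothetical)
ind_path <- tempfile(fileext = ".ind")
writeLines(c("sample1 M PopA",
             "sample2 F PopA",
             "sample3 U PopB"), ind_path)

# Read all three columns as character to avoid type guessing
ind <- read.table(ind_path, header = FALSE,
                  col.names = c("id", "sex", "group"),
                  colClasses = "character")

# Column 3 holds the group labels that can be passed as sample_group
my_groups <- as.factor(ind$group)
```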
SNPs with zero variance are removed prior to SVD to optimize computation time and avoid undefined values if scaling = "sd" or "drift".
Users can select subsets of samples or SNPs by introducing a vector including column numbers for samples (sample_remove) and/or row numbers for SNPs (snp_remove) to be removed from computations.
Function stops if the final number of SNPs is 1 or 2.
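The filtering steps described above can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's internal code: the toy matrix and the index values are hypothetical, and the final guard also stops when no SNPs remain.

```r
# Toy genotype matrix: 5 SNPs (rows) x 4 samples (columns)
geno <- matrix(c(0, 1, 2, 1,
                 1, 1, 1, 1,   # invariant SNP
                 2, 0, 1, 0,
                 0, 0, 1, 2,
                 2, 2, 0, 1),
               nrow = 5, byrow = TRUE)

sample_remove <- c(4)   # hypothetical: drop the sample in column 4
snp_remove    <- c(5)   # hypothetical: drop the SNP in row 5

kept <- geno[-snp_remove, -sample_remove, drop = FALSE]

# Drop SNPs with zero variance across the remaining samples
snp_var <- apply(kept, 1, var)
kept <- kept[snp_var > 0, , drop = FALSE]

# Guard similar to the documented one: stop when fewer than 3 SNPs remain
if (nrow(kept) < 3) stop("too few SNPs left for analysis")
dim(kept)
```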
EIGENSOFT was conceived for the analysis of human genes, so its SMARTPCA suite accepts 22 (autosomal) chromosomes by default.
If >22 chromosomes are provided and the internal parameter numchrom is not set to the target number of chromosomes of interest, SMARTPCA automatically subsets chromosomes 1 to 22.
In contrast, smart_mva accepts any number of autosomes, with or without the sex chromosomes, from an EIGENSTRAT file.
Returns a list containing the following elements:
pca.snp_loadings: Dataframe of principal coefficients of SNPs. One set of coefficients per PCA axis computed.
pca.eigenvalues: Dataframe of eigenvalues, variance and cumulative variance explained. One eigenvalue per PCA axis computed.
pca.sample_coordinates: Dataframe showing PCA sample summary. Column Group assigns samples to groups. Column Class specifies if samples were "Removed" from PCA or "Projected" onto PCA space. Additional columns show principal components (coordinates) of samples in PCA space (e.g., PC1, PC2, ...).
test_samples: Dataframe showing test sample summary. Column Group assigns samples to tested groups. Column Class specifies if samples were used in or removed from testing (PERMANOVA and/or PERMDISP). Column Sample_dispersion shows sample dispersion relative to spatial "median" or "centroid" used in PERMDISP.
permanova.global_test: List with PERMANOVA results including degrees of freedom, sum of squares, mean sum of squares, F statistic, variance explained (R2), and p-value.
permanova.pairwise_test: List with PERMANOVA results including F statistic, variance explained (R2), p-value and corrected p-value per group pair.
permdisp.global_test: List with PERMDISP results including degrees of freedom, sum of squares, mean sum of squares, F statistic, and p-value.
permdisp.pairwise_test: List with PERMDISP results including F statistic, p-value, and corrected p-value per group pair. Only returned if pairwise = TRUE.
permdisp.bias: Character string indicating whether PERMDISP dispersion was corrected for unequal group sizes.
permdisp.group_location: Dataframe showing coordinates of group "medians" or "centroids" in PERMDISP.
test.pairwise_correction: Character string describing the multiple testing correction used in PERMANOVA and/or PERMDISP.
test.permutation_number: Number of permutations used to calculate F-statistics.
test.permutation_seed: Seed used for reproducible permutation results in PERMANOVA and/or PERMDISP.
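To show how the number of permutations and the seed enter a permutation test, here is a minimal base R sketch of a PERMANOVA-style pseudo-F test on squared Euclidean distances, following the partitioning described by Anderson (2001). The function pseudo_f, the simulated data and the p-value convention are assumptions made for illustration. They are not the code that smart_mva runs.

```r
pseudo_f <- function(d2, groups) {
  # d2: matrix of squared inter-sample distances; groups: factor
  n <- length(groups)
  a <- nlevels(groups)
  ss_total <- sum(d2[upper.tri(d2)]) / n
  ss_within <- sum(vapply(levels(groups), function(g) {
    idx <- which(groups == g)
    sub <- d2[idx, idx, drop = FALSE]
    sum(sub[upper.tri(sub)]) / length(idx)
  }, numeric(1)))
  ((ss_total - ss_within) / (a - 1)) / (ss_within / (n - a))
}

set.seed(1)                      # plays the role of permutation_seed
coords <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
                matrix(rnorm(20, mean = 3), ncol = 2))
groups <- factor(rep(c("A", "B"), each = 10))
d2 <- as.matrix(dist(coords))^2  # squared Euclidean distances

permutation_n <- 999             # hypothetical number of permutations
f_obs  <- pseudo_f(d2, groups)
# Shuffle the group labels and recompute F for each permutation
f_perm <- replicate(permutation_n, pseudo_f(d2, sample(groups)))

# p-value: share of permuted F values at least as large as the observed one
p_value <- (sum(f_perm >= f_obs) + 1) / (permutation_n + 1)
```

Fixing the seed makes the shuffled labels, and therefore the p-value, reproducible across runs. A larger number of permutations gives a finer resolution for the smallest attainable p-value.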
smart_pca, smart_permanova, smart_permdisp
# Load the package providing smart_mva and the example data
library(smartsnp)
# Path to example genotype matrix "dataSNP"
pathToGenoFile <- system.file("extdata", "dataSNP", package = "smartsnp")
# Assign 50 samples to each of two groups and colors
my_groups <- as.factor(c(rep("A", 50), rep("B", 50)))
cols <- c("red", "blue")
# Run PCA, PERMANOVA and PERMDISP
mvaR <- smart_mva(snp_data = pathToGenoFile, sample_group = my_groups)
mvaR$pca$pca.eigenvalues # extract PCA eigenvalues
head(mvaR$pca$pca.snp_loadings) # extract principal coefficients (SNP loadings)
head(mvaR$pca$pca.sample_coordinates) # extract PCA PCs (sample position in PCA space)
# Plot PCA
plot(mvaR$pca$pca.sample_coordinates[, c("PC1", "PC2")], cex = 2,
     pch = 19, col = cols[my_groups], main = "genotype smartpca")
legend("topleft", legend = levels(my_groups), cex = 1,
       pch = 19, col = cols, text.col = cols)
# Extract PERMANOVA table
mvaR$test$permanova.global_test
# Extract PERMDISP table
mvaR$test$permdisp.global_test
# Extract sample summary and dispersion of individual samples used in PERMDISP
mvaR$test$test_samples