smart_mva: Smart Multivariate Analyses (wrapper of PCA, PERMANOVA and...

View source: R/smart_mva.R

Description

Computes Principal Component Analysis (PCA) for variable x sample genotype data, such as Single Nucleotide Polymorphisms (SNP), in combination with Permutational Multivariate Analysis of Variance (PERMANOVA) and Permutational Multivariate Analysis of Dispersion (PERMDISP). A wrapper of functions smart_pca, smart_permanova and smart_permdisp. Genetic markers such as SNPs can be scaled by centering, z-scores or genetic drift-based dispersion. The latter follows the SMARTPCA implementation of Patterson, Price and Reich (2006). Optimized for fast computation on large datasets.

Arguments

snp_data

File name read from working directory. SNP = rows, samples = columns without row names or column headings. SNP values must be count data (no decimals allowed). File extension detected automatically whether text or EIGENSTRAT. See details.

packed_data

Logical value for EIGENSTRAT, irrelevant for text data. Default packed_data = FALSE assumes uncompressed EIGENSTRAT. packed_data = TRUE for compressed or binary EIGENSTRAT (PACKENDANCESTRYMAP).

sample_group

Character or numeric vector assigning samples to groups. Coerced to factor.

sample_remove

Logical FALSE or numeric vector indicating column numbers (samples) to be removed from computations. Default sample_remove = FALSE keeps all samples.

snp_remove

Logical FALSE or numeric vector indicating row numbers (SNPs) to be removed from computations. Default snp_remove = FALSE keeps all SNPs. See details.

pca

Logical indicating if PCA is computed. Default TRUE.

permanova

Logical indicating if PERMANOVA is computed. Default TRUE.

permdisp

Logical indicating if PERMDISP is computed. Default TRUE.

missing_value

Number 9 or string NA indicating missing value. Default missing_value = 9 as in EIGENSTRAT. If no missing values present, no effect on computation.

missing_impute

String handling missing values. Default missing_impute = "mean" replaces missing values of each SNP by mean of non-missing values across samples. missing_impute = "remove" removes SNPs with at least one missing value. If no missing values present, no effect on computation.
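The two ways of handling missing values described above can be illustrated in base R. This is a minimal sketch on a made-up genotype matrix, not the package's internal code; the object names (geno, geno_mean, geno_complete) are hypothetical.

```r
# Toy genotype matrix: 4 SNPs (rows) x 5 samples (columns); 9 flags a missing call
geno <- matrix(c(0, 1, 2, 9, 1,
                 2, 2, 1, 0, 0,
                 9, 0, 0, 1, 9,
                 1, 1, 2, 2, 0),
               nrow = 4, byrow = TRUE)
missing_value <- 9

# Recode the missing-value flag as NA so it is excluded from the SNP means
geno_na <- geno
geno_na[geno_na == missing_value] <- NA

# "mean"-style handling: fill each gap with that SNP's mean over non-missing samples
snp_means <- rowMeans(geno_na, na.rm = TRUE)
gaps <- which(is.na(geno_na), arr.ind = TRUE)
geno_mean <- geno_na
geno_mean[gaps] <- snp_means[gaps[, "row"]]

# "remove"-style handling: keep only SNPs without any missing value
geno_complete <- geno_na[rowSums(is.na(geno_na)) == 0, , drop = FALSE]
```

With mean imputation all four SNPs are retained, whereas removal keeps only the two SNPs that have complete data.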

scaling

String. Default scaling = "drift" scales SNPs to control for expected allele frequency dispersion caused by genetic drift (SMARTPCA). scaling = "center" for centering (covariance-based PCA). scaling = "sd" for centered SNPs divided by standard deviation (correlation-based PCA). scaling = "none" for no scaling. See details.
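As a rough illustration of the three scaling options, the base R sketch below applies them to a toy matrix of diploid allele counts. The drift scaling follows the normalization in Patterson, Price and Reich (2006), where each centered SNP is divided by sqrt(p(1 - p)) with p estimated as half the SNP mean; whether smart_mva reproduces this formula exactly is an assumption here, and all object names are hypothetical.

```r
# Toy matrix of allele counts (0, 1, 2): 3 SNPs (rows) x 6 samples (columns)
geno <- matrix(c(0, 1, 2, 1, 0, 2,
                 2, 2, 1, 2, 2, 1,
                 0, 0, 1, 0, 1, 1),
               nrow = 3, byrow = TRUE)

# "center": subtract each SNP's mean (the vector recycles down the rows)
snp_mean <- rowMeans(geno)
centered <- geno - snp_mean

# "sd": centered SNPs divided by their standard deviation (z-scores)
snp_sd <- apply(geno, 1, sd)
z_scored <- centered / snp_sd

# "drift": centered SNPs divided by sqrt(p * (1 - p)),
# with p the allele frequency estimated from diploid counts
p <- snp_mean / 2
drift_scaled <- centered / sqrt(p * (1 - p))
```

Both the "sd" and "drift" divisors are zero for a monomorphic SNP, which is why such SNPs have to be dropped before scaling.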

program_svd

String indicating R package computing singular value decomposition (SVD). Default program_svd = "Rspectra" for svds. program_svd = "bootSVD" for fastSVD. See details.

sample_project

Numeric vector indicating column numbers (ancient samples) projected onto (modern) PCA space. Default sample_project = FALSE implements no projection. See details.

pc_project

Numeric vector indicating the ranks of the PCA axes ancient samples are projected onto. Default pc_project = c(1, 2) for PCA axes 1 and 2. If program_svd = "RSpectra", length(pc_project) must be smaller than or equal to pc_axes. No effect on computation if no ancient samples are present.
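The idea of projecting samples onto a PCA space built from other samples can be sketched in base R as follows. This is a plain orthogonal projection on simulated, centered data, assumed here purely for illustration; the package's own projection (for example its treatment of missing genotypes or of scaling) may differ. The column indices and object names are hypothetical.

```r
set.seed(1)
# Simulated allele counts: 20 SNPs (rows) x 8 samples (columns)
geno <- matrix(rbinom(20 * 8, size = 2, prob = 0.4), nrow = 20)
sample_project <- c(7, 8)   # hypothetical columns treated as ancient samples
pc_project <- c(1, 2)       # PCA axes the ancient samples are projected onto

# PCA is computed from the reference (non-projected) samples only
reference <- geno[, -sample_project, drop = FALSE]
snp_mean <- rowMeans(reference)
ref_centered <- reference - snp_mean
sv <- svd(ref_centered)                        # ref_centered = U D V'
loadings <- sv$u[, pc_project, drop = FALSE]   # SNP loadings of the chosen axes
ref_coords <- t(ref_centered) %*% loadings     # reference sample coordinates

# Ancient samples are centered with the reference means, then projected
ancient_centered <- geno[, sample_project, drop = FALSE] - snp_mean
ancient_coords <- t(ancient_centered) %*% loadings
```

Because the ancient coordinates are computed entirely from the reference loadings, they do not influence the position of the axes themselves.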

sample_distance

Type of inter-sample proximity computed (distance, similarity, dissimilarity). Default is Euclidean distance. See details.

program_distance

A string value indicating R package to estimate proximities between pairs of samples. Default program_distance = "Rfast" uses function Dist; program_distance = "vegan" uses vegdist. See details.

target_space

String. Default target_space = "multidimensional" applies PERMANOVA and/or PERMDISP to sample-by-sample triangular matrix computed from variable-by-sample data; pc_axes has no effect on computation. target_space = "pca" applies PERMANOVA and/or PERMDISP to sample-by-sample data in PCA space; pc_axes determines number of PCA axes for testing.

pc_axes

Number of PCA axes computed always starting with PCA axis 1. Default pc_axes = 2 computes PCA axes 1 and 2 if target_space = "pca". No effect on computation if target_space = "multidimensional".

pairwise

Logical. Default pairwise = FALSE computes global test. pairwise = TRUE computes global and pairwise tests.

pairwise_method

String specifying type of correction for multiple testing. Default "holm".
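To show what the default "holm" correction does to a set of pairwise p-values, the sketch below uses stats::p.adjust from base R. It assumes that pairwise_method accepts the method names of p.adjust, which the text above does not state explicitly; the p-values are invented for illustration.

```r
# Hypothetical raw p-values from three pairwise group comparisons
p_raw <- c(A_vs_B = 0.010, A_vs_C = 0.040, B_vs_C = 0.030)

# Holm correction: the smallest p-value is multiplied by 3, the next by 2,
# the largest by 1, and monotonicity is then enforced
p_holm <- p.adjust(p_raw, method = "holm")
p_holm
```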

permutation_n

Number of permutations used to compute the PERMANOVA/PERMDISP test p value. Default 9999.

permutation_seed

Number fixing random generator of permutations. Default 1.
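How a number of permutations and a fixed seed translate into a reproducible p value can be sketched with a generic permutation test in base R. The test statistic below (absolute difference between two group means) is a simple stand-in chosen for brevity, not the pseudo-F statistic of PERMANOVA, and the data are simulated.

```r
set.seed(1)   # plays the role of permutation_seed: fixes the random draws
group <- factor(rep(c("A", "B"), each = 10))
y <- c(rnorm(10, mean = 0), rnorm(10, mean = 2))

# Stand-in test statistic: absolute difference between the two group means
stat <- function(values, labels) abs(diff(tapply(values, labels, mean)))

permutation_n <- 999
observed <- stat(y, group)
# Shuffle the group labels and recompute the statistic for each permutation
permuted <- replicate(permutation_n, stat(y, sample(group)))

# The observed arrangement counts as one of the possible permutations
p_value <- (sum(permuted >= observed) + 1) / (permutation_n + 1)
```

The smallest attainable p value is 1 / (permutation_n + 1), so the default of 9999 permutations allows p values down to 0.0001.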

dispersion_type

String indicating whether group dispersion in PERMDISP is quantified relative to the spatial "median" or the "centroid". Default "median".

samplesize_bias

Logical. samplesize_bias = TRUE for dispersion weighted by number of samples per group in PERMDISP. Default samplesize_bias = FALSE for no weighting.

Details

See details in other functions for conceptualization of PCA (smart_pca) (Hotelling 1933), SMARTPCA (Patterson, Price and Reich 2006), PERMANOVA (smart_permanova) (Anderson 2001) and PERMDISP (smart_permdisp) (Anderson 2006), types of scaling, ancient projection, and correction for multiple testing.

Users can compute any combination of the three analyses by assigning TRUE or FALSE to pca and/or permanova and/or permdisp.

PERMANOVA and PERMDISP exclude samples (columns) specified in either sample_remove or sample_project. Projected samples are not used for testing as their PCA coordinates are derived from, and therefore depend on, the coordinates of non-projected samples.
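The exclusion rule above amounts to a set difference on column indices. A minimal sketch with hypothetical index vectors:

```r
n_samples <- 10
sample_remove <- c(2, 5)     # hypothetical columns dropped from all computations
sample_project <- c(9, 10)   # hypothetical columns projected onto the PCA space

# Columns that remain available for PERMANOVA and PERMDISP
tested <- setdiff(seq_len(n_samples), union(sample_remove, sample_project))
tested
```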

Data are read from the working directory with SNPs as rows and samples as columns. Two alternative formats are accepted: (1) a text file of SNPs by samples (file extension and column separators recognized automatically), read using fread; or (2) a pair of EIGENSTRAT files (see https://reich.hms.harvard.edu/software), read using vroom_fwf, comprising a genotype file of SNPs by samples (*.geno) and a sample file (*.ind) containing three vectors assigning individual samples to unique user-predefined groups (populations), sexes (or other user-defined descriptor) and alphanumeric identifiers. For EIGENSTRAT, vector sample_group assigns samples to groups retrievable from column 3 of file *.ind. SNPs with zero variance are removed prior to SVD to optimize computation time and avoid undefined values if scaling = "sd" or "drift".
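Why zero-variance SNPs need to be dropped before scaling can be seen in a small base R sketch. The matrix and object names are made up for illustration and do not come from the package.

```r
# Toy genotype matrix: 3 SNPs (rows) x 4 samples (columns)
geno <- matrix(c(0, 1, 2, 1,
                 2, 2, 2, 2,    # monomorphic SNP: zero variance
                 0, 0, 1, 1),
               nrow = 3, byrow = TRUE)

# Keep only SNPs that vary across samples
snp_var <- apply(geno, 1, var)
geno_variable <- geno[snp_var > 0, , drop = FALSE]

# Without the filter, scaling the monomorphic SNP divides zero by zero
undefined <- (geno[2, ] - mean(geno[2, ])) / sd(geno[2, ])
```

Here undefined contains only NaN, which would propagate into the SVD if the SNP were kept.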

Users can select subsets of samples or SNPs by introducing a vector including column numbers for samples (sample_remove) and/or row numbers for SNPs (snp_remove) to be removed from computations. Function stops if the final number of SNPs is 1 or 2. EIGENSOFT was conceived for the analysis of human genes, so its SMARTPCA suite accepts 22 (autosomal) chromosomes by default. If >22 chromosomes are provided and the internal parameter numchrom is not set to the target number of chromosomes of interest, SMARTPCA automatically subsets chromosomes 1 to 22. In contrast, smart_mva accepts any number of autosomes with or without the sex chromosomes from an EIGENSTRAT file.

Value

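Returns a list. As accessed in the Examples, element pca holds pca.eigenvalues, pca.snp_loadings and pca.sample_coordinates, and element test holds permanova.global_test, permdisp.global_test and test_samples.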

See Also

smart_pca, smart_permanova, smart_permdisp

Examples

# Path to example genotype matrix "dataSNP"
pathToGenoFile = system.file("extdata", "dataSNP", package = "smartsnp")

# Assign 50 samples to each of two groups and colors
my_groups <- as.factor(c(rep("A", 50), rep("B", 50)))
cols <- c("red", "blue")

# Run PCA, PERMANOVA and PERMDISP
mvaR <- smart_mva(snp_data = pathToGenoFile, sample_group = my_groups)
mvaR$pca$pca.eigenvalues # extract PCA eigenvalues
mvaR$pca$pca.snp_loadings # extract principal coefficients (SNP loadings)
mvaR$pca$pca.sample_coordinates # extract PCA principal components (sample position in PCA space)

# plot PCA
plot(mvaR$pca$pca.sample_coordinates[,c("PC1","PC2")], cex = 2,
     pch = 19, col = cols[my_groups], main = "genotype smartpca")
legend("topleft", legend = levels(my_groups), cex = 1,
       pch = 19, col = cols, text.col = cols)

# Extract PERMANOVA table
mvaR$test$permanova.global_test

# Extract PERMDISP table
mvaR$test$permdisp.global_test # extract PERMDISP table

# Extract sample summary and dispersion of individual samples used in PERMDISP
mvaR$test$test_samples

smartsnp documentation built on March 4, 2021, 5:06 p.m.