pcadapt: Principal Component Analysis for outlier detection

pcadapt

R Documentation

Principal Component Analysis for outlier detection

Description

pcadapt performs principal component analysis and computes p-values to test for outliers. The test for outliers is based on the correlations between genetic variation and the first K principal components. pcadapt also handles Pool-seq data, for which the statistical analysis is performed on the genetic marker frequencies. Returns an object of class pcadapt.

Usage

pcadapt(
  input,
  K = 2,
  method = "mahalanobis",
  min.maf = 0.05,
  ploidy = 2,
  LD.clumping = NULL,
  pca.only = FALSE,
  tol = 1e-04
)

## S3 method for class 'pcadapt_matrix'
pcadapt(
  input,
  K = 2,
  method = c("mahalanobis", "componentwise"),
  min.maf = 0.05,
  ploidy = 2,
  LD.clumping = NULL,
  pca.only = FALSE,
  tol = 1e-04
)

## S3 method for class 'pcadapt_bed'
pcadapt(
  input,
  K = 2,
  method = c("mahalanobis", "componentwise"),
  min.maf = 0.05,
  ploidy = 2,
  LD.clumping = NULL,
  pca.only = FALSE,
  tol = 1e-04
)

## S3 method for class 'pcadapt_pool'
pcadapt(
  input,
  K = (nrow(input) - 1),
  method = "mahalanobis",
  min.maf = 0.05,
  ploidy = NULL,
  LD.clumping = NULL,
  pca.only = FALSE,
  tol
)

Arguments

`input`	The output of function `read.pcadapt`.
`K`	an integer specifying the number of principal components to retain.
`method`	a character string specifying the method to be used to compute the p-values. Two statistics are currently available, `"mahalanobis"`, and `"componentwise"`.
`min.maf`	Minor allele frequency threshold: p-values are computed only for markers whose minor allele frequency is above it. Default is `0.05`.
`ploidy`	Number of trials, parameter of the binomial distribution. Default is 2, which corresponds to diploidy, such as for the human genome.
`LD.clumping`	Default is `NULL`, which disables SNP thinning. To enable SNP thinning, provide a named list with elements `size` and `thr`, which correspond, respectively, to the window radius and the squared correlation threshold. A good default value would be `list(size = 500, thr = 0.1)`.
`pca.only`	a logical value indicating whether PCA results should be returned (before computing any statistic).
`tol`	Convergence criterion of `RSpectra::svds()`. Default is `1e-4`.
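
As an illustration of the arguments above, the following sketch runs pcadapt on a genotype file. It assumes the pcadapt package is installed and that the example file `geno3pops.bed` is available in its `extdata` directory; the object names (`input`, `res`, `res_thin`, `res_pca`) are arbitrary.

```r
library(pcadapt)

# Example genotype file (assumed to ship with the package in extdata)
path_to_file <- system.file("extdata", "geno3pops.bed", package = "pcadapt")
input <- read.pcadapt(path_to_file, type = "bed")

# Default test (robust Mahalanobis distance) on the first 2 PCs
res <- pcadapt(input, K = 2)

# Same analysis with SNP thinning, using the values suggested for LD.clumping
res_thin <- pcadapt(input, K = 2, LD.clumping = list(size = 500, thr = 0.1))

# PCA only: the test statistics are not computed
res_pca <- pcadapt(input, K = 2, pca.only = TRUE)

# One p-value per genetic marker
head(res$pvalues)
```

Running once with `pca.only = TRUE` is a cheap way to inspect the principal components before committing to a value of `K`.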

Details

First, a principal component analysis is performed on the scaled and centered genotype data. Depending on the specified method, different test statistics can be used.

mahalanobis (default): the robust Mahalanobis distance is computed for each genetic marker, using robust estimates of the mean and covariance matrix of the K vectors of z-scores.

communality: the communality statistic measures the proportion of variance explained by the first K PCs. Deprecated in version 4.0.0.

componentwise: returns a matrix of z-scores.
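
The statistics described above can be sketched in base R on simulated data. This is a simplified illustration, not the package's internal code: it assumes that z-scores come from regressing each scaled marker on the K principal components, it scales markers by their sample standard deviation, and it uses `MASS::cov.rob()` with the MCD estimator as a stand-in robust estimator (the estimator pcadapt uses internally may differ). All variable names are hypothetical.

```r
library(MASS)  # for cov.rob(), a robust mean/covariance estimator

set.seed(1)
n <- 60; p <- 300; K <- 2; ploidy <- 2

# Simulated genotypes: n individuals x p markers, binomial with `ploidy` trials
freq <- runif(p, 0.1, 0.9)
G <- sapply(freq, function(f) rbinom(n, ploidy, f))

# Keep markers that vary, then centre and scale each one
G <- G[, apply(G, 2, sd) > 0, drop = FALSE]
Gs <- scale(G)

# PCA via SVD; columns of U hold the individuals' scores on the first K PCs
U <- svd(Gs, nu = K, nv = 0)$u

# Regress every marker on the K PCs. U has orthonormal columns, so the
# least-squares coefficients are simply t(U) %*% Gs
beta  <- crossprod(U, Gs)                      # K x (number of markers)
resid <- Gs - U %*% beta
sigma <- sqrt(colSums(resid^2) / (n - K))      # residual sd per marker
z     <- t(beta) / sigma                       # markers x K matrix of z-scores

# Robust Mahalanobis distance of each marker's z-score vector
rob  <- cov.rob(z, method = "mcd")
stat <- mahalanobis(z, center = rob$center, cov = rob$cov)
```

Here `z` plays the role of the componentwise output (one column per principal component) and `stat` the role of the Mahalanobis statistic (one value per marker).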

To compute p-values when method="mahalanobis", the test statistics (stat) are divided by a genomic inflation factor (gif); the scaled statistics (chi2_stat) should then follow a chi-squared distribution with K degrees of freedom. When using method="componentwise", the squared z-scores for each principal component should follow a chi-squared distribution with 1 degree of freedom. For Pool-seq data, pcadapt provides p-values based on the Mahalanobis distance for each SNP.
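
The p-value computation can be sketched in base R as follows. The sketch assumes the conventional definition of the genomic inflation factor (observed median of the statistics divided by the median of the reference chi-squared distribution); the statistics are simulated, and the variable names are hypothetical rather than taken from the package.

```r
set.seed(42)
K <- 2
p <- 5000

# Hypothetical raw statistics: chi-squared with K df, inflated by a factor 1.3
stat <- 1.3 * rchisq(p, df = K)

# Genomic inflation factor (assumed definition): observed median divided by
# the median of a chi-squared distribution with K degrees of freedom
gif <- median(stat) / qchisq(0.5, df = K)

# Rescaled statistics and upper-tail p-values with K degrees of freedom
chi2_stat <- stat / gif
pvalues   <- pchisq(chi2_stat, df = K, lower.tail = FALSE)

# Componentwise analogue: one test per PC on the squared z-scores (1 df each)
z          <- matrix(rnorm(p * K), ncol = K)
pvalues_cw <- pchisq(z^2, df = 1, lower.tail = FALSE)
```

Dividing by `gif` recalibrates the statistics so that their median matches the reference distribution, which keeps the p-values from being systematically too small when the raw statistics are inflated.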